Aca 3

The document discusses advanced computer architecture topics including instruction pipelining, pipeline hazards, and techniques for exploiting instruction-level parallelism. It covers pipeline stages and speedup, structural hazards, data hazards like forwarding and stalls, control hazards from branches, and limiting factors on parallelism from dependencies between instructions. It aims to explain how pipelining improves performance and the challenges of maintaining correct execution order in the presence of hazards.


Advanced Computer Architecture

CSD-411

Department of Computer Science and Engineering


National Institute of Technology Hamirpur
Hamirpur, Himachal Pradesh - 177005
ABOUT ME: DR. MOHAMMAD AHSAN

• PhD – National Institute of Technology Hamirpur (H.P.)
• M.Tech – National Institute of Technology Hamirpur (H.P.)
• Qualified UGC NET June-2015 and UGC NET Nov-2017 for Assistant Professor
• Qualified GATE 2012, GATE 2013 and GATE 2021.
• Experience: NIT Hamirpur and NIT Andhra Pradesh.
• Graphical representation of the instruction pipeline.


• All the pipeline stages take a single clock cycle.


• Performance evaluation of Pipeline Processor


• Consider a 4-segment pipeline with stage delays of 10 ns, 20 ns, 5 ns, and 15 ns. What is the approximate speedup when a very large number of instructions is pipelined?
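The question above can be worked through with the standard pipeline speedup model (clock period = slowest stage delay). The sketch below is illustrative; the function name and the choice of n are my own.

```python
# Speedup of a k-stage pipeline executing n instructions (standard model):
#   non-pipelined time = n * sum(stage delays)
#   pipelined time     = (k + n - 1) * max(stage delay)

def pipeline_speedup(delays, n):
    k = len(delays)
    t_seq = n * sum(delays)            # every instruction pays every stage
    t_pipe = (k + n - 1) * max(delays) # clock set by the slowest stage
    return t_seq / t_pipe

delays = [10, 20, 5, 15]  # ns, from the question
# For very large n the speedup approaches sum(delays)/max(delays) = 50/20:
print(pipeline_speedup(delays, 10**6))  # ~2.5
```

As n grows, the (k - 1) fill term vanishes and the speedup settles at 50 ns / 20 ns = 2.5.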
Pipeline Hazards:
• There are situations in pipelining when the next instruction cannot execute
in the following clock cycle. These events are called hazards.
• There are three types of hazards:
i. Structural Hazard
ii. Data Hazards
iii. Control Hazards
Structural Hazard:
• If the hardware cannot support the combination of instructions that we want to execute in the same clock cycle because of resource conflicts, the processor is said to have a structural hazard.
• A conflict arises when two instructions require the same hardware resource in the same clock cycle, so one of them cannot proceed.


Data Hazards
• Data hazards occur when the pipeline changes the order of read/write
accesses to operands so that the order differs from the order seen by
sequentially executing instructions on an unpipelined processor.
• Example – the pipelined execution of these instructions:
• DADD R1,R2,R3
• DSUB R4,R1,R5
• AND R6,R1,R7
• OR R8,R1,R9
• XOR R10,R1,R11
• To detect the data dependence condition, a table is maintained at the decode stage. The table contains the following fields:
Minimizing Data Hazard Stalls by Forwarding


• The data hazard problem can be solved with a simple hardware technique
called forwarding (also called bypassing and sometimes short-circuiting).
• The key insight in forwarding is that the result is not really needed by the
DSUB until after the DADD actually produces it. If the result can be
moved from the pipeline register where the DADD stores it to where the
DSUB needs it, then the need for a stall can be avoided.
• Forwarding works as follows:
• If the forwarding hardware detects that the previous ALU operation has written the
register corresponding to a source for the current ALU operation, control logic
selects the forwarded result as the ALU input rather than the value read from the
register file.
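The forwarding check described above can be sketched in a few lines. This is an illustrative model, not hardware: the tuple encoding `(opcode, dest, src1, src2)` and the function name are my own assumptions.

```python
# Minimal sketch of the forwarding control check: does the previous ALU
# operation write a register that the current ALU operation reads?
# Instruction encoding assumed: (opcode, dest, src1, src2).

def forward_sources(prev, curr):
    """Return which of curr's sources should take the forwarded ALU
    result of prev instead of the stale register-file value."""
    _, prev_dest, _, _ = prev
    _, _, src1, src2 = curr
    return {"src1": src1 == prev_dest, "src2": src2 == prev_dest}

dadd = ("DADD", "R1", "R2", "R3")
dsub = ("DSUB", "R4", "R1", "R5")
print(forward_sources(dadd, dsub))  # src1 of DSUB needs the forwarded R1
```

When a source matches, the control logic would select the pipeline-register result as the ALU input, exactly as the bullet above describes.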
• Operand Forwarding: A result is forwarded from the pipeline register
corresponding to the output of one unit to the input of another.
Data Hazards requiring Stalls


• Unfortunately, not all potential data hazards can be handled by bypassing.
• Consider the following sequence of instructions:
• LD R1,0(R2)
• DSUB R4,R1,R5
• AND R6,R1,R7
• OR R8,R1,R9
• The load instruction has a delay, or latency, that cannot be eliminated by forwarding alone. Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared.
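The classic load-use interlock condition can be sketched as a predicate: stall when the instruction in EX is a load whose destination matches a source of the instruction in ID. The encoding `(opcode, dest, src1, src2)` and the function name are illustrative assumptions.

```python
# Sketch of the load-use interlock: if the instruction currently in EX is a
# load whose destination register is a source of the instruction in ID,
# insert a one-cycle stall (a bubble) so forwarding can then succeed.

def needs_stall(ex_instr, id_instr):
    op, dest, _, _ = ex_instr
    _, _, src1, src2 = id_instr
    return op == "LD" and dest in (src1, src2)

ld = ("LD", "R1", "R2", None)       # LD R1,0(R2)
dsub = ("DSUB", "R4", "R1", "R5")   # reads R1 immediately after the load
print(needs_stall(ld, dsub))  # True: DSUB must wait one cycle for R1
```

The AND and OR instructions in the sequence above are far enough behind the load that forwarding alone suffices; only the immediately following consumer triggers the stall.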
Control Hazards
• Control hazards can cause a greater performance loss for our MIPS
pipeline than do data hazards. When a branch is executed, it may or may
not change the PC to something other than its current value plus 4.
• During the execution of branch instructions, control is transferred from one location to another.
• When the program is executed in non-overlapping order, a branch instruction causes no loss of functionality.
• When the program is executed in the pipeline, the branch operation causes some functionality to be lost.
• To restore correct behavior, the unwanted instructions must be flushed out of the pipeline. This creates stalls in the pipeline.
• Control Hazards: Unconditional branch


• Control Hazards: Conditional branch


• Processors use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP).
• There are two largely separable approaches to exploiting ILP:
1. an approach that relies on hardware to help discover and exploit the parallelism
dynamically, and
2. an approach that relies on software technology to find parallelism statically at
compile time.
• Features of both programs and processors limit the amount of parallelism that can be exploited among instructions. Also important is the mapping between program structure and hardware structure, which is key to understanding whether a program property will actually limit performance and under what circumstances.
• The value of the CPI (cycles per instruction) for a pipelined processor is
the sum of the base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
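The CPI equation above is a simple sum, evaluated here for made-up stall contributions (the numbers are illustrative assumptions, not measurements from the slides):

```python
# Pipeline CPI = Ideal pipeline CPI + Structural + Data hazard + Control stalls,
# with each stall term expressed in cycles per instruction.

def pipeline_cpi(ideal, structural, data, control):
    return ideal + structural + data + control

# Hypothetical machine: ideal CPI 1.0, plus small per-instruction stall rates.
print(pipeline_cpi(1.0, 0.05, 0.25, 0.15))  # 1.45 cycles per instruction
```

Reducing any stall term moves the pipeline back toward its ideal CPI, which is what the techniques in the rest of this section aim at.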
• The simplest and most common way to increase ILP is to exploit loop-level parallelism, i.e., parallelism among the iterations of a loop.
• Example: for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];
• Techniques for converting such loop-level parallelism into instruction-level parallelism:
• In a SIMD architecture where four data items are processed per instruction, the above code sequence might execute in one-quarter of the total instructions.
• On some vector processors, this sequence might take only four instructions: two
instructions to load the vectors x and y from memory, one instruction to add the
two vectors, and an instruction to store back the result vector.
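The loop-level parallelism in this example rests on the iterations being independent: each iteration touches only element i. A small Python check (illustrative only) makes that concrete by running the updates in the opposite order and getting the same result:

```python
# x[i] = x[i] + y[i] touches only element i, so the 1000 updates can run
# in any order (or in parallel) and still produce the same result.

x = list(range(1000))
y = [2 * i for i in range(1000)]

sequential = [xi + yi for xi, yi in zip(x, y)]

# Same computation with the iteration order reversed:
reordered = x[:]
for i in reversed(range(1000)):
    reordered[i] = reordered[i] + y[i]

print(reordered == sequential)  # True: iterations are independent
```

It is exactly this independence that a SIMD unit or vector processor exploits when it processes several elements per instruction.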
• To exploit instruction-level parallelism we must determine which
instructions can be executed in parallel.
• If two instructions are parallel, they can execute simultaneously in a pipeline
without causing any stalls, assuming the pipeline has sufficient resources (and
hence no structural hazards exist).
• If two instructions are dependent, they are not parallel and must be executed in
order.
• Types of dependences:
i. data dependences,
ii. name dependences, and
iii. control dependences
• An instruction j is data dependent on instruction i if either of the following holds:
• Instruction i produces a result that may be used by instruction j.
• Instruction j is data dependent on instruction k, and instruction k is data dependent
on instruction i.
• A data dependence conveys three things:
i. the possibility of a hazard,
ii. the order in which results must be calculated, and
iii. an upper bound on how much parallelism can possibly be exploited.
• A data dependence can limit the amount of instruction-level parallelism
that we can exploit.
• A dependence can be overcome in two different ways:
1) maintaining the dependence but avoiding a hazard, and
2) eliminating a dependence by transforming the code.
• A name dependence occurs when two instructions use the same register or
memory location, called a name, but there is no flow of data between the
instructions associated with that name.
• There are two types of name dependences between an instruction i that
precedes instruction j in program order:
1) An anti-dependence between instruction i and instruction j occurs when
instruction j writes a register or memory location that instruction i reads. The
original ordering must be preserved to ensure that i reads the correct value.
2) An output dependence occurs when instruction i and instruction j write the same
register or memory location. The ordering between the instructions must be
preserved to ensure that the value finally written corresponds to instruction j.
• Instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict.
• This renaming can be more easily done for register operands, where it is
called register renaming. Register renaming can be done either statically
by a compiler or dynamically by the hardware.
Data Hazards
• A hazard exists whenever there is a name or data dependence between
instructions, and they are close enough that the overlap during execution
would change the order of access to the operand involved in the
dependence.
• The goal of both our software and hardware techniques is to exploit
parallelism by preserving program order only where it affects the outcome
of the program.
• Detecting and avoiding hazards ensures that necessary program order is
preserved.
• Data hazards may be classified as RAW, WAW, and WAR, depending on the order of read and write accesses in the instructions. These hazards are named by the ordering in the program that must be preserved by the pipeline.
• RAW (read after write): j tries to read a source before i writes it, so j incorrectly
gets the old value (a true data dependence). Program order must be preserved to
ensure that j receives the value from i.
• WAW (write after write): j tries to write an operand before it is written by i (an
output dependence). WAW hazards are present only in pipelines that write in more
than one pipe stage or allow an instruction to proceed even when a previous
instruction is stalled.
• WAR (write after read): j tries to write a destination before it is read by i, so i
incorrectly gets the new value (an anti-dependence or name dependence).
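The three definitions above can be captured directly as set intersections between the registers that instruction i (earlier) and instruction j (later) read and write. The function and the register-set encoding are illustrative assumptions:

```python
# Classify the hazards between instruction i (earlier) and j (later),
# given each instruction's sets of read and written registers.
#   RAW: j reads what i writes (true data dependence)
#   WAW: j writes what i writes (output dependence)
#   WAR: j writes what i reads  (anti-dependence)

def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    hazards = set()
    if i_writes & j_reads:
        hazards.add("RAW")
    if i_writes & j_writes:
        hazards.add("WAW")
    if i_reads & j_writes:
        hazards.add("WAR")
    return hazards

# DADD R1,R2,R3 followed by DSUB R4,R1,R5: j reads what i writes.
print(classify_hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))  # {'RAW'}
```

Note that a WAR or WAW result here names a dependence that renaming can remove, whereas a RAW result names a true data dependence that ordering (or forwarding) must respect.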
• A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that instruction i is executed in correct program order and only when it should be.
• Example: if p1 {
S1;
};
if p2 {
S2;
}
• In general, two constraints are imposed by control dependences:
1) An instruction that is control dependent on a branch cannot be moved before the
branch.
2) An instruction that is not control dependent on a branch cannot be moved after the
branch.
• Control dependence is not the critical property that must be preserved.
• Properties critical to program correctness: exception behavior & data flow.
• The data flow is the actual flow of data values among instructions that
produce results and those that consume them. Branches make the data flow
dynamic, since they allow the source of data for a given instruction to
come from many points.

• Data dependence alone is insufficient to preserve correctness. When the instructions execute, the data flow must be preserved.
• Sometimes we can determine that violating the control dependence cannot affect either the exception behavior or the data flow.

• If the branch is taken, the DSUBU instruction will execute and will be
useless, but it will not affect the program results.
• This type of code scheduling is also a form of speculation, often called
software speculation, since the compiler is betting on the branch outcome.
Basic Compiler Techniques for Exposing ILP
• These techniques are crucial for processors that use static issue or static
scheduling.
• Basic Pipeline Scheduling and Loop Unrolling
• To keep a pipeline full, parallelism among instructions must be exploited by
finding sequences of unrelated instructions that can be overlapped in the pipeline.
• To avoid a pipeline stall, the execution of a dependent instruction must be
separated from the source instruction by a distance in clock cycles equal to the
pipeline latency of that source instruction.
• A compiler’s ability to perform this scheduling depends both on the amount of ILP
available in the program and on the latencies of the functional units in the pipeline.
How can the compiler increase the amount of available ILP by transforming loops?
• Example:
• Without any scheduling, the loop will execute as follows:

• With scheduling:
• Loop unrolling – replicate the loop body multiple times and adjust the
loop termination code.
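The transformation can be sketched on the x[i] = x[i] + y[i] loop from earlier. The sketch below unrolls by a factor of four (the factor, function name, and data are my own choices for illustration):

```python
# Loop unrolling sketch: the body is replicated four times and the loop
# count adjusted, trading code size for fewer branch/overhead instructions.
# 1000 is divisible by 4, so no cleanup loop is needed here.

def saxpy_unrolled(x, y):
    for i in range(0, 1000, 4):        # one branch per four elements
        x[i]     = x[i]     + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
    return x

x = [1.0] * 1000
y = [2.0] * 1000
print(saxpy_unrolled(x, y)[0])  # 3.0
```

The four statements in the body are independent of one another, which is what gives a compiler room to schedule them and hide functional-unit latencies.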

• The execution time of the unrolled loop has dropped to 3.5 clock cycles
per element, compared with 9 cycles per element before any unrolling or
scheduling and 7 cycles when scheduled but not unrolled.
• To obtain the final unrolled code we had to make the following decisions
and transformations:
• Determine that unrolling the loop would be useful by finding that the loop
iterations were independent, except for the loop maintenance code.
• Use different registers to avoid unnecessary constraints that would be forced by
using the same registers for different computations (e.g., name dependences).
• Eliminate the extra test and branch instructions and adjust the loop termination and
iteration code.
• Determine that the loads and stores in the unrolled loop can be interchanged by
observing that the loads and stores from different iterations are independent.
• Schedule the code, preserving any dependences needed to yield the same result as
the original code.
• Three different effects limit the gains from loop unrolling:
1) a decrease in the amount of overhead amortized with each unroll,
2) code size limitations, and
3) compiler limitations.
Reducing Branch Costs with Advanced Branch Prediction


• Loop unrolling is one way to reduce the number of branch hazards.
• How can the performance losses of branches be reduced? With the help of a branch predictor.
• A branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively.
• Purpose of a branch predictor – improving flow in the instruction pipeline.
• Two-way branching is usually implemented with a conditional jump instruction.

• Without branch prediction, the processor would have to wait until the
conditional jump instruction has passed the execute stage before the next
instruction can enter the fetch stage in the pipeline.
• The branch predictor attempts to avoid this waste of time by trying to
guess whether the conditional jump is most likely to be taken or not taken.
• The branch that is guessed to be the most likely taken is then fetched and
speculatively executed.
• The branch predictor keeps records of whether branches are taken or not. When it encounters a conditional jump that has been seen several times before, it bases its prediction on this history.
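One common realization of this history-based prediction is a 2-bit saturating counter. The sketch below is illustrative (the class name and state encoding are my own), not a description of any specific processor: states 0-1 predict not taken, states 2-3 predict taken, and a single wrong outcome cannot flip a strongly held prediction.

```python
# A 2-bit saturating-counter branch predictor: the counter moves up on a
# taken branch and down on a not-taken branch, saturating at 0 and 3.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0 = strongly not taken ... 3 = strongly taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, True, False, True]:  # a loop-like branch history
    p.update(outcome)
print(p.predict())  # True: one not-taken outcome did not flip the prediction
```

This hysteresis is why 2-bit counters work well for loop branches: the single not-taken outcome at loop exit causes only one misprediction when the loop is re-entered.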
• Static branch prediction: all decisions are made at compile time, before the execution of the program. A common policy always predicts that a conditional jump will not be taken, so the next sequential instruction is always fetched.
• Dynamic branch prediction: It uses information about taken or not taken
branches gathered at run-time to predict the outcome of a branch.
Overcoming Data Hazards with Dynamic Scheduling


• Dynamic Scheduling:
• Limitation of simple pipelining techniques - Instructions are issued in program
order, and if an instruction is stalled in the pipeline no later instructions can
proceed.

• The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet, SUB.D is not data dependent on anything in the pipeline.
• To allow out-of-order execution, we essentially split the ID pipe stage of our simple five-stage pipeline into two stages:
1) Issue—Decode instructions, check for structural hazards.
2) Read operands—Wait until no data hazards, then read operands.
• In a dynamically scheduled pipeline, all instructions pass through the issue
stage in order (in-order issue); however, they can be stalled or bypass each
other in the second stage (read operands) and thus enter execution out of
order.
Dynamic Scheduling Using Tomasulo’s Approach
• This scheme, invented by Robert Tomasulo, tracks when operands for
instructions are available to minimize RAW hazards and introduces
register renaming in hardware to minimize WAW and WAR hazards.
• Register renaming:
How does renaming occur?
• In Tomasulo’s scheme, register renaming is provided by reservation
stations, which buffer the operands of instructions waiting to issue.
• The information held in the reservation stations at each functional unit
determines when an instruction can begin execution at that unit.
• Results are passed directly to functional units from the reservation stations
where they are buffered, rather than going through the registers.
Instruction status:
• Issue –
• Get the next instruction from the head of the instruction queue (maintained in FIFO
order).
• If there is a matching reservation station that is empty, issue the instruction to the
station with the operand values, if they are currently in the registers.
• If there is not an empty reservation station, then there is a structural hazard and the
instruction stalls until a station or buffer is freed.
• If the operands are not in the registers, keep track of the functional units that will
produce the operands. This step renames registers, eliminating WAR and WAW
hazards.
Instruction status:
• Execute –
• If one or more of the operands is not yet available, monitor the common data bus
while waiting for it to be computed.
• When an operand becomes available, it is placed into any reservation station
awaiting it.
• When all the operands are available, the operation can be executed at the
corresponding functional unit.
• By delaying instruction execution until the operands are available, RAW
hazards are avoided.
Instruction status:
• Write result –
• When the result is available, write it on the CDB and from there into the registers
and into any reservation stations (including store buffers) waiting for this result.
• Stores are buffered in the store buffer until both the value to be stored and the store
address are available, then the result is written as soon as the memory unit is free.
• The combination of the common result bus and the retrieval of results from
the bus by the reservation stations implements the forwarding and
bypassing mechanisms used in a statically scheduled pipeline.
• However, a dynamically scheduled scheme introduces one cycle of latency
between source and result, since the matching of a result and its use cannot
be done until the Write Result stage.
Each reservation station has seven fields:
• Op—The operation to perform on source operands S1 and S2.
• Qj, Qk—The reservation stations that will produce the corresponding source
operand; a value of zero indicates that the source operand is already available
in Vj or Vk, or is unnecessary.
• Vj, Vk—The value of the source operands. Note that only one of the V fields
or the Q field is valid for each operand. For loads, the Vk field is used to hold
the offset field.
• A—Used to hold information for the memory address calculation for a load
or store. Initially, the immediate field of the instruction is stored here; after
the address calculation, the effective address is stored here.
• Busy—Indicates that this reservation station and its accompanying functional
unit are occupied.
Register file:
• Qi—The number of the reservation station that contains the operation
whose result should be stored into this register. If the value of Qi is blank
(or 0), no currently active instruction is computing a result destined for this
register, meaning that the value is simply the register contents.
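The reservation-station fields listed above can be collected into a plain record. The Python dataclass below is an illustrative sketch: the field meanings follow the slides, but the class itself and its `ready` helper are my own assumptions, not Tomasulo's hardware.

```python
# A reservation station as a record. Qj/Qk = 0 means the corresponding
# operand value is already in Vj/Vk (or is unnecessary).

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    op: Optional[str] = None    # operation to perform on S1, S2
    qj: int = 0                 # station that will produce S1 (0 = ready)
    qk: int = 0                 # station that will produce S2 (0 = ready)
    vj: Optional[float] = None  # value of source operand S1
    vk: Optional[float] = None  # value of S2 (holds the offset for loads)
    a: Optional[int] = None     # address info for loads/stores
    busy: bool = False          # station and functional unit occupied

    def ready(self):
        # Execution can begin only when both operands are available.
        return self.busy and self.qj == 0 and self.qk == 0

rs = ReservationStation(op="ADD.D", qj=0, qk=2, vj=1.5, busy=True)
print(rs.ready())  # False: still waiting on station 2 for the second operand
```

When station 2 broadcasts its result on the CDB, this station would capture the value into vk, clear qk to 0, and become ready, which is precisely the Execute condition described above.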
Exploiting ILP Using Multiple Issue and Static Scheduling


• To allow multiple instructions to issue in a clock cycle.
• Multiple-issue processors come in three major flavors:
1) Statically scheduled superscalar processors
2) Dynamically scheduled superscalar processors
3) VLIW (very long instruction word) processors
VLIW Approach
• VLIWs use multiple, independent functional units.
• Rather than attempting to issue multiple, independent instructions to the
units, a VLIW packages the multiple operations into one very long
instruction.
• A VLIW processor with instructions that contain five operations, including
one integer operation (which could also be a branch), two floating-point
operations, and two memory references.
Advanced Techniques for Instruction Delivery and Speculation


• Aim: to deliver a high-bandwidth instruction stream. In recent multiple-
issue processors, this has meant delivering 4 to 8 instructions every clock
cycle.
• Increasing Instruction Fetch Bandwidth
• A multiple-issue processor requires that the average number of instructions fetched
every clock cycle be at least as large as the average throughput.
Branch-Target Buffers
• A cache that stores the predicted address for the next instruction.
• Branch penalty – to reduce the branch penalty of a pipeline, we must
know whether the as-yet-undecoded instruction is a branch and, if so, what
the next program counter (PC) should be.
• If the PC of the fetched instruction matches an address in the prediction
buffer, then the corresponding predicted PC is used as the next PC.
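The lookup described above can be sketched as a simple table keyed by the fetch PC. The structure is assumed for illustration (real branch-target buffers are set-associative hardware caches, not dictionaries), and the addresses are made up:

```python
# A branch-target buffer sketched as a lookup table: the fetch PC is the
# key, the predicted next PC is the value.

btb = {}

def fetch_next_pc(pc):
    # Hit: use the predicted target. Miss: fall through to PC + 4.
    return btb.get(pc, pc + 4)

btb[0x40] = 0x100                # a taken branch at 0x40 was recorded earlier
print(hex(fetch_next_pc(0x40)))  # 0x100 (predicted target)
print(hex(fetch_next_pc(0x44)))  # 0x48  (sequential fetch)
```

Because the lookup happens during fetch, a hit lets the pipeline redirect to the target before the branch is even decoded, which is the whole point of the buffer.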
• Penalties for all possible combinations – assuming only taken branches are stored in the buffer.
Integrated Instruction Fetch Units
• To meet the demands of multiple-issue processors, an integrated instruction
fetch unit is implemented by the designers as a separate autonomous unit that
feeds instructions to the rest of the pipeline.
• This unit integrates these functions:
• Integrated branch prediction – to drive the fetch pipeline, the branch predictor
constantly predicts branches.
• Instruction prefetch – to deliver multiple instructions per clock, the unit
autonomously manages the prefetching of instructions.
• Instruction memory access and buffering – when fetching multiple instructions
per cycle a variety of complexities are encountered.
• Complexity – fetching multiple instructions may require accessing multiple cache
lines.
• Instruction fetch unit encapsulates this complexity using prefetch.
• The instruction fetch unit also provides buffering – acting as an on-demand unit that provides instructions to the issue stage as needed and in the quantity needed.
Speculation
• Advantage:
• the ability to uncover early the events that can stall the pipeline, such as cache misses.
• Disadvantage:
• It takes time and energy, and the recovery of incorrect speculation further reduces
performance.
• To support the higher instruction execution rate needed to benefit from speculation,
the processor must have additional resources, which take silicon area and power.
Speculation and the Challenge of Energy Efficiency


• Wrong speculation consumes excess energy in two ways:
• The instructions that were speculated and whose results were not needed generated
excess work for the processor, wasting energy.
• Undoing the speculation and restoring the state of the processor to continue
execution at the appropriate address consumes additional energy that would not be
needed without speculation.
• If speculation lowers the execution time by more than it increases the
average power consumption, then the total energy consumed may be less.
• Based on the fraction of instructions executed as a result of misspeculation, designers can avoid speculation or consider new approaches, such as speculating only on highly predictable branches.
Value Prediction
• It is a technique for increasing the amount of ILP available in a program.
• It attempts to predict the value that will be produced by an instruction.
• Limited success – most instructions likely produce a different value every
time they are executed.
• It is useful when an instruction produces a value chosen from a small set of
potential values (predict resulting value by correlating it with other
program behavior) or loads a value that changes infrequently.
• The results of value prediction have not been sufficiently attractive to justify its incorporation in real processors.
Multicores, Multiprocessors, and Clusters


• Multiprocessor: A computer system with at least two processors.
• Improved availability: if a single processor fails in a multiprocessor with n processors, the remaining n-1 processors continue to provide the service.
• Job-level parallelism: utilizing multiple processors by running
independent programs simultaneously.
• Parallel processing program: a single program that runs on multiple
processors simultaneously.
• Multicore microprocessor: a microprocessor containing multiple
processors (“cores”) in a single integrated circuit.
• Cluster: a set of computers connected over a local area network (LAN) that functions as a single large multiprocessor.
• Why is it difficult to write parallel processing programs that are fast, especially as the number of processors increases?
• The challenges include scheduling, load balancing, time for synchronization, and overhead for communication between the parties.
• It is analogous to n reporters trying to write a single story in hopes of doing the work n times faster.
• To succeed, the task must be broken into n equal-sized pieces.
• Performance danger – when reporters spend too much time communicating with
each other instead of writing their pieces of the story.
Speed-up Challenge: Bigger Problem
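The worked example from this slide is not reproduced in the text, but the bigger-problem speed-up argument is usually made with Amdahl's Law. The numbers below are illustrative assumptions, not the slide's own figures:

```python
# Amdahl's Law: the serial fraction of a program bounds the speedup,
# no matter how many processors work on the parallel fraction.

def speedup(parallel_fraction, processors):
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)

# With 99% of the work parallelizable, 100 processors give about 50x,
# not 100x -- the serial 1% dominates as the processor count grows.
print(speedup(0.99, 100))  # about 50.25
```

Making the problem bigger helps precisely because it usually grows the parallel fraction while the serial part stays roughly fixed, pushing the achievable speedup closer to the processor count.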
Speed-up Challenge: Balancing Load

Shared Memory Multiprocessors


• All processors share a single physical address space.
• All variables of a program can be made available at any time to any processor.
• All processors are capable of accessing any memory location via loads and stores.
• Two types:
1. Uniform memory access (UMA) multiprocessors – all main memory accesses take the same amount of time; it does not matter which processor requests a memory access or which word is requested.
2. Nonuniform memory access (NUMA) multiprocessors – some memory accesses are faster than others, depending on which processor asks for which word.
• Processors need to coordinate when operating on shared data.
• Mechanism for synchronization – the lock.
• Only one processor at a time can acquire the lock, and other processors
interested in shared data have to wait until the original processor unlocks
the variable.
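The lock discipline above can be sketched with Python threads standing in for processors (an illustration of the mechanism, not a hardware model; the counter and thread counts are my own choices):

```python
# Only one thread at a time holds the lock while updating the shared
# counter; the others wait at the acquire, exactly as described above.

import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:        # acquire; other threads wait here
            counter += 1  # the protected shared-data update

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: no updates are lost
```

Without the lock, the read-modify-write of `counter` could interleave between threads and silently drop updates, which is the coordination problem the lock exists to prevent.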
Message-Passing Multiprocessors
• Each processor has private physical address space.
• Coordination is built with message passing (i.e. explicitly sending and
receiving information).
• Message-passing networks offer better communication performance than clusters built using local area networks.
• The problem with message-passing networks is their cost: they are much more expensive, and only a few applications can justify the higher communication performance, given the much higher costs.
Cluster
• An example of a message-passing parallel computer.
• Collections of computers that are connected to each other over their I/O
interconnect via standard network switches and cables.
• Each runs a distinct copy of the operating system.

Drawbacks of clusters
• The cost of administering a cluster of n machines is about the same as the
cost of administering n independent machines, while the cost of
administering a shared memory multiprocessor with n processors is about
the same as administering a single machine.
• Processors in a cluster are usually connected using the I/O interconnect of
each computer, whereas the cores in a multiprocessor are usually
connected on the memory interconnect of the computer.
• A cluster of n machines has n independent memories and n copies of the operating system, but a shared memory multiprocessor allows a single program to use almost all the memory in the computer, and it only needs a single copy of the operating system.
• Example: Memory Efficiency
• Suppose a single shared memory processor has 20 GB of main memory,
five clustered computers each have 4 GB, and the OS occupies 1 GB. How
much more space is there for users with shared memory?
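A sketch of the arithmetic (the variable names are mine; the figures come from the question):

```python
# User-available memory = total memory minus the space taken by OS copies.
# The shared memory machine runs one OS; the cluster runs one per node.

shared_total, cluster_nodes, node_mem, os_mem = 20, 5, 4, 1  # all in GB

shared_user = shared_total - os_mem                  # 20 - 1
cluster_user = cluster_nodes * (node_mem - os_mem)   # 5 * (4 - 1)

print(shared_user, cluster_user)   # 19 GB vs. 15 GB
print(shared_user - cluster_user)  # shared memory offers 4 GB more for users
```

The gap comes entirely from the cluster paying the 1 GB operating-system cost five times instead of once.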

Advantages of clusters:
• Since a cluster consists of independent computers connected through a local area network, it is easier to replace a machine in a cluster than in a shared memory multiprocessor.
• Since clusters are constructed from whole computers and independent, scalable networks, this isolation also makes it easier to expand the system without bringing down the application that runs on top of the cluster.
• Lower cost, high availability, fault tolerance, and rapid incremental
expandability make clusters attractive to service providers for the World
Wide Web.
Advanced Computer Architecture

Multiprocessor Network Topologies
• Multicore chips require networks on chips to connect the cores together.
• Network cost includes the number of switches, the number of links on a switch to connect to the network, and the length of the links.
• Network performance includes the latency on an unloaded network to send and receive a message, the throughput, delays, and variable performance depending on the pattern of communication.
Ring Topology

• In the ring network, with P processors, the total network bandwidth would
be P times the bandwidth of one link.
Fully Connected Network
• Every processor has a bidirectional link to every other processor.
• With P processors there are P*(P-1)/2 links, so the total network bandwidth is P*(P-1)/2 times the bandwidth of one link.
• The tremendous improvement in performance of fully connected networks is offset by the tremendous increase in cost.
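The link counts behind these bandwidth figures can be sketched as follows (bandwidth measured in units of a single link's bandwidth, following the slide's definitions):

```python
# Total network bandwidth for the two topologies above.

def ring_bandwidth(p: int) -> int:
    """A ring of P processors has P links, so total bandwidth is P."""
    return p

def fully_connected_bandwidth(p: int) -> int:
    """A fully connected network has P*(P-1)/2 bidirectional links."""
    return p * (p - 1) // 2

for p in (4, 8, 64):
    print(p, ring_bandwidth(p), fully_connected_bandwidth(p))
# For P = 64: ring = 64 links vs. fully connected = 2016 links --
# the performance gain comes at a steep cost in links.
```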
Multistage Network
• A network that supplies a small switch at each node.
• Switches are smaller than processor-memory-switch nodes, and thus may be packed more densely, shortening distances and increasing performance.
• Two popular multistage organizations:
1. Crossbar network
2. Omega network
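As a rough sizing sketch, using the counts commonly given in textbook treatments (a crossbar for n nodes needs n^2 small switches, while an omega network needs (n/2)*log2(n) switch boxes, each a 2x2 crossbar):

```python
import math

# Switch counts for the two multistage organizations named above.

def crossbar_switches(n: int) -> int:
    """A crossbar connecting n nodes uses n * n small switches."""
    return n * n

def omega_switches(n: int) -> int:
    """An omega network uses (n/2) * log2(n) 2x2 switch boxes."""
    return (n // 2) * int(math.log2(n))

print(crossbar_switches(8), omega_switches(8))   # 64 vs 12 for 8 nodes
```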
Multiprocessing and Multithreading
• Initial computer performance improvements came from the use of:
• Innovative manufacturing techniques
• Advancement of VLSI technology
• In later years:
• Most improvements came from the exploitation of ILP.
• Both software and hardware techniques are used:
• Pipelining, dynamic instruction scheduling, out-of-order execution, VLIW, vector processing, etc.
• ILP now appears fully exploited:
• Modern multiple-issue processors have become incredibly complex, and processor performance improvement through increasing complexity, silicon, and power seems to be diminishing.
Thread and Process-level Parallelism
• The way to achieve higher performance:
• Of late, the focus is on exploiting thread- and process-level parallelism.
• Exploit parallelism existing across multiple processes or threads:
• This parallelism cannot be exploited by any ILP processor.
• Consider a banking application:
• Individual transactions can be executed in parallel.
Process versus Threads
• Processes:
• A process is a program in execution.
• An application normally consists of multiple processes.
• Threads:
• A process consists of one or more threads.
• Threads belonging to the same process share data, code, and files.
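A minimal sketch (a hypothetical example, not from the slides) showing that threads of one process share data, which is what makes them cheaper to coordinate than separate processes:

```python
import threading

counter = 0
lock = threading.Lock()

def work():
    """Each thread increments the same shared variable 10,000 times."""
    global counter
    for _ in range(10_000):
        with lock:              # threads share `counter`, so guard updates
            counter += 1

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: all four threads updated one shared variable
```

Separate processes would each get their own copy of `counter`; sharing it would require explicit message passing or shared-memory segments.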
Single and Multithreaded Processes
[Figure: single-threaded vs. multithreaded process models, not reproduced]
User Threads
• Thread management is done in user space.
• User threads are supported and managed without kernel support.
• Invisible to the kernel.
• If one thread blocks, the entire process blocks.
• Limited benefits of threading.
Kernel Threads
• Kernel threads are supported and managed directly by the OS.
• The kernel creates Light Weight Processes (LWPs).
• Most operating systems support kernel threads: Linux, Mac OS, Solaris, Windows.
Benefits of Threading
• Responsiveness:
• Threads share code and data.
• Thread creation and switching are therefore much more efficient than for processes.
• As an example, in Solaris:
• Creating a thread is about 30 times less costly than creating a process.
• Context switching between threads is about 5 times faster than between processes.
• Truly concurrent execution:
• Possible with processors supporting concurrent execution of threads: SMP, multicore, SMT, etc.
A Case for Processor Support for Thread-level Parallelism
• Using pure ILP, execution unit utilization is only about 20%-25%:
• Utilization is limited by control dependences, cache misses during memory access, etc.
• It is rare for units to be even reasonably busy on average.
• In pure ILP:
• At any time only one thread is under execution.
• Utilization of execution units can be improved:
• Have several threads under execution:
• Called active threads in the Pentium III.
• Execute several threads at the same time:
• SMP, SMT, and multicore processors.
Thread Examples
• Independent threads occur naturally in several applications:
• Web Server: different http requests are threads.
• File Server
• Banking: independent transactions
• Desktop applications: file loading, display, computations, etc. can be
threads.
Using ILP Support to Exploit Thread-level Parallelism
• Possible processor configurations:
• A superscalar with no multithreading support.
• A superscalar with coarse-grained multithreading:
• Switches between threads only after significant events, such as a cache miss.
• A superscalar with fine-grained multithreading:
• Switches between threads after every instruction.
• A superscalar with simultaneous multithreading (SMT):
• Lowers the cost of multithreading by utilizing the resources needed for a multiple-issue, dynamically scheduled microarchitecture.
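The coarse- and fine-grained switching policies can be illustrated with a toy interleaving model (purely illustrative, not a cycle-accurate simulation). Each thread is a list of "instructions", and 'M' marks an instruction that misses in the cache:

```python
def coarse_grained(threads):
    """Switch to the next thread only on a significant event (cache miss)."""
    order, t = [], 0
    pcs = [0] * len(threads)
    while any(pc < len(th) for pc, th in zip(pcs, threads)):
        if pcs[t] >= len(threads[t]):      # current thread finished
            t = (t + 1) % len(threads)
            continue
        instr = threads[t][pcs[t]]
        order.append((t, instr))
        pcs[t] += 1
        if instr == 'M':                   # cache miss -> switch threads
            t = (t + 1) % len(threads)
    return order

def fine_grained(threads):
    """Switch to the next ready thread after every instruction."""
    order, t = [], 0
    pcs = [0] * len(threads)
    while any(pc < len(th) for pc, th in zip(pcs, threads)):
        if pcs[t] < len(threads[t]):
            order.append((t, threads[t][pcs[t]]))
            pcs[t] += 1
        t = (t + 1) % len(threads)
    return order

threads = [['a', 'M', 'b'], ['x', 'y', 'z']]
print(coarse_grained(threads))   # thread 0 runs until its miss, then 1 runs
print(fine_grained(threads))     # strict round-robin, one instruction each
```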

Thread-level Parallelism: Cons
• Threads have to be identified by the programmer:
• No rules exist as to what can be a meaningful thread.
• Threads cannot be identified by any automatic static or dynamic analysis of code.
• Burden on the programmer: requires careful thinking and programming.
• Threads with severe dependences:
• May make multithreading an exercise in futility (not worthwhile).
• Also not as “programmer friendly” as ILP.