ACA 3
CSD-411: Advanced Computer Architecture
• Consider a 4-segment pipeline with respective stage delays of 10 ns, 20 ns,
5 ns, and 15 ns. What is the approximate speedup when a very large number
of instructions is pipelined?
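• A worked answer (assuming the pipeline clock is set by the slowest stage and ignoring latch overhead): a non-pipelined instruction takes 10 + 20 + 5 + 15 = 50 ns, while the pipeline clock must be max(10, 20, 5, 15) = 20 ns. For a very large number of instructions,

    Speedup \approx \frac{\sum_i t_i}{\max_i t_i} = \frac{50\,\text{ns}}{20\,\text{ns}} = 2.5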
Pipeline Hazards:
• There are situations in pipelining when the next instruction cannot execute
in the following clock cycle. These events are called hazards.
• There are three types of hazards:
i. Structural Hazard
ii. Data Hazards
iii. Control Hazards
Structural Hazard:
• If the hardware cannot support the combination of instructions that we
want to execute in the same clock cycle because of resource conflicts, the
processor is said to have a structural hazard.
Data Hazards
• Data hazards occur when the pipeline changes the order of read/write
accesses to operands so that the order differs from the order seen by
sequentially executing instructions on an unpipelined processor.
• Example – the pipelined execution of these instructions:
• DADD R1,R2,R3
• DSUB R4,R1,R5
• AND R6,R1,R7
• OR R8,R1,R9
• XOR R10,R1,R11
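• As a worked illustration (assuming the classic 5-stage MIPS pipeline, where DADD writes R1 back in cycle 5): DSUB and AND would read R1 before DADD has written it, and whether OR sees the new value depends on whether the register file is written in the first half of the cycle and read in the second. Without stalling or forwarding, these instructions would use a stale value of R1, which is exactly the data hazard the pipeline must resolve.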
Control Hazards
• Control hazards can cause a greater performance loss for our MIPS
pipeline than do data hazards. When a branch is executed, it may or may
not change the PC to something other than its current value plus 4.
• During the execution of a branch instruction, control is transferred from one
location to another.
• When the program executes in non-overlapping (sequential) order, a branch
causes no problem.
• When the program executes in a pipeline, the instructions fetched after the branch
may turn out to be the wrong ones.
• To make execution correct, these unwanted instructions must be flushed from the
pipeline, which creates stalls.
• The simplest and most common way to increase ILP is to exploit loop-level
parallelism.
• Example: for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];
• Techniques for converting such loop-level parallelism into instruction-
level parallelism:
• On a SIMD architecture that processes four data items per instruction, the code
sequence above might execute in one-quarter of the total instructions.
• On some vector processors, this sequence might take only four instructions: two
instructions to load the vectors x and y from memory, one instruction to add the
two vectors, and an instruction to store back the result vector.
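• A C sketch of the idea (hypothetical code, not taken from the slides): because the iterations are independent, the loop can be rewritten to expose four independent additions per pass, which is exactly the form a 4-wide SIMD unit or an unrolling compiler can exploit.

    #include <stddef.h>

    #define N 1000

    /* Original loop: each iteration is independent of the others. */
    void add_scalar(double *x, const double *y)
    {
        for (size_t i = 0; i < N; i++)
            x[i] = x[i] + y[i];
    }

    /* Unrolled by 4: four independent additions per pass give a 4-wide
       SIMD unit (or the scheduler) parallel work with one-quarter of
       the loop overhead.  N is assumed to be a multiple of 4 here. */
    void add_unrolled(double *x, const double *y)
    {
        for (size_t i = 0; i < N; i += 4) {
            x[i]     = x[i]     + y[i];
            x[i + 1] = x[i + 1] + y[i + 1];
            x[i + 2] = x[i + 2] + y[i + 2];
            x[i + 3] = x[i + 3] + y[i + 3];
        }
    }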
• To exploit instruction-level parallelism we must determine which
instructions can be executed in parallel.
• If two instructions are parallel, they can execute simultaneously in a pipeline
without causing any stalls, assuming the pipeline has sufficient resources (and
hence no structural hazards exist).
• If two instructions are dependent, they are not parallel and must be executed in
order.
• Types of dependences:
i. data dependences,
ii. name dependences, and
iii. control dependences
• A name dependence occurs when two instructions use the same register or
memory location, called a name, but there is no flow of data between the
instructions associated with that name.
• There are two types of name dependences between an instruction i that
precedes instruction j in program order:
1) An anti-dependence between instruction i and instruction j occurs when
instruction j writes a register or memory location that instruction i reads. The
original ordering must be preserved to ensure that i reads the correct value.
2) An output dependence occurs when instruction i and instruction j write the same
register or memory location. The ordering between the instructions must be
preserved to ensure that the value finally written corresponds to instruction j.
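• A small C sketch of the distinction (the variables are made up for illustration): the variable t is a name reused by two writes with no data flowing between them.

    void name_dependence_example(int a, int b, int d, int *out1, int *out2)
    {
        int t;

        t = a + b;        /* instruction i: writes t                      */
        *out1 = t * 2;    /* true (data) dependence: reads the t from i   */
        t = d - a;        /* instruction j: writes t again                */
        *out2 = t + 1;

        /* i -> j is an output (WAW) dependence on the name t, and the
           read in *out1 = t * 2 followed by j's write is an anti (WAR)
           dependence.  Renaming j's result (e.g., into a new variable
           t2) removes both without changing what the program computes. */
    }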
Data Hazards
• A hazard exists whenever there is a name or data dependence between
instructions, and they are close enough that the overlap during execution
would change the order of access to the operand involved in the
dependence.
• The goal of both our software and hardware techniques is to exploit
parallelism by preserving program order only where it affects the outcome
of the program.
• Detecting and avoiding hazards ensures that necessary program order is
preserved.
• The data flow is the actual flow of data values among instructions that
produce results and those that consume them. Branches make the data flow
dynamic, since they allow the source of data for a given instruction to
come from many points.
• If the branch is taken, the DSUBU instruction will execute and will be
useless, but it will not affect the program results.
• This type of code scheduling is also a form of speculation, often called
software speculation, since the compiler is betting on the branch outcome.
Basic Compiler Techniques for Exposing ILP
• These techniques are crucial for processors that use static issue or static
scheduling.
• Basic Pipeline Scheduling and Loop Unrolling
• To keep a pipeline full, parallelism among instructions must be exploited by
finding sequences of unrelated instructions that can be overlapped in the pipeline.
• To avoid a pipeline stall, the execution of a dependent instruction must be
separated from the source instruction by a distance in clock cycles equal to the
pipeline latency of that source instruction.
• A compiler’s ability to perform this scheduling depends both on the amount of ILP
available in the program and on the latencies of the functional units in the pipeline.
How can the compiler increase the amount of available ILP by transforming
loops?
• Example: (original loop code shown as a figure)
• With scheduling: (scheduled version shown as a figure)
• Loop unrolling – replicate the loop body multiple times and adjust the
loop termination code.
• The execution time of the unrolled loop has dropped to 3.5 clock cycles
per element, compared with 9 cycles per element before any unrolling or
scheduling and 7 cycles when scheduled but not unrolled.
• To obtain the final unrolled code we had to make the following decisions
and transformations:
• Determine that unrolling the loop would be useful by finding that the loop
iterations were independent, except for the loop maintenance code.
• Use different registers to avoid unnecessary constraints that would be forced by
using the same registers for different computations (e.g., name dependences).
• Eliminate the extra test and branch instructions and adjust the loop termination and
iteration code.
• Determine that the loads and stores in the unrolled loop can be interchanged by
observing that the loads and stores from different iterations are independent.
• Schedule the code, preserving any dependences needed to yield the same result as
the original code.
• Without branch prediction, the processor would have to wait until the
conditional jump instruction has passed the execute stage before the next
instruction can enter the fetch stage in the pipeline.
• The branch predictor attempts to avoid this waste of time by trying to
guess whether the conditional jump is most likely to be taken or not taken.
• The path that is guessed to be the most likely is then fetched and
speculatively executed.
• The branch predictor keeps a record of whether branches are taken or not
taken. When it encounters a conditional jump that has been seen several times
before, it bases its prediction on this history.
• Static branch prediction: all decisions are made at compile time, before
the program executes. It always predicts that a conditional jump will not be
taken, so the next sequential instruction is always fetched.
• Dynamic branch prediction: It uses information about taken or not taken
branches gathered at run-time to predict the outcome of a branch.
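• A minimal sketch of one common dynamic scheme in C: a table of 2-bit saturating counters indexed by the low bits of the PC (the table size and indexing are assumptions for illustration).

    #include <stdint.h>
    #include <stdbool.h>

    #define PHT_ENTRIES 1024          /* assumed table size */

    /* One 2-bit saturating counter per entry:
       0,1 -> predict not taken; 2,3 -> predict taken. */
    static uint8_t pht[PHT_ENTRIES];

    static bool predict_taken(uint64_t pc)
    {
        return pht[(pc >> 2) % PHT_ENTRIES] >= 2;
    }

    /* Called once the branch outcome is known, to update the history. */
    static void train(uint64_t pc, bool taken)
    {
        uint8_t *ctr = &pht[(pc >> 2) % PHT_ENTRIES];
        if (taken && *ctr < 3)
            (*ctr)++;
        else if (!taken && *ctr > 0)
            (*ctr)--;
    }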
Instruction status:
• Issue –
• Get the next instruction from the head of the instruction queue (maintained in FIFO
order).
• If there is a matching reservation station that is empty, issue the instruction to the
station with the operand values, if they are currently in the registers.
• If there is not an empty reservation station, then there is a structural hazard and the
instruction stalls until a station or buffer is freed.
• If the operands are not in the registers, keep track of the functional units that will
produce the operands. This step renames registers, eliminating WAR and WAW
hazards.
Instruction status:
• Execute –
• If one or more of the operands is not yet available, monitor the common data bus
while waiting for it to be computed.
• When an operand becomes available, it is placed into any reservation station
awaiting it.
• When all the operands are available, the operation can be executed at the
corresponding functional unit.
• By delaying instruction execution until the operands are available, RAW
hazards are avoided.
Instruction status:
• Write result –
• When the result is available, write it on the CDB and from there into the registers
and into any reservation stations (including store buffers) waiting for this result.
• Stores are buffered in the store buffer until both the value to be stored and the store
address are available, then the result is written as soon as the memory unit is free.
• The combination of the common result bus and the retrieval of results from
the bus by the reservation stations implements the forwarding and
bypassing mechanisms used in a statically scheduled pipeline.
• However, a dynamically scheduled scheme introduces one cycle of latency
between source and result, since the matching of a result and its use cannot
be done until the Write Result stage.
Each reservation station has seven fields:
• Op—The operation to perform on source operands S1 and S2.
• Qj, Qk—The reservation stations that will produce the corresponding source
operand; a value of zero indicates that the source operand is already available
in Vj or Vk, or is unnecessary.
• Vj, Vk—The value of the source operands. Note that only one of the V fields
or the Q field is valid for each operand. For loads, the Vk field is used to hold
the offset field.
• A—Used to hold information for the memory address calculation for a load
or store. Initially, the immediate field of the instruction is stored here; after
the address calculation, the effective address is stored here.
• Busy—Indicates that this reservation station and its accompanying functional
unit are occupied.
Register file:
• Qi—The number of the reservation station that contains the operation
whose result should be stored into this register. If the value of Qi is blank
(or 0), no currently active instruction is computing a result destined for this
register, meaning that the value is simply the register contents.
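• A C sketch of the bookkeeping that the Issue, Execute, and Write-result steps manipulate (the field names follow the slides; the sizes and types are illustrative assumptions).

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_RS   8       /* assumed number of reservation stations */
    #define NUM_REGS 32

    /* One reservation station, with the seven fields described above. */
    struct reservation_station {
        bool    busy;        /* station and its functional unit occupied */
        int     op;          /* operation to perform on S1 and S2        */
        int     qj, qk;      /* producing stations (0 = value already
                                available in Vj/Vk, or unnecessary)      */
        int64_t vj, vk;      /* source operand values (Vk holds the
                                offset for loads)                        */
        int64_t a;           /* immediate, then effective address, for
                                loads and stores                         */
    };

    /* Register status: Qi names the station whose result will be written
       to this register; 0 means the register value itself is current. */
    struct register_status {
        int     qi;
        int64_t value;
    };

    static struct reservation_station rs[NUM_RS];
    static struct register_status     regs[NUM_REGS];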
VLIW Approach
• VLIWs use multiple, independent functional units.
• Rather than attempting to issue multiple, independent instructions to the
units, a VLIW packages the multiple operations into one very long
instruction.
• Consider a VLIW processor whose instructions contain five operations: one
integer operation (which could also be a branch), two floating-point
operations, and two memory references.
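• A C sketch of what such an instruction word might look like (the slot layout mirrors the bullet above; the encodings and field widths are hypothetical).

    #include <stdint.h>

    /* One very long instruction: five operations issued together. */
    struct vliw_bundle {
        uint32_t int_or_branch;   /* one integer operation or branch  */
        uint32_t fp_op[2];        /* two floating-point operations    */
        uint32_t mem_op[2];       /* two memory references            */
    };
    /* Slots with no useful work are filled with no-ops, one source of
       code-size growth in the VLIW approach. */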
Branch-Target Buffers
• A cache that stores the predicted address for the next instruction.
• Branch penalty – to reduce the branch penalty of a pipeline, we must
know whether the as-yet-undecoded instruction is a branch and, if so, what
the next program counter (PC) should be.
• If the PC of the fetched instruction matches an address in the prediction
buffer, then the corresponding predicted PC is used as the next PC.
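• A minimal C sketch of the lookup described above (direct-mapped table; the size and tag scheme are illustrative assumptions).

    #include <stdint.h>

    #define BTB_ENTRIES 1024

    struct btb_entry {
        uint64_t tag;            /* PC of the branch                  */
        uint64_t predicted_pc;   /* predicted target                  */
        int      valid;
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Consulted during fetch, before the instruction is even decoded. */
    static uint64_t next_pc(uint64_t pc)
    {
        struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->tag == pc)
            return e->predicted_pc;   /* hit: use the predicted PC     */
        return pc + 4;                /* miss: fetch sequentially      */
    }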
Speculation
• Advantage:
• the ability to uncover early the events that can stall the pipeline, such as cache misses.
• Disadvantage:
• It takes time and energy, and the recovery of incorrect speculation further reduces
performance.
• To support the higher instruction execution rate needed to benefit from speculation,
the processor must have additional resources, which take silicon area and power.
Value Prediction
• It is a technique for increasing the amount of ILP available in a program.
• It attempts to predict the value that will be produced by an instruction.
• Limited success – most instructions likely produce a different value every
time they are executed.
• It is useful when an instruction produces a value chosen from a small set of
potential values (predict resulting value by correlating it with other
program behavior) or loads a value that changes infrequently.
• The results of value prediction have not been sufficiently attractive to justify
its incorporation in real processors.
Speed-up Challenge: Balancing Load
Message-Passing Multiprocessors
• Each processor has a private physical address space.
• Coordination is built with message passing (i.e. explicitly sending and
receiving information).
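• The slides do not name a particular library; as one common example (MPI), a minimal sketch of coordination by explicit send and receive between two processes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                 /* process 0 sends ...       */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* ... process 1 receives    */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }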
Cluster
• An example of a message-passing parallel computer.
• Collections of computers that are connected to each other over their I/O
interconnect via standard network switches and cables.
• Each runs a distinct copy of the operating system.
Drawbacks of clusters
• The cost of administering a cluster of n machines is about the same as the
cost of administering n independent machines, while the cost of
administering a shared memory multiprocessor with n processors is about
the same as administering a single machine.
• Processors in a cluster are usually connected using the I/O interconnect of
each computer, whereas the cores in a multiprocessor are usually
connected on the memory interconnect of the computer.
Advantages of cluster:
• Because a cluster consists of independent computers connected through a local
area network, it is easier to replace a machine in a cluster than in an SMP.
• Because clusters are constructed from whole computers and independent, scalable
networks, this isolation also makes it easier to expand the system without
bringing down the application that runs on top of the cluster.
• Lower cost, high availability, fault tolerance, and rapid incremental
expandability make clusters attractive to service providers for the World
Wide Web.
Ring Topology
• In the ring network, with P processors, the total network bandwidth would
be P times the bandwidth of one link.
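• For example (link speed chosen only for illustration): with P = 8 processors and 1 GB/s links, the ring has 8 links, so its total network bandwidth is 8 × 1 GB/s = 8 GB/s.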
Multistage Network
• A network that supplies a small switch at each node.
• Switches are smaller than processor-memory-switch nodes and thus may be
packed more densely, lessening distance and increasing performance.
• Two popular multistage organizations:
1. Crossbar network
2. Omega network
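• As a rough cost comparison (illustrative numbers for n = 8 nodes): a crossbar needs about n² = 64 crosspoint switches, while an Omega network needs n/2 · log₂ n = 12 switch boxes, each a small 2 × 2 crossbar. The Omega network is cheaper, but it cannot route every combination of sources and destinations at the same time.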
Multiprocessing and Multithreading
• Initial computer performance improvements came from use of:
• Innovative manufacturing techniques
• Advancement of VLSI technology
• In later years,
• Most improvements came from exploitation of ILP.
• Both software and hardware techniques are being used.
• Pipelining, dynamic instruction scheduling, out of order execution, VLIW,
vector processing, etc.
• ILP now appears fully exploited
• Modern multiple-issue processors have become incredibly complex, and
performance improvement through increasing complexity, silicon area, and
power seems to be diminishing.
User Threads
• Thread management done in user space.
• User threads are supported and managed without kernel support.
• Invisible to the kernel.
• If one thread blocks, entire process blocks.
• Limited benefits of threading.
Kernel Threads
• Kernel threads supported and managed directly by the OS.
• Kernel creates Light Weight Processes (LWPs).
• Most OS support kernel threads: Linux, Mac OS, Solaris, Windows.
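• A minimal kernel-thread example in C using POSIX threads (pthreads are one common kernel-thread interface; the worker function here is made up for illustration).

    #include <pthread.h>
    #include <stdio.h>

    /* Work done by each thread: print its argument. */
    static void *worker(void *arg)
    {
        printf("hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];

        /* Each pthread_create asks the OS to schedule a kernel-visible
           thread; if one blocks, the others keep running. */
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);

        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);

        return 0;
    }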
Benefits of Threading
• Responsiveness:
• Threads share code and data.
• Thread creation and switching therefore much more efficient than that for
processes.
• As an example in Solaris:
• Creating a thread is about 30 times less costly than creating a process.
• Context switching between threads is about 5 times faster than between processes.
• Truly concurrent execution:
• Possible on hardware that supports concurrent execution of threads: SMP, multicore,
SMT, etc.
Thread Examples
• Independent threads occur naturally in several applications:
• Web server: different HTTP requests can be handled by separate threads.
• File Server
• Banking: independent transactions
• Desktop applications: file loading, display, computations, etc. can be
threads.