Study Guide Chapter 3

Chapter 3 discusses instruction-level parallelism (ILP) in CPU architecture, focusing on techniques like pipelining and dynamic scheduling to enhance performance. It covers the challenges posed by conditional code and dependencies, including data, name, and control dependencies, and introduces various optimization techniques such as loop unrolling and branch prediction. Additionally, it explores dynamic scheduling methods, register renaming, multiple issue, and multithreading techniques to improve CPU efficiency.


Exam 2 Study Guide

Chapter 3 is on utilizing instruction-level parallelism (ILP) to improve performance in a CPU architecture.


ILP refers to the modern techniques of pipelining instructions to allow multiple instructions to be in
flight at the same time at different execution stages. This may be combined with dynamic scheduling as
well, allowing the re-ordering of instructions for better efficiency, or statically scheduled with any
optimization taking place at compile time. ILP is constrained by a few different factors, primarily the
presence of conditional code and dependencies. Conditional code may be handled in a few different
fashions to minimize its impact but remains a performance problem. The presence of conditional code
limits the ability to reorganize other code, as it is generally not possible to reschedule past the condition.
The book estimates that branches make up 15-25% of instructions in modern code, meaning that on average only roughly three to six instructions can be reordered before running into a branch.

Dependencies generally come in one of three categories: data dependencies, name dependencies, and
control dependencies. Data dependencies occur when instructions cannot be reordered because the result of one instruction, directly or indirectly, requires the completion of a previous instruction. In a simplified math example, if you had the equation A = B + C * D, you would get a different result for A (disregarding values chosen specifically to avoid this, such as zero for all) if you tried to change the order of operations within the equation for better efficiency. In a program, this often involves doing math with a variable that is modified earlier in the code: if the math is moved ahead of the modification, the result is not the expected value because the correct value of the variable has not yet been computed.
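To make this concrete, here is a minimal C sketch of the A = B + C * D example broken into the two primitive operations a compiler or CPU actually sees (function and variable names are just for illustration):

```c
/* The multiply must complete before the add can use its result, so the two
 * statements cannot be swapped without changing the value of A. */
int compute(int B, int C, int D) {
    int t = C * D;   /* produces t */
    int A = B + t;   /* consumes t: a true (read-after-write) data dependence */
    return A;
}
```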

These dependencies give rise to hazards in three patterns: read after write (RAW), write after write (WAW), and write after read (WAR). Read after read also occurs, but is not a hazard: no data is modified, so values remain consistent. In read after write, the order must be maintained so that a read occurring after a write receives the correct value (reversing the order would read the unmodified value). Write after write causes a problem if a later write in the program is moved in front of an earlier write to the same location; the program then continues with incorrect data stored, because the write that should have occurred first now occurs second and overwrites the newer value. This is generally not a large consideration in simple in-order pipelines, where writes happen in a fixed stage. Write after read occurs when a write is moved in front of a read of the same data, so the read receives the (incorrectly) updated value instead of whatever the pre-write value was.
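A quick sketch of the three hazard patterns, written as plain C assignments standing in for individual instructions (the variable names are arbitrary stand-ins for registers):

```c
int r1, r2, r3;

void hazard_examples(void) {
    /* RAW: the second statement reads r1, so it cannot be moved ahead of
     * the first statement, which writes r1. */
    r1 = r2 + 1;
    r3 = r1 * 2;

    /* WAR: the second statement writes r2; moving it ahead of the first
     * would make the first statement read the new value instead. */
    r3 = r2 + 5;
    r2 = 0;

    /* WAW: both statements write r3; reordering them would leave the older
     * value (10) in r3 after both have run. */
    r3 = 10;
    r3 = 20;
}
```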

Name dependencies occur when multiple instructions use the same named register or memory location without any actual flow of data between them. These have similar consequences to data dependencies, but the instructions are not chained together in any fashion; they are simply reusing the same resource. Name dependencies are what give rise to the WAR and WAW hazards above, while true data dependencies give rise to RAW hazards.

In a control dependency, instructions are dependent if they can’t be reordered with respect to a branch without changing the operation of the program. A simple example of this is an IF/THEN/ELSE statement. The contents of the ELSE are control dependent upon the branch; they can’t be moved ahead of it, or into the prior (THEN) path, without changing the operation of the program.
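As a concrete sketch (a hypothetical function, same idea as the IF/THEN/ELSE above):

```c
/* Each assignment is control dependent on the branch condition; neither can
 * be hoisted above the if or moved into the other arm without changing the
 * program's behavior. */
void absolute_value(int x, int *out) {
    if (x > 0) {
        *out = x;    /* control dependent on (x > 0) being true  */
    } else {
        *out = -x;   /* control dependent on (x > 0) being false */
    }
}
```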
There are a number of techniques that can be used to expose and exploit ILP for increased performance.
These may be present in the compiler for compile-time optimizations, in the hardware for runtime
optimizations, or in both.

One of the fundamental hardware ILP optimizations is in intelligently scheduling the pipeline of the CPU.
The pipeline allows multiple instructions to be issued and worked on during each clock cycle, with each
of them being in a different execution stage. Pipeline stalls occur when dependencies occur in close
proximity and can’t be worked around, or if a branch is mispredicted (among other things). These stalls may be avoided if instructions can be placed far enough apart that a dependent instruction is not scheduled before the instruction it depends on has finished executing.
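Real scheduling happens on machine instructions, but the idea can be sketched at the C level (hypothetical code; assume each pointer read acts like a load that takes a cycle or two to complete):

```c
/* Unscheduled: each loaded value is used immediately, which on a simple
 * pipeline forces a stall while the load completes. */
void unscheduled(int *p, int *q, int *sum, int *count) {
    int b = *p;
    *sum += b;       /* uses b right after its load: likely stall */
    int c = *q;
    *count += c;     /* uses c right after its load: likely stall */
}

/* Scheduled: the second (independent) load is moved between the first load
 * and its use, giving each load time to finish before its value is needed. */
void scheduled(int *p, int *q, int *sum, int *count) {
    int b = *p;
    int c = *q;
    *sum += b;
    *count += c;
}
```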

Another technique of this type is loop unrolling. In a program structure where you have a loop that has
no dependencies between loop iterations, the compiler can pull the loop apart and restructure the
execution to allow for more efficient execution. For example, if you had an architecture that could issue
4 reads a clock, and a loop that required reading in two values per loop execution, the compiler could
restructure the loop to pull in all four values in one execution of the loop, cutting the repetitions of the
loop in half by duplicating the work in each individual iteration. This also serves to reduce branch control hazards: with fewer invocations of the loop body, the loop's control decision (loop again or not) is evaluated fewer times.
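A minimal C sketch of that idea, unrolling by a factor of two so each iteration performs four reads instead of two (assumes n is even; a real compiler would also add cleanup code for leftover iterations):

```c
/* Original loop: two reads (a[i] and b[i]) per iteration, one branch per
 * element. */
void sum_arrays(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Unrolled by two: four reads per iteration and half as many loop branches. */
void sum_arrays_unrolled(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i += 2) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
    }
}
```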

The next scheduling improvement has to do with mitigating branch stalls by adding support for intelligently deciding whether a branch is likely to be taken. The simplest version of branch prediction relies on a static table that tracks how often a branch was taken in the past and uses that to decide whether the branch will be taken in the future. This does nothing for the first execution of a branch, since there is no history for it; it's mainly useful for loops. A more sophisticated version tracks branches as the program runs and dynamically maintains a record of each branch's recent behavior to predict whether it will be taken again. These come in a 1-bit form, which flips its prediction every time the branch's behavior changes, and a 2-bit form, which requires two consecutive mispredictions (taken, then not taken twice, for example) before the prediction changes.
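A minimal sketch of a 2-bit saturating-counter predictor for a single branch (the starting state is an arbitrary choice for illustration):

```c
/* States 0-1 predict not taken, states 2-3 predict taken.  Two consecutive
 * mispredictions are needed to flip the prediction, unlike a 1-bit predictor
 * which flips on every miss. */
static unsigned char state = 2;   /* start "weakly taken" (arbitrary) */

int predict_taken(void) {
    return state >= 2;
}

void update_predictor(int was_taken) {
    if (was_taken) {
        if (state < 3) state++;   /* saturate at strongly taken */
    } else {
        if (state > 0) state--;   /* saturate at strongly not taken */
    }
}
```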

A more modern approach uses correlated branch predictors, such that the hardware can note patterns
between branches and apply this to future branch behavior prediction. For example, if there are
branches B1, B2, and B3, a correlated system could note that B3 is always taken if B1 is taken and
avoided if B1 is avoided. This allows the system to predict B3 with complete accuracy in this situation.
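A sketch of a very small correlating predictor: one bit of global history (was the most recent branch taken?) selects between two 2-bit counters for each branch. In the B1/B3 example, the counter used for B3 when B1 was taken can settle on "taken" while the other settles on "not taken." The table size and the simple modulo indexing are arbitrary choices for illustration:

```c
#define BP_ENTRIES 1024

static unsigned char counters[BP_ENTRIES][2]; /* [branch index][global history bit] */
static int last_taken = 0;                    /* 1-bit global history */

int correlating_predict(unsigned pc) {
    return counters[pc % BP_ENTRIES][last_taken] >= 2;
}

void correlating_update(unsigned pc, int was_taken) {
    unsigned char *c = &counters[pc % BP_ENTRIES][last_taken];
    if (was_taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    last_taken = was_taken;   /* record this outcome as the new history */
}
```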

Another approach is referred to as a tournament predictor. In this setup, the hardware maintains both a global predictor, which uses the recent behavior of all branches, and a local predictor, which uses the history of the individual branch (indexed by its address). A selector tracks which of the two has been more accurate for each branch and uses that predictor's guess.
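One way to sketch the selection logic: a 2-bit chooser per branch tracks which underlying predictor has been more accurate recently and uses that one's guess (the local and global predictors themselves are assumed to exist elsewhere, e.g. along the lines of the sketches above):

```c
#define CHOOSER_ENTRIES 1024

static unsigned char chooser[CHOOSER_ENTRIES]; /* 0-1 favor local, 2-3 favor global */

int tournament_predict(unsigned pc, int local_pred, int global_pred) {
    return (chooser[pc % CHOOSER_ENTRIES] >= 2) ? global_pred : local_pred;
}

void tournament_update(unsigned pc, int local_pred, int global_pred, int was_taken) {
    unsigned char *c = &chooser[pc % CHOOSER_ENTRIES];
    int local_ok  = (local_pred  == was_taken);
    int global_ok = (global_pred == was_taken);
    if (global_ok && !local_ok && *c < 3) (*c)++;   /* shift toward global */
    if (local_ok && !global_ok && *c > 0) (*c)--;   /* shift toward local  */
}
```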

Dynamic scheduling has also been referred to several times as a method for extracting ILP in an
architecture. This approach seeks to avoid pipeline stalls by moving instruction order around to better
utilize the processor pipeline. This reordering introduces data hazards that would not occur in a static issue design, where you never have to account for the effects of moving dependent instructions around. These hazards can be dealt with using a variety of mitigation techniques.

The most common method of avoiding the data hazards introduced by dynamic scheduling is register renaming, which is used in all modern desktop architectures. In this scheme, instructions reference a set of addressable, or named, registers. However, the implementation of the ISA contains a much larger number of physical registers that are not directly addressable. The CPU maps each named register referenced by an instruction onto a physical register, allowing multiple versions of the data for a single named register to exist at once, since each version lives in a separate physical register that is selected as needed to represent the named register.
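A rough sketch of the mapping idea (register counts and the free-register handling are greatly simplified; real hardware recycles physical registers once their values are no longer needed):

```c
#define ARCH_REGS  32    /* named registers visible to the programmer */
#define PHYS_REGS  128   /* larger pool of physical registers */

static int rename_map[ARCH_REGS];  /* named register -> current physical register */
static int next_free = ARCH_REGS;  /* naive "free list": just a counter */

/* An instruction that writes named register rd gets a fresh physical
 * register, so older in-flight readers still see the previous value. */
int rename_dest(int rd) {
    int phys = next_free++ % PHYS_REGS;   /* real hardware reuses freed registers */
    rename_map[rd] = phys;
    return phys;
}

/* Source operands simply read the current mapping for their named register. */
int rename_src(int rs) {
    return rename_map[rs];
}
```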

Register renaming is complemented by a reorder buffer (ROB), which adds a commit step (i.e., actually completing the operation of an instruction) to the standard issue/execute/write cycle. This approach allows the CPU to issue instructions (i.e., read them from the instruction queue) in order, dynamically reschedule them for execution, write the results to the ROB, and then commit the results in order by retiring completed entries from the head of the ROB. From an external view, the CPU looks like an in-order system, pulling in sequential instructions and completing them in order. Internally it is dynamically scheduling execution, but the ROB forces the commits to occur in order so that the reordering never becomes visible.
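A simplified sketch of in-order commit through a ROB (structure and sizes are illustrative, not any particular CPU's design): instructions get ROB entries in program order at issue, results can arrive out of order, but only the oldest entry is ever retired, and only once its result is ready.

```c
#include <stdbool.h>

#define ROB_SIZE 64

struct rob_entry {
    bool ready;   /* has the result been written back yet? */
    int  dest;    /* architectural register to update at commit */
    int  value;   /* result value to commit */
};

static struct rob_entry rob[ROB_SIZE];
static int head = 0, tail = 0;   /* head = oldest entry, tail = next free slot */

/* Called each cycle: retire consecutive ready entries from the head, oldest
 * first, so architectural state is only ever updated in program order. */
void commit(int arch_regs[]) {
    while (head != tail && rob[head].ready) {
        arch_regs[rob[head].dest] = rob[head].value;
        rob[head].ready = false;
        head = (head + 1) % ROB_SIZE;
    }
}
```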

Another technique that can be combined with speculation (or not) and with dynamic scheduling (or not) is multiple issue. In this implementation, some functional units inside the CPU (ALU, FPU, etc.) are duplicated and/or other parts are widened to handle multiple operations at once (register files with multiple read/write ports, for example). Modern CPUs typically issue 4+ instructions per clock. Multiple issue is used in all modern architectures, even simple statically scheduled ones, but it works better with more sophisticated implementations because they give the technique more room to actually execute the instructions fetched. I won't be asking you to duplicate or answer questions about the implementation graphics on page 223; that would be beyond the scope of the exam.

The last concept on the exam will be multithreading techniques. These are various techniques for switching the execution resources of a CPU core from one task to another, either in response to stalls in the instruction stream (i.e., if a thread stalls out, switch to another one while waiting for the stall to clear) or based upon a scheduling algorithm. The basic approaches are coarse-grained multithreading, which isn't used in modern architectures and involves relatively slow swapping between active threads; fine-grained multithreading, which swaps threads each clock cycle (skipping a stalled thread and continuing with an active one); and simultaneous multithreading (SMT), in which instructions from multiple threads can coexist in the CPU pipeline at the same time. SMT relies on larger register files and register renaming to keep the context of each thread current, and it allows the CPU to issue instructions from multiple threads in the same clock cycle. SMT is commonly used in modern CPUs to provide “extra” threads that can fill instruction issue slots which would otherwise go unused, increasing overall efficiency in most circumstances. The technique is not an improvement (and may cause performance losses) if a single thread was already using all the issue resources efficiently.
