Super Scalar Architecture With Dynamic Branch Prediction
Abstract – As a novel approach, we propose a Super Scalar architecture with
Dynamic Branch Prediction that exploits instruction-level parallelism to
increase CPU throughput at the same clock rate. A Super Scalar processor,
unlike a scalar processor, executes more than one instruction during a clock
cycle by simultaneously dispatching multiple instructions to redundant
functional units on the processor. Branch prediction, on the other hand,
foretells the outcome of conditional branch instructions. Accurate branch
prediction techniques are essential for throughput enhancement. In Dynamic
Branch Prediction, the hardware adjusts the prediction while execution
proceeds; the prediction is based on the computational history of the program.

The results show that the CPI, and thereby the throughput, of the processor
improves in comparison with the simple five-stage pipeline processor, which
executes one instruction every clock cycle.

Keywords: Super Scalar, Dynamic Branch Predictor, Reorder Buffer, Branch
Table Buffer.

1. INTRODUCTION
Exploitation of parallelism is essential to enhancing the throughput of a
processor. The five-stage pipelined processor is one of the simplest designs
that achieves parallelism: it begins fetching and decoding an instruction
before the prior instruction finishes its execution. Super Scalar architecture
extends the concept of instruction pipelining and decreases the idle time of
the CPU components.

Super Scalar architecture includes a long instruction pipeline and multiple
identical execution units. In Super Scalar processors, multiple instructions
are fetched and passed to a dispatcher. The dispatcher compares the
instructions and decides whether they can be executed simultaneously. It then
forwards the independent instructions to the available execution units of the
processor, thereby executing them in parallel. The more instructions the Super
Scalar processor is able to dispatch simultaneously, the more instructions are
completed in a given cycle.

The intricacy of Super Scalar architecture lies in the design of an effective
dispatch unit. The dispatcher needs to determine quickly and correctly whether
instructions can be executed in parallel, and to dispatch them in such a way
as to keep as many execution units busy as possible.

Branch predictors are crucial for achieving high performance in today’s
processors. They are broadly classified as Static Branch Predictors and
Dynamic Branch Predictors. The former predict the outcome of a branch based
solely on the branch instruction, whereas the latter predict the outcome based
on the dynamic history of the branch instruction. Dynamic branch prediction
can be implemented in many ways; we use a 2-bit Dynamic Branch Predictor in
our Super Scalar architecture to enhance the performance of the processor.

The performance parameter (CPI) for a set of benchmarks is calculated using
the simple five-stage pipeline and compared with that of the Super Scalar
architecture.

2. RELATED WORK
Currently, extensive research is being done to improve the performance of
superscalar architectures. Superscalar architectures have the capacity to
reduce the clock cycles per
instruction by fetching more instructions; the pipeline stages, however, have
to improve to exploit the fetch bandwidth. A recent study [1] [5] has
indicated that the performance of a large instruction window that receives all
the renamed registers can be substantially improved by partitioning the window
into several small blocks, each holding a dynamic code sequence. Performance
thus improved by a factor of 1.5 to 3. Also, loops show a strong tendency to
exhibit vector-like behavior, as evident in our optimization. So vector tables
have been used.

Branch prediction in superscalar architectures is a critical parameter that
affects performance. A misprediction of a branch can considerably degrade the
performance of the superscalar processor. Dynamic scheduling is another aspect
of prime importance to performance. So active studies are being done on
various branch prediction schemes, such as static and dynamic branch
prediction, and on the effect of branch target buffers. A branch target buffer
(BTB), which reduces the performance penalty of branches in pipelined
processors by predicting the path of the branch and caching information used
by the branch, is analyzed in [2]. The focus is on implementing a BTB with a
limited number of bits. A method for discarding branches from the BTB is
examined. This method discards the branch with the smallest expected value for
improving performance; it outperforms the least recently used (LRU) strategy
by a small margin, at the cost of additional complexity. Secondly, the study
resolves what information is to be stored in the buffer. A BTB entry can
consist of one or more of the following: branch tag, prediction information,
the branch target address, and instructions at the branch target. Various BTB
designs, with one or more of these fields, are evaluated and compared.

Handling multiple instructions at a time reduces the clock cycles per
instruction significantly. The two most popular architectures that make use of
multiple instruction issue are the superscalar and VLIW architectures. In the
superscalar processor, the hardware decides at run time which instructions are
issued in parallel, while the VLIW processor reorders the original sequential
code into fixed-size instruction groups that are fetched and issued in
parallel. Gordon Steven et al. [3] analyzed the advantages and limitations of
each of these processors in exploiting instruction-level parallelism. They
further proposed that, for the processors to achieve full performance, a
confluence of VLIW and superscalar architecture is required. Thus a hybrid
processor was designed that incorporates the aggressive run time instruction
scheduling in the VLIW and the object code compatibility of the superscalar
architecture over a wide range of implementations, and removes any requirement
for no-ops.

A lot of focus has also been on improving the fetching of instructions in a
superscalar architecture. A fetch mechanism is better if it provides higher
performance, but also if it is less complex, takes fewer resources, requires
less chip area, or consumes less power. Oliverio J. Santana et al. [4]
designed a novel fetch engine based on the execution of long streams of
sequential instructions. An instruction stream is a sequential run of
instructions from the target of a taken branch to the next taken branch. A
single instruction stream may contain multiple basic blocks as long as all the
intermediate branches are not taken. As such, an instruction stream is fully
defined by its starting instruction address and its length, since the behavior
of the branches contained inside the stream is implicit in the definition: all
intermediate branches are not taken, while the terminating branch is always
taken.

3. METHODOLOGY

3.1 FIVE STAGE PIPELINE

The classic five-stage pipeline has five major stages for executing an
instruction, namely:
1. Instruction Fetch.
2. Instruction Decode and Register Fetch.
3. Execute.
4. Memory Access.
5. Write Back.
A five-stage pipeline can be thought of as a series of data paths shifted in
time. The simple five-stage pipeline fetches and
executes only one instruction per cycle. Though simple pipelining helps in
reducing the Clock Cycles Per Instruction (CPI), the CPI can be improved
further by fetching and executing multiple instructions per cycle. This leads
to the design of a new architecture, the Super Scalar architecture, which
improves CPI by exploiting both of these techniques.

Figure 1: Simple five stage pipeline

Figure 2: Super Scalar Architecture

Figure 2 shows the main components of the Super Scalar architecture. The IF
stage fetches multiple instructions simultaneously and forwards them to the
decode stage for further processing. The IF stage uses the entries in the
2-bit Branch Table Buffer (BTB) to determine the next target address for the
instruction fetch process. The BTB stores branch and jump addresses, their
target addresses, and their prediction information.

BRANCH ADDRESS    TARGET ADDRESS    PREDICTION BITS
………………            ………………            ………………

Figure 4: State Diagram of a 2-bit Predictor.

Figure 4 shows the state flow in a 2-bit predictor. The state is incremented
if the prediction is correct and decremented if a misprediction occurs. In a
2-bit predictor, a prediction must miss twice before it is changed. The two-bit
predictor is implemented in the BTB by assigning two state bits to each entry
in the BTB. The BTB is indexed using the branch address to read the prediction
bits and thereby predict the outcome of the conditional branch.

The Instruction window and the Issue stage separate the Decode stage from the
Execution units of the architecture. The Instruction window holds all the
instructions waiting to be dispatched to the execution stage. We use Issue
logic that examines the waiting instructions in the Instruction window. If no
dependencies or hazards are involved, the logic simultaneously assigns a
number of instructions to the execution units, up to a maximum Issue
bandwidth. The program order of the issued instructions is stored in a reorder
buffer. Instruction issue from the instruction window is out-of-order.

Instructions that are independent and have all their operands ready are
dispatched to the execution units. Dispatch is usually not a pipeline stage.
Both dispatch and execution are done out-of-order. An instruction is completed
when the execution stage finishes the computation and the result is made
available for forwarding and buffering. Instruction completion is out of
program order.

Once an instruction is completed, the next phase is the commit phase.
Committing an operation means that the results of the operation have been made
permanent and the operation is retired from the scheduler. We implement the
Write Back stage in order, and a re-order buffer is used to enforce this
ordering. The re-order buffer keeps the original program order of the
instructions and allows result serialization during the Write Back stage. It
is implemented as a circular FIFO buffer; entries are allocated in the issue
stage and deallocated serially when the instruction retires. State bits in the
re-order buffer are used to check whether the instruction has completed its
execution.

Various benchmark programs are simulated on our Super Scalar based instruction
set simulator. In the software architecture of the simulator, an assembler is
invoked that converts assembly language programs into machine language. The
output of the assembler is in hexadecimal format. This assembler output, a
text file, is then passed to the simulator as input. The simulator simulates
the machine language program and creates two output text files, “Result.txt”
and “memory.txt”, containing the register and memory location values
respectively. The Clock Cycles Per Instruction (CPI) and Clock value are also
displayed in the output files.

5. CONCLUSION AND FUTURE WORK

A Super Scalar architecture is proposed as an alternative to the five-stage
pipeline architecture. The architectures were experimentally verified on a few
benchmarks, and the performance analysis was made and plotted. The Super
Scalar architecture, which executes multiple instructions per clock cycle,
shows improvement in the CPI value, thereby improving the throughput and
performance of the processor. The performance of the Super Scalar architecture
will be further improved and studied by incorporating complex branch
prediction schemes such as correlating branch predictors, which decide the
branch outcome based on the history of other branch instructions in the
program as well. The number of redundant functional units will also be
increased, and more instructions will be fetched simultaneously.

REFERENCES

[1] Vajapeyam, S. and Mitra, T. 1997. Improving superscalar instruction
dispatch and issue by exploiting dynamic code sequences. Proceedings of the
24th Annual International Symposium on Computer Architecture.
http://en.wikipedia.org/wiki/Superscalar
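
SUPPLEMENTARY SKETCH

As an illustration of the 2-bit predictor held in the BTB (Section 3), the
following minimal Python sketch models a BTB whose entries carry a 2-bit
saturating counter. It is not the authors' simulator code. The class and
method names (BranchTargetBuffer, predict, update), the dictionary-based
table, and the initial "weakly not taken" state are assumptions made for
illustration. The sketch uses the common saturating-counter formulation, which
may differ in detail from the state diagram in Figure 4.

```python
# Hypothetical sketch of a BTB with a 2-bit saturating counter per entry.
# All names and the initial state are illustrative assumptions.

class BranchTargetBuffer:
    """Maps a branch address to [target address, 2-bit counter state]."""

    STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN = range(4)

    def __init__(self):
        self.entries = {}  # branch_addr -> [target_addr, state]

    def predict(self, branch_addr, fallthrough_addr):
        """Return (predicted_taken, next_fetch_addr) for the IF stage."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            # Unknown branch: assume not taken and fetch sequentially.
            return False, fallthrough_addr
        target, state = entry
        taken = state >= self.WEAK_TAKEN
        return taken, (target if taken else fallthrough_addr)

    def update(self, branch_addr, target_addr, taken):
        """Move the counter one step toward the actual outcome (saturating)."""
        entry = self.entries.setdefault(
            branch_addr, [target_addr, self.WEAK_NOT_TAKEN]
        )
        entry[0] = target_addr
        if taken:
            entry[1] = min(entry[1] + 1, self.STRONG_TAKEN)
        else:
            entry[1] = max(entry[1] - 1, self.STRONG_NOT_TAKEN)
```

Because the counter saturates at both ends, a branch in a strong state must be
mispredicted twice in a row before the predicted direction flips, which matches
the "must miss twice" behavior described for the 2-bit predictor.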