Lecture 14 Building A Datapath Extended
Recall that, in Section 3, we designed an ALU based on (a) building blocks such as
multiplexers for selecting an operation to produce ALU output, (b) carry lookahead adders to
reduce the complexity and (in practice) the critical pathlength of arithmetic operations, and (c)
components such as coprocessors to perform costly operations such as floating point
arithmetic. We also showed that computer arithmetic suffers from errors due to finite
precision, lack of associativity, and limitations of protocols such as the IEEE 754 floating
point standard.
4.1.1. Review
In Figure 4.1, the typical organization of a modern von Neumann processor is illustrated. Note
that the CPU, memory subsystem, and I/O subsystem are connected by address, data, and
control buses. The fact that these are parallel buses is denoted by the slash through each line
that signifies a bus.
Figure 4.1. Schematic diagram of a modern von Neumann processor, where the CPU is
denoted by a shaded box - adapted from [Maf01].
• Processor (CPU) is the active part of the computer, which does all the work of data
manipulation and decision making.
• Datapath is the hardware that performs all the required operations, for example, ALU,
registers, and internal buses.
• Control is the hardware that tells the datapath what to do, in terms of switching,
operation selection, data movement between ALU components, etc.
The processor represented by the shaded block in Figure 4.1 is organized as shown in Figure
4.2. Observe that the ALU performs I/O on data stored in the register file, while the Control
Unit sends (receives) control signals (resp. data) in conjunction with the register file.
Figure 4.2. Schematic diagram of the processor in Figure 4.1, adapted from [Maf01].
In MIPS, the ISA determines many aspects of the processor implementation. For example,
implementational strategies and goals affect clock rate and CPI. These implementational
constraints cause parameters of the components in Figure 4.3 to be modified throughout the
design process.
Such implementational concerns are reflected in the use of logic elements and clocking
strategies. For example, with combinational elements such as adders, multiplexers, or shifters,
outputs depend only on current inputs. However, sequential elements such as memory and
registers contain state information, and their output thus depends on their inputs (data values
and clock) as well as on the stored state. The clock determines the order of events within a
gate, and defines when signals can be converted to data to be read or written to processor
components (e.g., registers or memory). For purposes of review, the following diagram of
clocking is presented:
Here, a signal that is held at logic high value is said to be asserted. In Section 1, we discussed
how edge-triggered clocking can support a precise state transition on the active clock pulse
edge (either the rising or falling edge, depending on what the designer selects). We also
reviewed the SR Latch based on NOR logic, and showed how this could be converted to a
clocked SR latch. From this, a clocked D Latch and the D flip-flop were derived. In particular,
the D flip-flop has a falling-edge trigger, and its output is initially deasserted (i.e., the logic
low value is present).
The register file (RF) is a hardware device that has two read ports and one write port
(corresponding to the two inputs and one output of the ALU). The RF and the ALU together
comprise the two elements required to compute MIPS R-format ALU instructions. The RF is
comprised of a set of registers that can be read or written by supplying a register number to be
accessed, as well (in the case of write operations) as a write authorization bit. A block diagram
of the RF is shown in Figure 4.4a.
(a)
(b)
(c)
Figure 4.4. Register file (a) block diagram, (b) implementation of two read ports, and (c)
implementation of write port - adapted from [Maf01].
Since reading of a register-stored value does not change the state of the register, no "safety
mechanism" is needed to prevent inadvertent overwriting of stored data, and we need only
supply the register number to obtain the data stored in that register. (This data is available at
the Read Data output in Figure 4.4a.) However, when writing to a register, we need (1) a
register number, (2) an authorization bit, for safety (because the previous contents of the
register selected for writing are overwritten by the write operation), and (3) a clock pulse that
controls writing of data into the register.
In this discussion and throughout this section, we will assume that the register file is structured
as shown in Figure 4.4a. We further assume that each register is constructed from a linear
array of D flip-flops, where each flip-flop has a clock (C) and data (D) input. The read ports
can be implemented using two multiplexers, each having log2(N) control lines, where N is the
number of registers in the RF. In Figure 4.4b, note that data from all N = 32 registers
flows out to the output muxes, and the data stream from the register to be read is selected using
the mux's five control lines. Similar to the ALU design presented in Section 3, parallelism is
exploited for speed and simplicity.
Figure 4.4c shows an implementation of the RF write port. Here, the write enable signal
is a clock pulse that activates the edge-triggered D flip-flops which comprise each register
(shown as a rectangle with clock (C) and data (D) inputs). The register number is input to an
n-to-2^n decoder (n = 5 for N = 32 registers), and acts as the control signal to switch the data
stream input into the Register Data input. The actual data switching is done by and-ing the
data stream with the decoder output: only the and gate that has a unitary (one-valued) decoder
output will pass the data into the selected register (because 1 and x = x).
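The read and write ports described above can be sketched behaviorally in Python. This is a minimal model rather than the circuit of Figure 4.4: the class name RegisterFile and its method names are assumptions made for illustration, the decoder line is and-ed with the write enable (rather than with each data bit), and the hardwired zero register of MIPS is not modeled.

```python
class RegisterFile:
    """Behavioral model of a register file with two read ports and one write port."""

    def __init__(self, num_regs=32, width=32):
        self.num_regs = num_regs
        self.mask = (1 << width) - 1
        self.regs = [0] * num_regs

    def read(self, rs, rt):
        # Each read port behaves like an N-to-1 multiplexer whose control
        # lines carry a register number. Reading never changes state.
        return self.regs[rs], self.regs[rt]

    def decode(self, reg_num):
        # n-to-2^n decoder: exactly one output line carries a 1.
        return [1 if i == reg_num else 0 for i in range(self.num_regs)]

    def write(self, reg_num, data, reg_write):
        # Models one active clock edge. A register stores the new data only
        # when its decoder line and the write enable are both 1.
        for i, line in enumerate(self.decode(reg_num)):
            if line & reg_write:
                self.regs[i] = data & self.mask


rf = RegisterFile()
rf.write(9, 0x1234, reg_write=1)   # write enabled: register 9 is updated
rf.write(10, 0xFFFF, reg_write=0)  # write enable deasserted: register 10 is unchanged
print(rf.read(9, 10))
```

Only the register whose decoder line is 1 can change, which is the "safety mechanism" role of the write authorization bit.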
We next discuss how to construct a datapath from a register file and an ALU, among other
components.
Simple datapath components include memory (stores the current instruction), PC or program
counter (stores the address of current instruction), and ALU (executes current instruction). The
interconnection of these simple components to form a basic datapath is illustrated in Figure
4.5. Note that the register file is written to by the output of the ALU. As in Section 4.1, the
register file shown in Figure 4.6 is clocked by the RegWrite signal.
Implementation of the datapath for I- and J-format instructions requires two more components
- a data memory and a sign extender, illustrated in Figure 4.6. The data memory stores ALU
results and operands, including instructions, and has two enabling inputs (MemWrite and
MemRead) that cannot both be active (have a logical high value) at the same time. The data
memory accepts an address and either accepts data (WriteData port if MemWrite is enabled)
or outputs data (ReadData port if MemRead is enabled), at the indicated address. The sign
extender adds 16 leading digits to a 16-bit word with most significant bit b, to produce a
32-bit word. In particular, the additional 16 digits have the same value as b, thus implementing
sign extension in two's complement representation.
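As a small illustration of the sign extender, the following Python sketch replicates bit b into the upper 16 bits. The function name sign_extend16 is an assumption made for this example; values are treated as unsigned 32-bit bit patterns.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit word to a 32-bit two's complement bit pattern."""
    value &= 0xFFFF                  # keep only the low 16 bits
    b = (value >> 15) & 1            # most significant bit of the 16-bit word
    upper = 0xFFFF0000 if b else 0   # sixteen copies of b
    return upper | value


print(hex(sign_extend16(0x0004)))  # positive offset: upper bits are 0
print(hex(sign_extend16(0xFFFC)))  # negative offset (-4): upper bits are 1
```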
Figure 4.6. Schematic diagram of Data Memory and Sign Extender, adapted from [Maf01].
Implementation of the datapath for R-format instructions is fairly straightforward - the register
file and the ALU are all that is required. The ALU accepts its input from the DataRead ports
of the register file, and the register file is written to by the ALUresult output of the ALU, in
combination with the RegWrite signal.
Figure 4.7. Schematic diagram of the R-format instruction datapath, adapted from [Maf01].
4.2.2. Load/Store Datapath
The load/store datapath uses instructions such as lw $t1, offset($t2), where offset denotes a
memory address offset applied to the base address in register $t2. The lw instruction reads
from memory and writes into register $t1. The sw instruction reads from register $t1 and
writes into memory. In order to compute the memory address, the MIPS ISA specification
says that we have to sign-extend the 16-bit offset to a 32-bit signed value. This is done using
the sign extender shown in Figure 4.6.
The load/store datapath is illustrated in Figure 4.8, and performs the following actions in the
order given:
1. Register Access takes input from the register file, to implement the instruction, data, or
address fetch step of the fetch-decode-execute cycle.
2. Memory Address Calculation decodes the base address and offset, combining them to
produce the actual memory address. This step uses the sign extender and ALU.
3. Read/Write from Memory takes data or instructions from the data memory, and
implements the first part of the execute step of the fetch/decode/execute cycle.
4. Write into Register File puts the data read from memory into the register file, implementing
the second part of the execute step of the fetch/decode/execute cycle.
Figure 4.8. Schematic diagram of the Load/Store instruction datapath. Note that
the execute step also includes writing of data back to the register file, which is not shown in
the figure, for simplicity [MK98].
The load/store datapath takes operand #1 (the base address) from the register file, and sign-
extends the offset, which is obtained from the instruction input to the register file. The sign-
extended offset and the base address are combined by the ALU to yield the memory address,
which is input to the Address port of the data memory. The MemRead signal is then activated,
and the output data obtained from the ReadData port of the data memory is then written back
to the Register File using its WriteData port, with RegWrite asserted.
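The sequence of actions for lw can be traced with a short Python sketch. It is a simplification under stated assumptions: the register file is a plain list, the data memory is a dictionary keyed by byte address, and the function names load_word and sign_extend16 are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    # Replicate bit 15 into the upper 16 bits.
    value &= 0xFFFF
    return value | (0xFFFF0000 if value & 0x8000 else 0)


def load_word(regs, data_mem, base_reg, dest_reg, offset16):
    """Model lw dest, offset(base) on the load/store datapath."""
    base = regs[base_reg]                                # 1. register access
    address = (base + sign_extend16(offset16)) & MASK32  # 2. ALU adds base and offset
    data = data_mem[address]                             # 3. MemRead: read data memory
    regs[dest_reg] = data                                # 4. RegWrite: write register file
    return address


regs = [0] * 32
regs[10] = 0x1000                       # base address in register 10 ($t2)
data_mem = {0x1008: 99}
load_word(regs, data_mem, base_reg=10, dest_reg=9, offset16=8)
print(regs[9])
```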
By "taking the branch", the ISA specification means that the ALU adds a sign-extended offset
to the program counter (PC). The offset is shifted left 2 bits to allow for word alignment (since
2^2 = 4, and words are comprised of 4 bytes). For a jump, by contrast, the lower 28 bits of the
PC are replaced with the lower 26 bits of the instruction shifted left 2 bits.
The branch instruction datapath is illustrated in Figure 4.9, and performs the following actions
in the order given:
1. Register Access takes input from the register file, to implement the instruction
fetch or data fetch step of the fetch-decode-execute cycle.
2. Calculate Branch Target - Concurrent with ALU #1's evaluation of the branch
condition, ALU #2 calculates the branch target address, to be ready for the branch if it
is taken. This completes the decode step of the fetch-decode-execute cycle.
3. Evaluate Branch Condition and Jump to BTA or PC+4 uses ALU #1 in Figure 4.9, to
determine whether or not the branch should be taken. Jump to BTA or PC+4 uses
control logic hardware to transfer control to the instruction referenced by the branch
target address. This effectively changes the PC to the branch target address, and
completes the execute step of the fetch-decode-execute cycle.
Figure 4.9. Schematic diagram of the Branch instruction datapath. Note that, unlike the
Load/Store datapath, the execute step does not include writing of results back to the register
file [MK98].
The branch datapath takes operand #1 (the offset) from the instruction input to the register file,
then sign-extends the offset. The sign-extended offset and the program counter (incremented
by 4 bytes to reference the next instruction after the branch instruction) are combined by ALU
#2 to yield the branch target address. The operands for the branch condition to evaluate are
concurrently obtained from the register file via the ReadData ports, and are input to ALU #1,
which outputs a one or zero value to the branch control logic.
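A minimal Python sketch of the beq datapath follows. The function names are assumptions made for illustration; the comparison is modeled as a subtraction whose result is tested for zero, and the delayed-branch behavior of MIPS is not modeled.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    # Replicate bit 15 into the upper 16 bits.
    value &= 0xFFFF
    return value | (0xFFFF0000 if value & 0x8000 else 0)


def branch_next_pc(pc, offset16, a, b):
    """Return the next PC for beq, given the two register operands a and b."""
    pc_plus_4 = (pc + 4) & MASK32
    # Target address: PC + 4 plus the sign-extended offset shifted left 2 bits.
    bta = (pc_plus_4 + (sign_extend16(offset16) << 2)) & MASK32
    # Branch condition: subtract the operands and test the result for zero.
    zero = 1 if ((a - b) & MASK32) == 0 else 0
    return bta if zero else pc_plus_4


print(hex(branch_next_pc(0x1000, 3, 5, 5)))  # operands equal: branch taken
print(hex(branch_next_pc(0x1000, 3, 5, 6)))  # operands differ: fall through to PC + 4
```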
MIPS has the special feature of a delayed branch, that is, instruction Ib which follows the
branch is always fetched, decoded, and prepared for execution. If the branch condition is false,
a normal branch occurs. If the branch condition is true, then Ib is executed. One wonders why
this extra work is performed - the answer is that delayed branch improves the efficiency of
pipeline execution, as we shall see in Section 5. Also, the use of branch-not-taken (where Ib is
executed) is sometimes the common case.
A single-cycle datapath executes in one cycle all instructions that the datapath is designed to
implement. This clearly impacts CPI in a beneficial way, namely, CPI = 1 cycle for all
instructions. In this section, we first examine the design discipline for implementing such a
datapath using the hardware components and instruction-specific datapaths developed in
Section 4.2. Then, we discover how the performance of a single-cycle datapath can be
improved using a multi-cycle implementation.
Let us begin by constructing a datapath with control structures taken from the results of
Section 4.2. The simplest way to connect the datapath components developed in Section 4.2
is to have them all execute an instruction concurrently, in one cycle. As a result, no datapath
component can be used more than once per cycle, which implies duplication of components.
To make this type of design more efficient without sacrificing speed, we can share a datapath
component by allowing the component to have multiple inputs and outputs selected by a
multiplexer.
The key to efficient single-cycle datapath design is to find commonalities among instruction
types. For example, the R-format MIPS instruction datapath of Figure 4.7 and the load/store
datapath of Figure 4.8 have similar register file and ALU connections. However, the following
differences can also be observed:
These two datapath designs can be combined to include separate instruction and data memory,
as shown in Figure 4.10. The combination requires an adder and an ALU to respectively
increment the PC and execute the R-format instruction.
Figure 4.10. Schematic diagram of a composite datapath for R-format and load/store
instructions [MK98].
Adding the branch datapath to the datapath illustrated in Figure 4.10 produces the augmented
datapath shown in Figure 4.11. The branch instruction uses the main ALU to compare its
operands and the adder computes the branch target address. Another multiplexer is required
to select either the next instruction address (PC + 4) or the branch target address to be the new
value for the PC.
Figure 4.11. Schematic diagram of a composite datapath for R-format, load/store, and branch
instructions [MK98].
4.3.1.1. ALU Control. Given the simple datapath shown in Figure 4.11, we next add the
control unit. Control accepts inputs (called control signals) and generates (a) a write
signal for each state element, (b) the control signals for each multiplexer, and (c) the ALU
control signal. The ALU has three control signals, as shown in Table 4.1, below.
Table 4.1. ALU control codes
The ALU is used for all instruction classes, and always performs one of the five functions in
the right-hand column of Table 4.1. For branch instructions, the ALU performs a subtraction,
whereas R-format instructions require one of the ALU functions. The ALU is controlled by
two inputs: (1) the opcode from a MIPS instruction (six most significant bits), and (2) a
two-bit control field (which Patterson and Hennessy call ALUop). The ALUop signal denotes
whether the operation should be one of the following:
The output of the ALU control is one of the 3-bit control codes shown in the left-hand column
of Table 4.1. In Table 4.2, we show how to set the ALU output based on the instruction opcode
and the ALUop signals. Later, we will develop a circuit for generating the ALUop bits. We
call this approach multi-level decoding -- main control generates ALUop bits, which are input
to ALU control. The ALU control then generates the three-bit codes shown in Table 4.1.
Recall that we need to map the two-bit ALUop field and the six-bit opcode to a three-bit ALU
control code. Normally, this would require 2^(2 + 6) = 256 possible combinations, eventually
expressed as entries in a truth table. However, only a few opcodes are to be implemented in
the ALU designed herein. Also, the six-bit field affects the ALU control output only when
ALUop = 10 (binary). Thus, we can use simple logic to implement the ALU control, as shown
in terms of the truth table illustrated in Table 4.2.
Table 4.2. ALU control bits as a function of ALUop bits and opcode bits [MK98].
In this table, an "X" in the input column represents a "don't-care" value, which indicates that
the output does not depend on the input at the i-th bit position. The preceding truth table can
be optimized and implemented in terms of gates, as shown in Section C.2 of Appendix C of
the textbook.
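The multi-level decoding idea can be sketched as a small lookup in Python. The specific bit patterns below are assumptions: they follow the encoding commonly given in Patterson and Hennessy (ALUop 00 = add for load/store, 01 = subtract for beq, 10 = decided by the six-bit function code), and they should be checked against Tables 4.1 and 4.2.

```python
# Assumed three-bit ALU control codes (per Patterson and Hennessy's convention).
ALU_CODES = {"and": 0b000, "or": 0b001, "add": 0b010, "sub": 0b110, "slt": 0b111}

# Assumed six-bit function codes for the R-format instructions supported here.
FUNCT_TO_OP = {
    0b100000: "add",
    0b100010: "sub",
    0b100100: "and",
    0b100101: "or",
    0b101010: "slt",
}


def alu_control(alu_op, funct):
    """Map the two-bit ALUop and the six-bit field to a three-bit ALU control code."""
    if alu_op == 0b00:   # load/store: address calculation uses add
        return ALU_CODES["add"]
    if alu_op == 0b01:   # beq: comparison uses subtract
        return ALU_CODES["sub"]
    if alu_op == 0b10:   # R-format: operation chosen by the six-bit field
        return ALU_CODES[FUNCT_TO_OP[funct]]
    raise ValueError("unsupported ALUop value")


print(format(alu_control(0b10, 0b101010), "03b"))
```

Because the six-bit field is a don't-care for ALUop = 00 and 01, only a handful of the 256 input combinations need distinct entries.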
4.3.1.2. Main Control Unit. The first step in designing the main control unit is to identify the
fields of each instruction and the required control lines to implement the datapath shown in
Figure 4.11.
Recalling the three MIPS instruction formats (R, I, and J), shown as follows:
Additionally, we have the following instruction-specific codes due to the regularity of the
MIPS instruction format:
• Bits 25-21: base register for load/store instruction - always at this location
• Bits 15-0: 16-bit offset for branch instruction - always at this location
• Bits 15-11: destination register for R-format instruction - always at this location
• Bits 20-16: destination register for load/store instruction - always at this location
Note that the different positions for the two destination registers implies a selector (i.e., a mux)
to locate the appropriate field for each type of instruction. Given these constraints, we can add
to the simple datapath thus far developed instruction labels and an extra multiplexer for the
WriteReg input of the register file, as shown in Figure 4.12.
Figure 4.12. Schematic diagram of composite datapath for R-format, load/store, and branch
instructions (from Figure 4.11) with control signals and extra multiplexer for WriteReg signal
generation [MK98].
Here, we see the seven-bit control lines (six-bit opcode with one-bit WriteReg signal) together
with the two-bit ALUop control signal, whose actions when asserted or deasserted are given
as follows:
• RegDst
Deasserted: Register destination number for the Write register is taken from bits 20-16
(rt field) of the instruction
Asserted: Register destination number for the Write register is taken from bits 15-11 (rd
field) of the instruction
• RegWrite
Deasserted: No action
Asserted: Register on the WriteRegister input is written with the value on the WriteData
input
• ALUSrc
Deasserted: The second ALU operand is taken from the second register file output
(ReadData 2)
Asserted: The second ALU operand is the sign-extended lower 16 bits of the instruction
• PCSrc
Deasserted: The PC is overwritten by the output of the adder that computes PC + 4
Asserted: The PC is overwritten by the output of the adder that computes the branch target
address
• MemRead
Deasserted: No action
Asserted: Data memory contents designated by address input are present at the
ReadData output
• MemWrite
Deasserted: No action
Asserted: Data memory contents designated by address input are replaced by the value
present at the WriteData input
• MemtoReg
Deasserted: The value present at the register WriteData input is taken from the ALU
Asserted: The value present at the register WriteData input is taken from data memory
Given only the opcode, the control unit can thus set all the control signals except PCSrc, which
is only set if the instruction is beq and the Zero output of the ALU used for comparison is true.
PCSrc is generated by and-ing a Branch signal from the control unit with the Zero signal from
the ALU. Thus, all other control signals, including Branch, can be set based on the opcode bits
alone. The resultant datapath and its signals are shown in detail in Figure 4.13.
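The main control unit can be sketched as a table indexed by opcode. The signal settings and opcode values below are assumptions taken from the convention in Patterson and Hennessy (R-format = 0, lw = 35, sw = 43, beq = 4), including the signal names MemRead and MemtoReg; don't-care entries are arbitrarily set to 0.

```python
# Assumed control settings per opcode; don't-care values are shown as 0.
CONTROL = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
             MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),  # R-format
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),  # lw
    43: dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),  # sw
    4:  dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0,
             MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),  # beq
}


def pc_src(opcode, zero):
    """PCSrc is the Branch signal and-ed with the Zero output of the ALU."""
    return CONTROL[opcode]["Branch"] & zero


print(pc_src(4, 1), pc_src(4, 0), pc_src(35, 1))
```

PCSrc is the only signal that depends on a datapath result (Zero); every other entry is a function of the opcode alone.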
Figure 4.13. Schematic diagram of composite datapath for R-format, load/store, and branch
instructions (from Figure 4.12) with control signals illustrated in detail [MK98].
We next examine functionality of the datapath illustrated in Figure 4.13, for the three major types of
instructions, then discuss how to augment the datapath for a new type of instruction.
4.3.2.1. R-format Instruction. Execution of an R-format instruction (e.g., add $t1, $t0, $t1)
using the datapath developed in Section 4.3.1 involves the following steps:
Note that this implementational sequence is actually combinational, because of the
single-cycle assumption. Since the datapath operates within one clock cycle, the signals stabilize
approximately in the order shown in Steps 1-4, above.
4.3.2.3. Branch Instruction. Execution of a branch instruction (e.g., beq $t1, $t2, offset)
using the datapath developed in Section 4.3.1 involves the following steps:
4.3.2.4. Final Control Design. Now that we have determined the actions that the datapath
must perform to compute the three types of MIPS instructions, we can use the information in
Table 4.3 to describe the control logic in terms of a truth table. This truth table (Table 4.3) is
optimized as shown in Section C.2 of Appendix C of the textbook to yield the datapath control
circuitry.
Table 4.3. ALU control bits as a function of ALUop bits and opcode bits [MK98].
The jump instruction provides a useful example of how to extend the single-cycle datapath
developed in Section 4.3.2, to support new instructions. Jump resembles branch (a conditional
form of the jump instruction), but computes the PC differently and is unconditional. As with
the branch target address, the lowest two bits of the jump target address (JTA) are always
zero, to preserve word alignment. The next 26 bits are taken from a 26-bit immediate field in
the jump instruction (the remaining six bits are reserved for the opcode). The upper four bits
of the JTA are taken from the upper four bits of the next instruction (PC + 4). Thus, the JTA
computed by the jump instruction is formatted as follows:
The jump is implemented in hardware by adding a control circuit to Figure 4.13, which is
comprised of:
• An additional multiplexer, to select the source for the new PC value. To cover all cases,
this source is PC+4, the conditional BTA, or the JTA.
• An additional control signal for the new multiplexer, asserted only for a jump
instruction (opcode = 2).
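The formation of the jump target address can be illustrated with a few bit operations in Python. The function name jump_target is an assumption made for this sketch; addresses and instructions are treated as unsigned 32-bit values.

```python
def jump_target(pc, instruction):
    """Form the 32-bit JTA from PC + 4 and the 26-bit immediate field."""
    imm26 = instruction & 0x03FFFFFF   # low 26 bits of the jump instruction
    upper4 = (pc + 4) & 0xF0000000     # upper four bits of PC + 4
    return upper4 | (imm26 << 2)       # the two lowest bits are always zero


# Example: a jump (opcode = 2) with immediate field 0x100, fetched at 0x40000000.
instruction = (2 << 26) | 0x100
print(hex(jump_target(0x40000000, instruction)))
```

No ALU operation is required: the JTA is produced by wiring (concatenation and a fixed shift), so only the extra multiplexer and its control signal are added.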
The single-cycle datapath is not used in modern processors, because it is inefficient. The
critical path (longest propagation sequence through the datapath) is five components for the
load instruction. The cycle time tc is limited by the settling time ts of these components. For a
circuit with no feedback loops, tc > 5ts. In practice, tc = 5kts, with large proportionality constant
k, due to feedback loops, delayed settling due to circuit noise, etc. Additionally, as shown in
the table on p. 374 of the textbook, it is possible to compute the required execution time for
each instruction class from the critical path information. The result is that the Load instruction
takes 5 units of time, while the Store and R-format instructions take 4 units of time. All the
other types of instructions that the datapath is designed to execute run faster, requiring three
units of time.
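The cost of fixing the cycle time at the slowest instruction can be estimated with a short calculation. The time units per instruction class are those quoted above; the instruction mix is purely hypothetical and is included only to make the arithmetic concrete.

```python
# Time units per instruction class, as quoted in the text above.
UNITS = {"load": 5, "store": 4, "r_format": 4, "branch": 3, "jump": 3}

# Hypothetical instruction mix (fractions sum to 1); not taken from the textbook.
MIX = {"load": 0.25, "store": 0.10, "r_format": 0.45, "branch": 0.15, "jump": 0.05}


def single_cycle_time(units):
    # Every instruction takes one cycle, and the cycle must fit the slowest class.
    return max(units.values())


def ideal_average_time(units, mix):
    # Average time if each class took only as long as it actually needs.
    return sum(units[name] * fraction for name, fraction in mix.items())


fixed = single_cycle_time(UNITS)
ideal = ideal_average_time(UNITS, MIX)
print(fixed, round(ideal, 2), round(fixed / ideal, 2))
```

Under this assumed mix, every instruction is charged the five units needed by a load, although the average requirement is lower. This gap is what motivates the multicycle design.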
We next consider the basic differences between single-cycle and multi-cycle datapaths.
4.3.5.1. Cursory Analysis. Figure 4.15 illustrates a simple multicycle datapath. Observe the
following differences between a single-cycle and multi-cycle datapath:
• In the multicycle datapath, one memory unit stores both instructions and data, whereas
the single-cycle datapath requires separate instruction and data memories.
• The multicycle datapath uses one ALU, versus an ALU and two adders in the
single-cycle datapath, because signals can be rerouted through the ALU in a multicycle
implementation.
• In the single-cycle implementation, the instruction executes in one cycle (by design)
and the outputs of all functional units must stabilize within one cycle. In contrast, the
multicycle implementation uses one or more registers to temporarily store (buffer) the
ALU or functional unit outputs. This buffering action stores a value in a temporary
register until it is needed or used in a subsequent clock cycle.
Figure 4.15. Simple multicycle datapath with buffering registers (Instruction register,
Memory data register, A, B, and ALUout) [MK98].
Note that there are two types of state elements (e.g., memory, registers), which are:
1. Programmer-Visible (register file, PC, or memory), in which data is stored that is used
by subsequent instructions (in a later clock cycle); and
2. Additional State Elements (buffer registers), in which data is stored that is used in a later
clock cycle of the same instruction.
Thus, the additional (buffer) registers determine (a) what functional units will fit into a given
clock cycle and (b) the data required for later cycles involved in executing the current
instruction. In the simple implementation presented herein, we assume for purposes of
illustration that each clock cycle can accommodate one and only one of the following operations:
• Memory access
• Register file access (two reads or one write)
• ALU operation (arithmetic or logical)
4.3.5.2. New Registers. As a result of buffering, data produced by memory, register file, or
ALU is saved for use in a subsequent cycle. The following temporary registers are important
to the multicycle datapath implementation discussed in this section:
• Instruction Register (IR) saves the data output from the Text Segment of memory for a
subsequent instruction read;
• Memory Data Register (MDR) saves memory output for a data read operation;
• A and B Registers (A,B) store ALU operand values read from the register file; and
• ALU Output Register (ALUout) contains the result produced by the ALU.
The IR and MDR are distinct registers because some operations require both instruction and
data in the same clock cycle. Since all registers except the IR hold data only between two
adjacent clock cycles, these registers do not need a write control signal. In contrast, the IR
holds an instruction until it is executed (multiple clock cycles) and therefore requires a write
control signal to protect the instruction from being overwritten before its execution has been
completed.
4.3.5.3. New Muxes. We also need to add new multiplexers and expand existing ones, to
implement sharing of functional units. For example, we need to select the memory
address from either the PC (for instruction fetch) or ALUout (for load/store instructions). The
muxes also route to one ALU the many inputs and outputs that were distributed among the
several ALUs of the single-cycle datapath. Thus, we make the following additional changes to
the single-cycle datapath:
• Add a multiplexer to the first ALU input, to choose between (a) the A register as input
(for R- and I-format instructions) , or (b) the PC as input (for branch instructions).
• On the second ALU input, the value is selected by a four-way mux (two control bits). The
two additional inputs to the mux are (a) the immediate (constant) value 4 for
incrementing the PC and (b) the sign-extended offset, shifted left two bits to preserve
alignment, which is used in computing the branch target address.
The details of these muxes are shown in Figure 4.16. By adding a few registers (buffers) and
muxes (inexpensive widgets), we halve the number of memory units (expensive hardware)
and eliminate two adders (more expensive hardware).
4.3.5.4. New Control Signals. The datapath shown in Figure 4.16 is multicycle, since it uses
multiple cycles per instruction. As a result, it will require different control signals than the
single-cycle datapath, as follows:
It is advantageous that the ALU control from the single-cycle datapath can be used as-is for
the multicycle datapath ALU control. However, some modifications are required to support
branches and jumps. We describe these changes as follows.
4.3.5.5. Branch and Jump Instruction Support. To implement branch and jump
instructions, one of three possible values is written to the PC:
1. ALU output = PC + 4, to get the next instruction during the instruction fetch step (to do
this, PC + 4 is written directly to the PC)
2. Register ALUout, which stores the computed branch target address.
3. Lower 26 bits (offset) of the IR, shifted left by two bits (to preserve alignment) and
concatenated with the upper four bits of PC+4, to form the jump target address.
where (a) ALUZero indicates if two operands of the beq instruction are equal and (b) the result
of (ALUZero and PCWriteCond) determines whether the PC should be written during a
conditional branch. We call the latter the branch taken condition. Figure 4.16 shows the
resultant multicycle datapath and control unit with new muxes and corresponding control
signals. Table 4.4 illustrates the control signals and their functions.
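The selection among the three PC sources, and the branch taken condition, can be sketched in Python as follows. Two points are assumptions rather than statements from the text above: the PCSource encoding (00 = ALU output, 01 = ALUout, 10 = jump target) and the combination of the unconditional and conditional write signals with a logical or. The function name next_pc is hypothetical, and pc is assumed to hold PC + 4 already when the jump target is formed.

```python
def next_pc(pc, alu_result, alu_out, ir, pc_source,
            pc_write, pc_write_cond, alu_zero):
    """Return the PC value after the clock edge in the multicycle datapath."""
    # Jump target: upper four bits of the (already incremented) PC,
    # concatenated with the low 26 bits of the IR shifted left two bits.
    jta = (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
    candidates = {0b00: alu_result, 0b01: alu_out, 0b10: jta}
    # The PC is written unconditionally (pc_write) or when the branch
    # taken condition (alu_zero and pc_write_cond) holds.
    write = pc_write | (pc_write_cond & alu_zero)
    return candidates[pc_source] if write else pc


# beq with equal operands: the PC takes the branch target held in ALUout.
print(hex(next_pc(0x1004, 0x1008, 0x2000, 0x08000010,
                  pc_source=0b01, pc_write=0, pc_write_cond=1, alu_zero=1)))
```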
Given the datapath illustrated in Figure 4.16, we examine instruction execution in each cycle
of the datapath. The implementational goal is balancing of the work performed per clock cycle,
to minimize the average time per cycle across all instructions. For example, each step would
contain one of the following:
• ALU operation
• Register file access (two reads or one write)
• Memory access (one read or one write)
Thus, the cycle time will be equal to the maximum time required for any of the preceding
operations.
Note: Since (a) the datapath is designed to be edge-triggered (reference Section 4.1.1) and (b)
the outputs of ALU, register file, or memory are stored in dedicated registers (buffers), we can
continue to read the value stored in a dedicated register. The new value, output from ALU,
register file, or memory, is not available in the register until the next clock cycle.
Figure 4.16. MIPS multicycle datapath [MK98].
Table 4.4. Multicycle datapath control signals and their functions [MK98].
In the multicycle datapath, all operations within a clock cycle occur in parallel, but successive
steps within a given instruction operate sequentially. Several implementational issues arise
that do not confound this view, but should be discussed. One must distinguish between (a)
reading/writing the PC or one of the buffer registers, and (b) reads/writes to the register file.
Namely, I/O to the PC or buffers is part of one clock cycle, i.e., we get this essentially "for
free" because of the clocking scheme and hardware design. In contrast, the register file has
more complex hardware (as shown in Section 4.1.2) and requires a dedicated clock cycle for
its circuitry to stabilize.
4.3.6.1. Instruction Fetch. In this first cycle that is common to all instructions, the datapath
fetches an instruction from memory and computes the new PC (address of next instruction in
the program sequence), as represented by the following pseudocode:
The PC is sent (via control circuitry) as an address to memory. The memory hardware performs
a read operation and control hardware transfers the instruction at Memory[PC] into the IR,
where it is stored until the next instruction is fetched. Then, the ALU increments the PC by
four to preserve word alignment. The incremented (new) PC value is stored back into the PC
register by setting PCSource = 00 and asserting PCWrite. Fortunately, incrementing the PC
and performing the memory read are concurrent operations, since the new PC is not required
(at the earliest) until the next clock cycle.
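The fetch step can be modeled in a few lines of Python. This is a sketch under assumptions: the processor state is a dictionary with keys "PC" and "IR", memory is a dictionary keyed by byte address, and the function name instruction_fetch is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def instruction_fetch(state, memory):
    """First cycle of every instruction: load the IR and increment the PC."""
    # Both values are computed from the old PC during the cycle ...
    fetched = memory[state["PC"]]             # memory read at address PC
    incremented = (state["PC"] + 4) & MASK32  # ALU computes PC + 4 concurrently
    # ... and are written together at the clock edge that ends the cycle.
    state["IR"] = fetched
    state["PC"] = incremented


state = {"PC": 0x1000, "IR": 0}
memory = {0x1000: 0x012A4820}  # example instruction word at address 0x1000
instruction_fetch(state, memory)
print(hex(state["IR"]), hex(state["PC"]))
```

Computing both results before either register is updated mirrors the edge-triggered behavior: the memory read sees the old PC even though the PC is overwritten in the same cycle.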
Reading Assignment: The exact sequence of operations is described on p. 385 of the textbook.
4.3.6.2. Instruction Decode and Data Fetch. Included in the multicycle datapath design is
the assumption that the actual opcode to be executed is not known prior to the instruction
decode step. This is reasonable, since the new instruction is not yet available until completion
of instruction fetch and has thus not been decoded.
As a result of not knowing what operation the ALU is to perform in the current instruction,
the datapath must execute only actions that are:
Therefore, given the rs and rt fields of the MIPS instruction format (per Figure 2.7), we can
suppose (harmlessly) that the next instruction will be R-format. We can thus read the operands
corresponding to rs and rt from the register file. If we don't need one or both of these operands,
that is not harmful. Otherwise, the register file read operation will place them in buffer
registers A and B, which is also not harmful.
Another action the datapath can perform is computation of the branch target address using the
ALU, since this is the instruction decode step and the ALU is not yet needed for instruction
execution. If the instruction that we are decoding in this step is not a branch, then no harm is
done - the BTA is stored in ALUout and nothing further happens to it.
We can perform these preparatory actions because of the regularity of MIPS
instruction formats. The result is represented in pseudocode, as follows:
Reading Assignment: The exact sequence of low-level operations is described on p.384 of the
textbook.
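The decode step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the textbook's pseudocode: the helper names sign_extend16 and decode_and_fetch, the list-based register file, and the sample beq encoding are all hypothetical. The field positions (rs in bits 25-21, rt in bits 20-16, offset in bits 15-0) follow the standard MIPS format.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def decode_and_fetch(ir, pc, regs):
    """Read rs and rt into A and B, and speculatively compute the BTA into ALUout."""
    rs = (ir >> 21) & 0x1F          # IR[25:21]
    rt = (ir >> 16) & 0x1F          # IR[20:16]
    a = regs[rs]                    # harmless even if the operand is never used
    b = regs[rt]
    # pc is the already-incremented PC from instruction fetch.
    aluout = (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF
    return a, b, aluout

regs = [0] * 32
regs[8], regs[9] = 5, 5             # hypothetical contents of $t0 and $t1
# Hypothetical word: beq $t0, $t1, -2 (opcode 4, rs 8, rt 9, offset 0xFFFE)
a, b, aluout = decode_and_fetch(0x1109FFFE, 0x00400008, regs)
print(a, b, hex(aluout))
```

If the instruction turns out not to be a branch, the value left in ALUout is simply ignored, which is why computing it here is safe.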
The ALU constructs the memory address from the base address (stored in A) and the
offset (taken from the low 16 bits of the IR). Control signals are set as described on p.
387 of the textbook.
The ALU takes its inputs from buffer registers A and B and computes a result according
to control signals specified by the instruction opcode, function field, and control
signals ALUop = 10. The control signals are further described on p. 387 of the textbook.
In branch instructions, the ALU performs the comparison between the contents of
registers A and B. If A = B, then the Zero output of the ALU is asserted and the PC is
overwritten with the BTA computed in the preceding step (per Section 4.3.6.2), which
is held in ALUout. If the branch is not taken, then the PC+4 value
computed during instruction fetch (per Section 4.3.6.1) is used. This covers all
possibilities, since the next instruction address is always the value most recently
written into the PC. Salient hardware control actions are discussed on p. 387 of the textbook.
Here, the PC is replaced by the jump target address, which does not need the ALU to be
computed, but can be formed in hardware as described on p. 387 of the textbook.
Reading Assignment: The control actions for load/store instructions are discussed on p.388 of
the textbook.
For an R-format completion, where
the data to be loaded was stored in the MDR in the previous cycle and is thus available for this
cycle. The rt field of the MIPS instruction format (Bits 20-16) has the register number, which
is applied to the input of the register file, together with RegDst = 0 and an asserted RegWrite
signal.
Reading Assignment: The exact sequence of operations is described on p.385 of the textbook.
From the preceding sequences as well as their discussion in the textbook, we are prepared to
design a finite-state controller, as shown in the following section.
In the single-cycle datapath control, we designed control hardware using a set of truth tables
based on control signals activated for each instruction class. However, this approach must be
modified for the multicycle datapath, which has the additional dimension of time due to the
stepwise execution of instructions. Thus, the multicycle datapath control is dependent on the
current step involved in executing an instruction, as well as the next step.
There are two alternative techniques for implementing multicycle datapath control. First,
a finite-state machine (FSM) or finite state control (FSC) predicts actions appropriate for
the datapath's next computational step. This prediction is based on (a) the status and control
information specific to the datapath's current step and (b) actions to be performed in the next
step. A second technique, called microprogramming, uses a programmatic representation to
implement control, as discussed in Section 4.5. Appendix C of the textbook shows how these
representations are translated into hardware.
An FSM consists of a set of states with directions that tell the FSM how to change states. The
following features are important:
Implementationally, we assume that all outputs not explicitly asserted are deasserted.
Additionally, all multiplexer controls are explicitly specified if and only if they pertain to the
current and next states. A simple example of an FSM is given in Appendix B of the textbook.
The FSC is designed for the multicycle datapath by considering the five steps of instruction
execution given in Section 4.3, namely:
1. Instruction fetch
2. Instruction decode and data fetch
3. ALU operation
4. Memory access or R-format instruction completion
5. Memory access completion
Each of these steps takes one cycle, by definition of the multicycle datapath. Also, each step
stores its results in temporary (buffer) registers such as the IR, MDR, A, B, and ALUout. Each
state in the FSM will thus (a) occupy one cycle in time, and (b) store its results in a temporary
(buffer) register.
From the discussion of Section 4.3, observe that Steps 1 and 2 are identical for every
instruction, but Steps 3-5 differ, depending on instruction format. Also note that after
completion of an instruction, the FSC returns to its initial state (Step 1) to fetch another
instruction, as shown in Figure 4.17.
Figure 4.17. High-level (abstract) representation of finite-state machine for the multicycle
datapath finite-state control. Figure numbers refer to figures in the textbook [Pat98,MK98].
Let us begin our discussion of the FSC by expanding steps 1 and 2, where State 0 (the initial
state) corresponds to Step 1.
4.4.2.1. Instruction Fetch and Decode. Figure 4.18 shows the FSM representation for
instruction fetch and decode. The control signals asserted in each state are shown within the
circle that denotes a given state. The edges (lines or arrows) between states are labelled with
the conditions that must be fulfilled for the illustrated transition between states to occur.
Patterson and Hennessy call the process of branching to different states decoding, which
depends on the instruction class after State 1 (i.e., Step 2, as listed above).
Figure 4.18. Representation of finite-state control for the instruction fetch and decode states
of the multicycle datapath. Figure numbers refer to figures in the textbook [Pat98,MK98].
4.4.2.2. Memory Reference. The memory reference portion of the FSC is shown in Figure
4.19. Here, State 2 computes the memory address by setting ALU input muxes to pass the A
register (base address) and the sign-extended lower 16 bits of the IR (the offset, which is
not shifted for memory references) to the ALU. After address computation, the memory
access uses one of two states:
• State 3: Performs memory access by asserting the MemRead signal, putting memory
output into the MDR.
• State 5: Activated if sw (store word) instruction is used, and MemWrite is asserted.
In both states, the memory address is forced to equal ALUout, by setting the control signal IorD = 1.
Figure 4.19. Representation of finite-state control for the memory reference states of the
multicycle datapath. Figure numbers refer to figures in the textbook [Pat98,MK98].
When State 5 completes, control is transferred to State 0. Otherwise, State 3 completes and
the datapath must finish the load operation, which is accomplished by transferring control to
State 4. There, MemtoReg = 1, RegDst = 0, and the MDR contents are written to the register
file. The next state is State 0.
4.4.2.3. R-format Execution. To implement R-format instructions, the FSC uses two states, one
for execution (Step 3) and another for R-format completion (Step 4), per Figure 4.20. State 6
asserts ALUSrcA and sets ALUSrcB = 00, which loads the ALU's A and B input registers
from register file outputs. The ALUop = 10 setting causes the ALU control to use the
instruction's funct field to set the ALU control signals to implement the designated ALU
operation.
State 7 causes (a) the register file to write (assert RegWrite), (b) rd field of the instruction to
have the number of the destination register (assert RegDst), and (c) ALUout selected as having
the value that must be written back to the register file as the result of the ALU operation (by
deasserting MemtoReg).
Figure 4.20. Representation of finite-state control for the R-format instruction execution
states of the multicycle datapath. Figure numbers refer to figures in the textbook
[Pat98,MK98].
4.4.2.4. Branch Control. Since branches complete during Step 3, only one new state is
needed. In State 8, (a) control signals that cause the ALU to compare the contents of its A and
B input registers are set (i.e., ALUSrcA = 1, ALUSrcB = 00, ALUop = 01), and (b) the PC is
written conditionally (by setting PCSrc = 01 and asserting PCWriteCond). Note that setting
ALUop = 01 forces a subtraction, hence only the beq instruction can be implemented this way.
Figure 4.21. Representation of finite-state control for (a) branch and (b) jump instruction-
specific states of the multicycle datapath. Figure numbers refer to figures in the textbook
[Pat98,MK98].
4.4.2.5. Jump Instruction. Similar to branch, the jump instruction requires only one state (#9)
to complete execution. Here, the PC is written by asserting PCWrite. The value written to the
PC is the concatenation of the upper four bits of the PC, the lower 26 bits of the IR, and two
low-order bits equal to 00 (binary). This is done by setting PCSrc = 10 (binary).
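The formation of the jump target address can be illustrated with a few bit operations. The sketch below is only an illustration; the function name jump_target and the sample values are hypothetical, while the bit layout (upper four PC bits, 26-bit field from the IR, two zero bits) is the one described above.

```python
def jump_target(pc, ir):
    """Form the jump target: PC[31:28] concatenated with IR[25:0] and binary 00."""
    upper = pc & 0xF0000000         # upper four bits of the (incremented) PC
    index = ir & 0x03FFFFFF         # lower 26 bits of the IR
    return upper | (index << 2)     # shifting left by two appends the 00 bits

# Hypothetical word: j 0x00400020 (opcode 2, 26-bit field 0x0100008)
print(hex(jump_target(0x00400004, 0x08100008)))
```

No addition is involved, only masking and shifting, which is consistent with the statement that the ALU is not needed to compute this address.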
The composite FSC is shown in Figure 4.22, which was constructed by composing Figures
4.18 through 4.21.
Figure 4.22. Representation of the composite finite-state control for the MIPS multicycle
datapath [MK98].
When computing the performance of the multicycle datapath, we use this FSM representation
to determine the critical path (maximum number of states encountered) for each instruction
type, with the following results:
• Load: 5 states
• Store: 4 states
• R-format ALU instructions: 4 states
• Branch: 3 states
• Jump: 3 states
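These state counts can be checked with a small software model of the controller. The sketch below is a simplified illustration, assuming the state numbering used in Section 4.4.2 (States 0 through 9); the function names and the instruction-class strings are hypothetical, and no control signals are modeled.

```python
def next_state(state, op):
    """Next-state function for the ten-state controller (transitions only)."""
    if state == 0:                  # instruction fetch
        return 1
    if state == 1:                  # decode: branch on instruction class
        return {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}[op]
    if state == 2:                  # memory address computation
        return 3 if op == "lw" else 5
    if state == 3:                  # memory read, then load write-back
        return 4
    if state == 6:                  # R-format execution, then completion
        return 7
    return 0                        # States 4, 5, 7, 8, 9 end the instruction

def states_visited(op):
    """List the states one instruction passes through, starting at State 0."""
    path, state = [0], 0
    while True:
        state = next_state(state, op)
        if state == 0:
            return path
        path.append(state)

for op in ("lw", "sw", "rtype", "beq", "j"):
    print(op, states_visited(op), len(states_visited(op)))
```

The length of each path is the number of clock cycles the instruction class requires, since each state occupies one cycle.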
Since each state corresponds to a clock cycle (according to the design assumption of the FSC
controller in Section 4.4.2), we have the following expression for CPI of the multicycle
datapath:
Reading Assignment: Know in detail the example computation of CPI for the multicycle
datapath, beginning on p.397 of the textbook.
The textbook example shows CPI for the gcc benchmark is 4.02, a savings of approximately
20 percent over the worst-case CPI (equal to 5 cycles for all instructions, based on the single-
cycle datapath design constraint that all instructions run at the speed of the slowest).
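The CPI computation is a weighted average: each instruction class contributes its fraction of the instruction mix multiplied by its cycle count. The sketch below illustrates the arithmetic only. The cycle counts are the state counts listed above, but the instruction mix is a made-up example, not the gcc mix from the textbook, so the result should not be compared with the textbook figure.

```python
# Cycles per instruction class, taken from the state counts of the FSC.
CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3, "j": 3}

# HYPOTHETICAL instruction mix (fractions sum to 1); not the textbook's gcc data.
MIX = {"lw": 0.25, "sw": 0.10, "rtype": 0.45, "beq": 0.15, "j": 0.05}

def multicycle_cpi(mix, cycles):
    """CPI = sum over classes of (fraction of instructions * cycles for the class)."""
    return sum(fraction * cycles[op] for op, fraction in mix.items())

cpi = multicycle_cpi(MIX, CYCLES)
savings = 1.0 - cpi / max(CYCLES.values())   # relative to 5 cycles for everything
print(round(cpi, 2), round(savings, 3))
```

The comparison against max(CYCLES.values()) mirrors the worst-case argument in the text, in which every instruction is charged the cycle count of the slowest class.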
The FSC can be implemented in hardware using a read-only memory (ROM) or programmable
logic array (PLA), as discussed in Section C.3 of the textbook. Combinatorial logic
implements the transition function and a state register stores the current state of the machine
(e.g., States 0 through 9 in the development of Section 4.4.2). The inputs are the IR opcode
bits, and the outputs are the various datapath control signals (e.g., PCSrc, ALUop, etc.)
We next consider how the preceding function can be implemented using the technique
of microprogramming.
While the finite state control for the multicycle datapath was relatively easy to design, the
graphical approach shown in Section 4.4 is limited to small control systems. We implemented
only five MIPS instruction types, but the actual MIPS instruction set has over 100 different
instructions. Recall that the FSC of Section 4.4 required 10 states for only five instruction
types, and had CPI ranging from three to five. Now, observe that MIPS has not only 100
instructions, but CPI ranging from one to 20 cycles. A control system for a realistic instruction
set (even if it is RISC) would have hundreds or thousands of states, which could not be
represented conveniently using the graphical technique of Section 4.4.
The implementation of each microinstruction should, therefore, make each field specify a set
of nonoverlapping values. Signals that are never asserted concurrently can thus share the same
field. Table 4.5 illustrates how this is realized in MIPS, using seven fields. The first six fields
control the datapath, while the last field controls the microinstruction sequencing (deciding
which microinstruction will be executed next).
In the current subset of MIPS whose multicycle datapath we have been implementing,
we need two dispatch tables, one each for State 1 and State 2. The use of a dispatch
table numbered i is indicated in the microinstruction by putting Dispatch i in
the Sequencing field.
Table 4.6 summarizes the allowable values for each field of the microinstruction and the effect
of each value.
Field name         Value          Effect
PCWrite control    ALU            Write the output of the ALU into the PC register
                   ALUout-cond    If the ALU's Zero output is high, write the contents of ALUout into the PC register
                   Jump address   Write the PC with the jump address from the instruction
In this section, we use the fetch-decode-execute sequence that we developed for the multicycle
datapath to design the microprogrammed control. First, we observe that sometimes a
microinstruction might have a blank field. This is permitted when:
• A field that controls a functional unit (e.g., ALU, register file, memory) or causes state
information to be written (e.g., ALU dest field), when blank, implies that no control
signals should be asserted.
• A field that only specifies control of an input multiplexer for a functional unit, when
left blank, implies that the datapath does not care about what value the output of the
mux has.
4.5.2.1. Instruction Fetch and Decode, Data Fetch. Each instruction execution first fetches
the instruction, decodes it, and computes both the sequential PC and branch target PC (if
applicable). The two microinstructions are given by:
• ALU control, SRC1, and SRC2 are set to compute PC+4, which is written to ALUout.
The memory field reads the instruction at address equal to PC, and stores the instruction
in the IR. The PCWrite control causes the ALU output (PC + 4) to be written into the
PC, while the Sequencing field tells control to go to the next microinstruction.
• The label field (value = fetch) will be used to transfer control in the next Sequencing
field when execution of the next instruction begins.
• ALU control, SRC1, and SRC2 are set to store the PC plus the sign-extended, shifted
IR[15:0] into ALUout. Register control causes data referenced by the rs and rt fields to
be placed in ALU input registers A and B, while the Sequencing field tells control to
use dispatch table 1 for the next microinstruction address.
4.5.2.2. Dispatch Tables. Patterson and Hennessy consider the dispatch table as
a case statement that uses the opcode field and dispatch table i to select one of N_i different
labels. For example, in Dispatch Table #1 (i = 1, N_i = 4) we have label Mem1 for memory reference
instructions, Rformat1 for arithmetic and logical instructions, Beq1 for conditional branches,
and Jump1 for unconditional branches. Each of these labels points to a different
microinstruction sequence that can be thought of as a kind of subprogram. Each microcode
sequence can be thought of as comprising a small utility that implements the desired capability
of specifying hardware control signals.
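The case-statement view of Dispatch Table #1 can be sketched as a lookup keyed by the opcode field. This is an illustration rather than the hardware implementation: the dictionary and function names are hypothetical, while the labels are those named above and the opcode values are the standard MIPS encodings for lw, sw, R-format, beq, and j.

```python
# Dispatch Table #1 modeled as a mapping from opcode to microcode label.
DISPATCH_1 = {
    0x23: "Mem1",       # lw
    0x2B: "Mem1",       # sw
    0x00: "Rformat1",   # R-format (arithmetic and logical)
    0x04: "Beq1",       # beq
    0x02: "Jump1",      # j
}

def dispatch1(ir):
    """Select the next microinstruction label from the opcode in IR[31:26]."""
    opcode = (ir >> 26) & 0x3F
    return DISPATCH_1[opcode]

# Hypothetical instruction words: lw, beq, j
for word in (0x8D090004, 0x1109FFFE, 0x08100008):
    print(hex(word), dispatch1(word))
```

Five opcodes map onto four distinct labels, which matches N_i = 4 for this table; an opcode outside the table raises KeyError in this sketch, whereas real control must define what happens in that case.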
The details of each microinstruction are given on pp. 405-406 of the textbook.
4.5.2.5. Branch and Jump Execution. Since we assume that the preceding microinstruction
computed the BTA, the microprogram for a conditional branch requires only the following
microinstruction:
Here, we have added the SW2 microinstruction to illustrate the final step of the store
instruction.
Observe that these ten microinstructions correspond directly to the ten states of the finite-state
control developed in Section 4.4. In more complex machines, microprogram control can
comprise tens or hundreds of thousands of microinstructions, with special-purpose registers
used to store intermediate data.
It is interesting to note that this is how microprogramming actually got started, by making the
ROM and counter very fast. This represented a great advance over using slower main memory
for microprogram storage. Today, however, advances in cache technology make a separate
microprogram memory an obsolete development, as it is easier to store the microprogram in
main memory and page the parts of it that are needed into cache, where retrieval is fast and
uses no extra hardware.
As if control design were not hard enough, we also have to deal with the very difficult problem of
implementing exceptions and interrupts, which are defined as follows:
In this section, we discuss control design required to handle two types of exceptions: (1) an
undefined instruction, and (2) arithmetic overflow. These exceptions are germane to the small
language (five instructions) whose implementation we have been exploring thus far.
If program execution is to continue after the exception is detected and handled, then the EPC
register helps determine where to restart the program. For example, the exception-causing
instruction can be repeated but in a way that does not cause an exception. Alternatively, the
next instruction can be executed (in MIPS, this instruction's address is $epc + 4).
For the OS to handle the exception, one of two techniques is employed. First, the machine
can have Cause and EPC registers, which contain codes that respectively represent the cause
of the exception and the address of the exception-causing instruction. A second method
uses vectored interrupts, where the address to which control is transferred following the
exception is determined by the cause of the exception. If vectored interrupts are not employed,
control is transferred to one address only, regardless of cause. Then, the cause is used to
determine what action the exception handling routine should take.
4.5.4.2. Hardware Support. MIPS uses non-vectored exceptions: control is transferred to a
single address, and the Cause register identifies the exception type. To
support this capability in the datapath that we have been developing in this section, we need
to add the following two registers:
• EPC: 32-bit register holds the address of the exception-causing instruction, and
• Cause: 32-bit register contains a binary code that describes the cause or type of
exception.
Two additional control signals are needed: EPCWrite and CauseWrite, which write the
appropriate information to the EPC and Cause registers. Also required in this particular
implementation is a 1-bit signal to set the LSB of Cause to be 0 for an undefined instruction,
or 1 for arithmetic overflow. Of further use is an address AE that points to the exception
handling routine to which control is transferred. In MIPS, we assume that AE = C0000000 (hexadecimal).
In the previous datapath developed through Section 4.4, the PC input is taken from a four-way
mux that has three inputs defined, which are: PC+4, BTA, and JTA. Without adding control
lines, we can add a fourth possible input to the PC, namely AE, which is written to the PC by
setting PCSource = 11 (binary).
Unfortunately, we cannot simply write the PC into the EPC, since the PC is incremented at
instruction fetch (Step 1 of the multicycle datapath) instead of instruction execution (Step 3)
when the exception actually occurs. Thus, when an exception is detected, the ALU must
subtract 4 from the PC and the ALUout register contents must be written to the EPC. It is
fortunate that this requires no additional control signals or lines in this particular datapath
design, since 4 is already a selectable ALU input (used for incrementing the PC during
instruction fetch, and selected via the ALUSrcB control signal).
Hardware support for the datapath modifications needed to implement exception handling in
the simple case illustrated in this section is shown in Figure 4.23. In the finite-state diagrams
of Figures 4.24 and 4.25, we see that each of the preceding two types of exceptions can be
handled using one state each. For each exception type, the state actions are: (1) set
the Cause register contents to reflect exception type, (2) compute and save PC-4 into the EPC
to make available the return address, and (3) write the address AE to the PC so control can be
transferred to the exception handler. To update the finite-state control (FSC) diagram of Figure
4.22, we need to add the two states shown in Figure 4.24.
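The three state actions can be sketched as register transfers in Python. This is a minimal illustration, assuming a dictionary-based processor state; the function and constant names are hypothetical, while the cause codes (0 for undefined instruction, 1 for arithmetic overflow) and the handler address C0000000 (hexadecimal) come from the discussion above.

```python
EXCEPTION_HANDLER = 0xC0000000      # the address AE assumed in the text
UNDEFINED_INSTRUCTION = 0           # LSB of Cause for an undefined instruction
ARITHMETIC_OVERFLOW = 1             # LSB of Cause for arithmetic overflow

def take_exception(state, cause):
    """Apply the three actions of an exception state to the processor state."""
    state["Cause"] = cause                          # (1) record the exception type
    state["EPC"] = (state["PC"] - 4) & 0xFFFFFFFF   # (2) PC was already incremented
    state["PC"] = EXCEPTION_HANDLER                 # (3) transfer control to handler
    return state

# Hypothetical situation: the instruction at 0x00400004 overflows, so PC holds 0x00400008.
s = take_exception({"PC": 0x00400008, "EPC": 0, "Cause": 0}, ARITHMETIC_OVERFLOW)
print(hex(s["EPC"]), hex(s["PC"]), s["Cause"])
```

The subtraction in step (2) is what recovers the address of the offending instruction, because the PC was incremented during instruction fetch.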
Figure 4.23. Representation of the composite datapath architecture and control for the MIPS
multicycle datapath, with provision for exception handling [MK98].
Thus far, we have discussed exceptions and how to handle them, and have illustrated the
requirements of hardware support in the multicycle datapath developed in this section. In the
following section, we complete this discussion with an overview of the necessary steps in
exception detection.
4.5.4.3. Exception Detection. Each of the two possible exception types in our example MIPS
multicycle datapath is detected differently, as follows:
• Undefined Instruction: Finite state control must be modified to define the next-state
value as 10 (the eleventh state of our control FSM) for all operation types other than the
five that are allowed (i.e., lw, sw, beq, jump, and R-format). In the FSM diagram of
Figure 4.25, this is shown as other.
• Arithmetic Overflow: Recall that an ALU can be designed to include overflow detection
logic with a signal output from the ALU called overflow, which is asserted if overflow
is detected. This is used to specify the next state for State 7 in the FSM of Figure 4.25.
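The two detection rules amount to small changes in the controller's next-state function, sketched below. This is an assumption-laden illustration: the function names and instruction-class strings are hypothetical, State 10 is the undefined-instruction state named above, and the number 11 for the overflow state is an assumption made here, since the text does not give it.

```python
UNDEFINED_STATE = 10    # next state for any opcode outside the five allowed classes
OVERFLOW_STATE = 11     # ASSUMED number for the overflow state; not given in the text

# Normal successors of State 1 (decode) for the five implemented classes.
AFTER_DECODE = {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}

def next_state_after_decode(op):
    """Any operation type not in the table is routed to the undefined-instruction state."""
    return AFTER_DECODE.get(op, UNDEFINED_STATE)

def next_state_after_state7(overflow):
    """State 7 (R-format completion) checks the ALU's overflow output."""
    return OVERFLOW_STATE if overflow else 0

print(next_state_after_decode("mult"), next_state_after_state7(True))
```

The default argument of dict.get plays the role of the "other" edge in the FSM diagram: every opcode without an explicit transition falls through to the exception state.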
Figure 4.24. Representation of the finite-state models for two types of exceptions in the MIPS
multicycle datapath [MK98].
Figure 4.25. Representation of the composite finite-state control for the MIPS multicycle
datapath, including exception handling [MK98].