Project
Ken Choi
Case Study for 32-bit Pipelined CPU design with New ALU Architecture
Project Policy: This final project will be done individually.
Copying source code or reports will strictly result in ECE disciplinary action.
1. Introduction
This handout describes the final project for ECE 429. The objective of this project is to
understand a 32-bit pipelined Central Processing Unit (CPU). As the name of the design
indicates, the word length of the data used in the circuit is 32 bits. Furthermore, since the
circuit is pipelined, more than one instruction can be in execution at the same time. The operation of
the circuit is synchronized by an externally supplied clock signal. The instruction signals for
addressing the memory file, selecting the Arithmetic Logic Unit (ALU) operands, and specifying
the operation of the ALU are also external. Correctly synchronizing those signals with the
critical data path delay of the circuit, which determines the minimum clock period, is one of
the objectives of the project.
2. Circuit Description
An overview of the primary building blocks and signals of the CPU is shown in Fig. 1. As shown
in Fig. 1, the primary building blocks are the memory file and the ALU. The external clock signal
synchronizes the capture and release of data within the memory file block. The circuit is
pipelined and each instruction is executed in two clock cycles. In the first clock cycle, the
two decoders decode the external address selection signals that specify which
contents of the memory file are read at the memory ports in each clock cycle.
Additionally, multiplexer blocks select the operands for the ALU. In the next clock
cycle, the ALU executes the specified operation. The ALU results can be read from outside
the CPU through a tri-state buffer, depending on the value of the externally specified OEN
(output enable) signal. Finally, the ALU results can be written back to the memory file at the
word specified by Address B.
3. Memory File
The memory file of this design stores 32 words of 32 bits each. The memory file has two read
ports and one write port. The words to be read in each clock cycle are specified by the
external 5-bit addresses Address A and Address B. The internal configuration of the memory file is
illustrated below in Fig. 2.
As illustrated in Fig. 2, the primary storage element within the memory file is a D-register. The
output of each D-register is connected to the two output ports of the memory file through
tri-state buffers. The tri-state buffers are enabled by the decoded addresses (signals A and B), so
the contents of the selected D-registers appear at output ports A and B respectively.
Furthermore, the value of Address B specifies the word address in the memory file where the
result of the ALU computation is stored in the second cycle of instruction execution. The
writing of the ALU result to the memory file is synchronized by the clock signal.
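The behavior described above can be sketched as a small Python model. This is only a behavioral illustration, not the provided Verilog: the class and method names are hypothetical, reads are modeled as combinational, and a write is modeled as one call per positive clock edge.

```python
class MemoryFile:
    """Behavioral sketch: 32 words of 32 bits, two read ports, one write port."""

    WORDS = 32
    MASK32 = 0xFFFF_FFFF

    def __init__(self):
        # One 32-bit D-register per word, all cleared (initial value is an assumption).
        self.regs = [0] * self.WORDS

    def read(self, addr_a, addr_b):
        # The 5-bit addresses select which registers drive output ports A and B.
        return self.regs[addr_a & 0x1F], self.regs[addr_b & 0x1F]

    def clock_edge(self, addr_b, alu_result):
        # On a positive clock edge the ALU result is stored at Address B.
        self.regs[addr_b & 0x1F] = alu_result & self.MASK32


mf = MemoryFile()
mf.clock_edge(3, 0xAAAA_AAAA)
port_a, port_b = mf.read(3, 0)
```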
As illustrated in Fig. 1, the operands of the ALU are selected as follows:
- Operand A: Operand A can be selected between the read port A of the memory file and the
externally defined data in. The selection is done by the external signal ASEL.
- Operand B: Operand B can be selected between the read port B of the memory file and the
logic zero value. The selection is done by the external signal BSEL.
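The two operand multiplexers can be sketched as follows. The handout does not state the polarity of ASEL and BSEL, so the encoding below (1 selects the external or constant input) is an assumption, and the function name is hypothetical.

```python
def select_operands(port_a, port_b, data_in, asel, bsel):
    """Return the (operand A, operand B) pair seen by the ALU.

    Assumed polarity (not given in the handout):
      asel = 1 selects the external data in, asel = 0 selects read port A
      bsel = 1 selects logic zero,           bsel = 0 selects read port B
    """
    operand_a = data_in if asel else port_a
    operand_b = 0 if bsel else port_b
    return operand_a, operand_b


# Both operands taken from the memory file read ports.
from_memory = select_operands(0x5, 0xA, 0xFFFF, asel=0, bsel=0)
# External value paired with zero.
from_outside = select_operands(0x5, 0xA, 0xFFFF, asel=1, bsel=1)
```

Under these assumptions, selecting data in for operand A and zero for operand B lets an addition pass the external value through the ALU unchanged.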
ECE429 Fall 2014 Prof. Ken Choi
Internally, the ALU has three primary operation blocks: the multiplier, the adder, and the
logic function block. These blocks are illustrated in Fig. 3. The multiplier can be implemented
as a 32-by-32 array-based multiplier. Note,
however, that the result of the multiplication operation is 64 bits wide. Therefore, storing
the multiplication result back to the memory file takes two clock cycles, and the
multiplication instruction is executed in 3 clock cycles. This is made possible by pipelining the
multiplier unit in order to produce the 32 least significant bits (LSB) of the result in one clock
cycle and the following 32 most significant bits (MSB) in the next cycle. Note, however, that
the instruction immediately following a multiplication operation should select the MSB of the
multiplier at the output of the ALU. It should also specify the storage address of the MSB bits in
the memory file.
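Because the 64-bit product has to be written back as 32-bit memory words, it is split into a lower and an upper word. The sketch below illustrates only that split. It assumes unsigned operands, and the function name is hypothetical.

```python
MASK32 = 0xFFFF_FFFF


def multiply_split(a, b):
    """Multiply two unsigned 32-bit values and split the 64-bit product."""
    product = (a & MASK32) * (b & MASK32)
    lsb_word = product & MASK32          # written back first
    msb_word = (product >> 32) & MASK32  # selected by the following instruction
    return lsb_word, msb_word


small = multiply_split(200, 300)
large = multiply_split(0xFFFF_FFFF, 0xFFFF_FFFF)
```

Here `small` fits entirely in the lower word, while `large` needs both words, which is why the instruction after a multiplication must select and store the MSB word.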
The adder circuit within the ALU is a 32-bit adder/subtractor circuit. It executes the addition
and subtraction operations. The selection of the operation is done by the two externally defined
operation select signals OPSEL. The same signals are used for specifying the operation
executed within the logic function block of the ALU. The final output of the ALU is specified by
the output select OUTSEL signals that control the final 4-to-1 multiplexer within the ALU.
Finally, the ALU generates a control output signal that can be used outside the CPU:
- Adder overflow - the signal is 1 if there is an adder overflow.
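A behavioral sketch of the 32-bit adder/subtractor is given below. The handout does not define the overflow condition or the OPSEL encoding, so the sketch assumes two's complement (signed) overflow and uses a hypothetical `subtract` flag in place of OPSEL. Subtraction is modeled in the usual way, as A plus the inverted B with a carry-in of 1.

```python
MASK32 = 0xFFFF_FFFF


def add_sub(a, b, subtract=False):
    """Return (32-bit result, overflow flag) for a + b or a - b."""
    a &= MASK32
    b &= MASK32
    if subtract:
        b = ~b & MASK32          # invert B; the +1 comes from the carry-in
    carry_in = 1 if subtract else 0
    result = (a + b + carry_in) & MASK32
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    # Assumed definition: overflow when both adder inputs share a sign
    # and the result sign differs from it.
    overflow = int(sign_a == sign_b and sign_r != sign_a)
    return result, overflow


no_overflow = add_sub(0x0000_0005, 0x0000_000A)
with_overflow = add_sub(0x7FFF_FFFF, 0x0000_0001)
```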
5. Synchronization
To better understand the circuit synchronization sequence described below, please refer to Fig. 1.
The operation of the CPU is synchronized by the external clock signal.
The results of the ALU will be available at the output of the circuit if the OEN signal is set, and they will also be
written back to the memory file. The address in memory where the ALU result is written is
specified by the Address B value. The data will be written to that word at the next positive edge of
the clock signal.
The control signals are set after a positive edge of the clock signal and should not change
before the next positive edge. The minimum period of the clock signal is determined by the longest
(critical) data path within the circuit.
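The relation between the longest path and the clock can be illustrated with a short calculation. The path names and delay values below are made up for illustration, and the sketch ignores register setup time and clock skew, which a real timing report accounts for.

```python
def min_clock_period_ns(path_delays_ns):
    """The clock period must cover the slowest register-to-register path."""
    return max(path_delays_ns.values())


def max_frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


# Hypothetical path delays in nanoseconds (not taken from any timing report).
delays = {"decode_and_read": 4.0, "alu_execute": 12.5, "write_back": 3.0}
period = min_clock_period_ns(delays)
frequency = max_frequency_mhz(period)
print(period, frequency)
```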
Case Study-1
32-bit CPU design with Different Adders - Carry Ripple Adder, Carry
Lookahead Adder, Carry Skip Adder, and Carry Select Adder
(Source codes are provided)
6. Introduction: Case Study-1
We provide the source Verilog code for the CPU design with a Carry Ripple Adder
(cpu_CRA.v), Carry Lookahead Adder (cpu_CLA.v), Carry Skip Adder (cpu_CSA.v), and
Carry Select Adder (cpu_CSeA.v), together with the Verilog test bench (tb_cpu.v), so that you can perform
logic synthesis and physical synthesis using the IIT-ECE429 ASIC flow. You can follow the
standard-cell-based flow to synthesize and lay out the CPU design. Please refer to Tutorial IV,
which we already conducted in Lab. 9, for detailed information.
of the design. Moreover, you will obtain the mapped circuit in cpu.vh. You should simulate it
with the Verilog models of the standard cells, i.e. osu05_stdcells.v, and compare the result with
the RTL simulation. The command is:
Once finished, you should check the final timing report in timing.rep.5.final in order to verify if
all the circuit timings are met. Moreover, you will obtain the circuit netlist in final.v, which
contains necessary buffers and inverters to overcome the interconnect delays in the signal
propagation network and the clock distribution network. You should simulate it with the Verilog
models of the standard cells, i.e. osu05_stdcells.v, and compare the result with the RTL
simulation and the post-synthesis simulation. The command is:
The four adder architectures that will be implemented in this project are listed below:
Example of 4-bit addition: A = 1011, B = 0110, A + B = 10001
Figure 6. 4-bit Carry Lookahead Adder
where
$C_1 = g_0 + p_0 C_0$
$C_2 = g_1 + p_1 (g_0 + p_0 C_0) = g_1 + p_1 g_0 + p_1 p_0 C_0$
$C_3 = g_2 + p_2 (g_1 + p_1 (g_0 + p_0 C_0)) = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 C_0$
*Note: $p_i = A_i \oplus B_i$ (or $A_i + B_i$) and $g_i = A_i B_i$
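The carry equations can be checked with a small bit-level model. This is a behavioral sketch in Python rather than the provided Verilog. The function name is hypothetical, and the expression for the carry-out C4 is an extension of the same pattern that is not written out in the handout.

```python
def cla_4bit(a, b, c0=0):
    """4-bit carry lookahead addition; returns (4-bit sum, carry-out)."""
    a_bits = [(a >> i) & 1 for i in range(4)]
    b_bits = [(b >> i) & 1 for i in range(4)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate: p_i = A_i xor B_i
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate:  g_i = A_i and B_i

    # Every carry depends only on p, g and C0, never on another carry.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    # C4 follows the same pattern (assumed extension of the equations).
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))  # S_i = p_i xor C_i
    return total, c4


# The 4-bit example values A = 1011 and B = 0110.
example_sum, example_carry = cla_4bit(0b1011, 0b0110)
```

For A = 1011 and B = 0110 the model gives a sum of 0001 with a carry-out of 1, that is, 10001.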
The upper half is implemented by two independent 4-bit adders, one whose carry-in is hardwired
to 0 and another whose carry-in is hardwired to 1. In parallel, these compute two alternative sums.
The carry-out from the previous 4-bit adder block controls multiplexers that select between the
two alternative sums. Following the same methodology, the carry-out from the previous block
also controls a multiplexer that selects between the two alternative carry-outs to provide the
appropriate carry-in for the next block. The structure of a 16-bit Carry Select Adder is
shown in Figure 9.
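The selection scheme can be sketched behaviorally as follows. The function names are hypothetical, and for simplicity every block here, including the lowest one, computes both alternatives, which may differ from the structure shown in Figure 9.

```python
def block_add(a, b, cin, width=4):
    """Plain adder for one block; returns (sum, carry-out)."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask) + cin
    return total & mask, total >> width


def carry_select_add(a, b, cin=0, width=16, block=4):
    """Carry select addition built from `block`-bit adder pairs."""
    mask = (1 << block) - 1
    result, carry = 0, cin
    for low in range(0, width, block):
        a_blk, b_blk = (a >> low) & mask, (b >> low) & mask
        # Both alternatives are computed without waiting for the incoming carry.
        sum0, cout0 = block_add(a_blk, b_blk, 0, block)
        sum1, cout1 = block_add(a_blk, b_blk, 1, block)
        # The carry from the previous block acts as the multiplexer select.
        block_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= block_sum << low
    return result, carry


wrapped = carry_select_add(0xFFFF, 0x0001)
```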
1. Generate the display screenshot or the text output of the RTL simulation and the screenshot
from simvision with provided test bench (tb_cpu.v).
2. Synthesize the design and summarize cell.rep and timing.rep.
3. Provide the display screenshot or the text output of the post-synthesis simulation and the
screenshot from simvision.
4. Summarize timing.rep.5.final. What is the maximum clock frequency at which this circuit can run?
5. Provide the display screenshot or the text output of the post-P&R simulation and the
screenshot from simvision.
6. Generate a new test bench file (tb_test.v) for the following instruction set.
[0] STORE 5
[1] STORE AAAA_AAAA
[2] STORE 5555_5555
[3] STORE 0000_000A
[4] STORE 0000_0001
[5] STORE FFFF_FFFF
[6] STORE 0000_00C8
[7] STORE 0000_012C
[8] STORE 0000_0001
[9] STORE AAAA_AAAB
[10] STORE 5555_5555
[2][0] ADD
[1][2] ADD
[6][7] ADD
[0][3] ADD
[5][4] SUB
[5][8] ADD
[2][0] SUB
[9][10] ADD
[7] READ
[3] READ
[1] READ
7. Provide the display screenshot and the text output of the RTL simulation and the screenshot
from simvision for each CPU design (cpu_CRA.v, cpu_CLA.v, cpu_CSA.v, cpu_CSeA.v) with
the newly generated test bench (tb_test.v).
8. Fill out the following performance comparison table after synthesis and analyze the results
(explain the reasons for your comparison results).
Case Study-2
32-bit CPU design with New ALU Architecture
(Source codes are provided)
7. Introduction: Case Study-2: Comparator Design in the ALU for the 32-bit CPU
In this project, we will add a 32-bit comparator block to the ALU design.
The function of the 32-bit comparator is shown in Table 1. Suppose we have two 32-bit
inputs A and B (we assume them to be unsigned in this project). The result of comparing
them can be A > B, A = B, or A < B, so two bits are needed to represent the comparison
result (two outputs, f1 and f0). Note that f1 = 1 means the two integers are equal.
Otherwise, f0 determines the relation between A and B.
Table 1. Comparator output encoding
         f1  f0
A > B     0   1
A < B     0   0
A = B     1   0
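The encoding in the table can be written as a short reference function, which is handy for checking simulation output. This is a behavioral sketch with a hypothetical function name, not the structural design requested in this case study.

```python
def compare_encoding(a, b):
    """Return (f1, f0) for two unsigned integers."""
    if a == b:
        return 1, 0   # f1 = 1 marks equality
    if a > b:
        return 0, 1   # f1 = 0, f0 = 1 means A > B
    return 0, 0       # f1 = 0, f0 = 0 means A < B


greater = compare_encoding(0x0000_012C, 0x0000_000A)
equal = compare_encoding(0x5555_5555, 0x5555_5555)
```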
In this project, you are going to design the 32-bit comparator structurally. First,
we will explain the structure using a 4-bit comparator. Then we will give the structural view of
the 32-bit comparator, and you are expected to complete the Verilog code according to that
structure.
The structure of the 4-bit comparator is shown in Fig. 10. It is designed as a tree. At the
bottom level (Level 2), there are four one-bit comparators. Each of them compares the
corresponding bits of A and B. The outputs f1 and f0 have the same meaning as
in Table 1 (f1f0=10 means a=b, f1f0=00 means a<b, f1f0=01 means a>b).
Notice that the final comparison result is determined by the most significant bit position
at which A and B differ. Take the 4-bit comparator shown in
Fig. 10 for example. If the result for the MSBs A[3] and B[3] shows that A[3] > B[3] or A[3]
< B[3] (in other words, f1 = 0), then A > B or A < B respectively. On the other hand, if A[3] =
B[3] (f1 = 1), then we have to refer to the comparison result of the next most significant bits, A[2] and
B[2]. If A[2] and B[2] are equal, we have to compare A[1] and B[1], and so on. If all 4 bits
are equal (f1 from all four one-bit comparators is 1), then A = B. In fact, rather than
comparing bits from MSB to LSB sequentially, we can do the comparison in parallel in order to
save time, as we can see from Fig. 10. Remember that the left (more significant) f1 and f0 results always
have higher priority than the right (less significant) f1 and f0 results. More specifically, for the
mux_4to2 component in Fig. 10, if hi_f1 = 0, which means the relation of A and B has already been
determined, then its outputs f1 and f0 should equal hi_f1 and hi_f0. Otherwise, f1
and f0 should equal lo_f1 and lo_f0.
From Fig. 10, we can see that the number of mux_4to2 blocks is 3, which is equal to 4 - 1, and the
number of levels in the tree is 3, which is equal to log2(4) + 1. More generally, if two N-bit
unsigned integers are compared (N is a power of 2), the tree comparator has log2(N) + 1 levels
and consists of N - 1 mux_4to2 blocks and N one_bit_comp blocks.
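The tree structure can be prototyped in Python before writing the Verilog. The sketch below reuses the block names one_bit_comp and mux_4to2 from the text, but the Python functions themselves and the helper tree_compare are hypothetical behavioral models, not the provided source code. N is assumed to be a power of 2.

```python
def one_bit_comp(a, b):
    """Compare two bits; returns (f1, f0) using the encoding given above."""
    f1 = 1 if a == b else 0
    f0 = a & (1 - b)          # 1 only when a = 1 and b = 0, that is, a > b
    return f1, f0


def mux_4to2(hi, lo):
    """Pass the high pair unless it reports equality (hi_f1 = 1)."""
    return lo if hi[0] == 1 else hi


def tree_compare(a, b, n=32):
    """Return ((f1, f0), number of mux_4to2 blocks used); n is a power of 2."""
    # Bottom level: n one-bit comparators, listed MSB first.
    level = [one_bit_comp((a >> i) & 1, (b >> i) & 1) for i in range(n - 1, -1, -1)]
    mux_count = 0
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            merged.append(mux_4to2(level[i], level[i + 1]))  # left has priority
            mux_count += 1
        level = merged
    return level[0], mux_count


result, muxes = tree_compare(0xAAAA_AAAB, 0xAAAA_AAAA)
```

For n = 32 the loop runs log2(32) = 5 times and uses 31 mux_4to2 blocks, matching the N - 1 count stated above.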
1. Generate a new test bench file (tb_test_comp.v) for the following instruction set.
[0] STORE 5
[1] STORE AAAA_AAAA
[2] STORE 5555_5555
[3] STORE 0000_000A
[4] STORE 0000_0001
[5] STORE FFFF_FFFF
[6] STORE 0000_00C8
[7] STORE 0000_012C
[8] STORE 0000_0001
[9] STORE AAAA_AAAB
[10] STORE 5555_5555
[2][0] ADD
[1][2] ADD
[6][7] ADD
[0][3] ADD
[5][4] SUB
[5][8] ADD
[2][0] SUB
[9][10] ADD
[8] READ
[10] READ
[4] READ
[2] READ
[7] READ
[3] READ
[8][10] CMP
[4][2] CMP
[3][7] CMP
[10] READ
[2] READ
[7] READ
2. Provide the display screenshot or the text output of the RTL simulation and the screenshot
from simvision with the test bench (tb_test_comp.v).
Good luck!