FPGA
Abstract—This paper describes an FPGA design that performs 4x4 matrix multiplication. The goal of the design is to optimize throughput, area, and accuracy. The design of our matrix multiplier consists of four main parts: fractional binary numbers (fixed point notation), binary multiplication, matrix addition, and fetch routine. Each part is designed and optimized to find the optimal balance among the throughput, the area, and the accuracy. According to the test results, the design with the optimal result used a 3-stage pipeline from the BRAM block to the output of the summation block, a 13-bit representation of binary values, shifting and addition to replace multipliers, and an inexpensive fetch module.

I. INTRODUCTION

This paper describes an FPGA design that performs 4x4 matrix multiplication. The design is implemented on a Virtex-5 using Xilinx ISE. A matrix with input integer values as its elements is multiplied with another matrix whose elements have constant values, as shown in Figure 1. For fetching the input values, a good candidate is Xilinx Block RAM (BRAM). Since the second matrix contains fractional values, the binary values must be able to represent fractional values. Although IEEE floating point is the standard representation, the design uses fixed point notation for enhanced performance, which is explained in more detail in section II-A.

Figure 1. Matrix A x Matrix B

The number of bits chosen for the fixed point notation directly affects the accuracy, the area, and the throughput of the design. Because the fixed point notation only estimates the fractional values, the accuracy may not be perfect. The throughput is given as the number of matrices calculated per second. The area is given as the number of used devices such as registers. The goal of the design is to optimize throughput, area, and accuracy, and there is a trade-off between the three criteria. For instance, increasing the number of bits used to represent the matrix element values will improve the accuracy, but it will increase the area because more registers are required to store more bits. Performance, defined as throughput divided by area, has a cost weight of 0.9, and accuracy has a cost weight of 0.1. Using this evaluation method, the performance, the area, and the accuracy need to be balanced in order to achieve the best result.

II. SYSTEM DESIGN

The design of our matrix multiplier consists of four main parts: fractional binary numbers (fixed point notation), binary multiplication, matrix addition, and fetch routine. These are explained in the subsections below.

A. Fractional Binary Numbers

Binary numbers in general represent integer values. Since Matrix B contains fractions and mixed numbers, the output can be fractions and mixed numbers. Binary strings can represent fractions and mixed numerals if the interpretation is explicitly defined beforehand. The notation used for this project is fixed point notation.

For fixed point notation, a fixed point must be chosen to split the binary string. The bits to the left of the fixed point represent integer values, while the bits to the right of the fixed point represent fractional values. Figure 2 shows a 7-bit example with the fixed point before the 3rd LSB. As shown, the weights of the bits are still powers of two; to the right of the fixed point the weights are simply inverse powers of two.

Figure 2. 7 Bit Fixed Point Example

It is important to choose an appropriate fixed point location, as it affects the accuracy of the system. The elements of Matrix A are integer values from 1 to 16. Therefore, the maximum value in Matrix C occurs when Matrix A is all 16's. The largest element from this matrix multiplication is 16 × (1/8 + 3/2 + 5 + 140/123) ≈ 124.21138. The integer portion, 124, is represented in binary by 1111100. Thus, there must be 7 bits to the left of the fixed point to represent this maximum integer value.
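As a concrete illustration of the 7-bit format of Figure 2 (the bit pattern below is a hypothetical value chosen only for illustration, not one used in the design), the four bits to the left of the point carry weights 2^3 down to 2^0 and the three bits to the right carry weights 2^-1 down to 2^-3:

\[
0101.101_2 = 0\cdot 2^{3} + 1\cdot 2^{2} + 0\cdot 2^{1} + 1\cdot 2^{0} + 1\cdot 2^{-1} + 0\cdot 2^{-2} + 1\cdot 2^{-3} = 4 + 1 + 0.5 + 0.125 = 5.625
\]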
Each of the elements in Matrix B can be represented exactly by a 16-bit fixed point binary string except 7/15 and 140/123. These two numbers must be estimated, which affects the accuracy of the system. Table I shows fixed point estimations for 7/15 and 140/123.

Table I
MIXED NUMBER ESTIMATION

Mixed Number    Binary Fixed Point
7/15            0.011101110...
140/123         1.001000110...

In fact, the entire 16-bit allowance is not needed to represent the other numbers in Matrix B exactly. Only 3 bits to the right of the fixed point are needed for the elements of Matrix B other than 7/15 and 140/123. This allows the accuracy to depend only on 7/15 and 140/123. Thus, the minimum bound for the output number size is 10 bits: 7 bits for the integer and 3 bits for the fraction. To increase accuracy, the fraction portion can be increased to 9 bits. However, this will affect performance, as the area will increase. We chose 11 bits for the initial design, estimating 7/15 by 0.0111 and 140/123 by 1.0010. There is a trade-off between performance and accuracy, and the number of bits used for the fractional portion is revisited during the optimization phase.
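These truncated estimates can be checked directly:

\[
0.0111_2 = \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} = 0.4375 \approx \tfrac{7}{15} = 0.4\overline{6},
\qquad
1.0010_2 = 1 + \tfrac{1}{8} = 1.125 \approx \tfrac{140}{123} \approx 1.13821
\]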
B. Binary Multiplication

Binary multiplication is performed with one number from Matrix A and one number from Matrix B. The value from Matrix A is an integer from 1 to 16, while the value from Matrix B is a fixed point binary number. The multiplication function must also return a fixed point binary number of the same size as the number from Matrix B.

A general binary multiplication circuit that takes two unknown binary numbers and multiplies them together is very complex. This circuit complexity increases area and delay, and both of these issues degrade performance. However, Matrix B is essentially a constant and can be hardcoded into the design. Therefore, a general binary multiplier circuit is not needed for this design. A circuit that multiplies by a constant can be optimized because there is only one unknown input. This approach was used for the binary multiplication of an element of Matrix A with an element of Matrix B.

To optimize the multiplication circuit, 16 different multiplication modules were created. Each module was optimized to multiply one of the elements of Matrix B with an input number. The easiest elements to implement were B11, B41 = 1. However, these modules cannot be simple wires, because the input is a binary integer while the output is a fixed point binary number. Therefore, the input needs to be shifted to the left by the fractional bit width, which is the number of bits to the right of the fixed point. This shift converts the integer input to the fixed point format.

The elements of Matrix B that are purely integer powers of two only involve bit shifting. These numbers are B12 = 1/8, B14 = 1/4, B24 = 2, B31 = 1/2, and B43 = 1/4. Dividing by 8 shifts the input three bits to the right, dividing by 4 shifts it two bits to the right, and dividing by 2 shifts it one bit to the right. Multiplying by 2 shifts the input one bit to the left. However, as with multiplying by unity, the input is first shifted by the fractional bit width. Multiplying an input by an integer power of two therefore requires only one shifting step.

Multiplying by integers other than powers of two adds additional complexity. These elements are B13, B34 = 3 and B32 = 5. These integers can be rewritten as 3 = (2 + 1) and 5 = (4 + 1); each contains a power of 2 plus 1. To multiply a given input Aij by 3, Aij is first multiplied by 2 by shifting, and then Aij is added to the shifted result. The same concept is applied to multiplying by 5. These multiplication modules require two steps to perform the operation: a shifting step and an addition step.

The next step up in complexity from multiplying by integers is multiplying by fractions with powers of two in the denominator. These numbers are B21, B44 = 3/4, B22 = 3/2, and B23 = 3/8. These numbers can be expressed as a sum of inverse powers of two: 3/4 = 1/4 + 1/4 + 1/4, 3/2 = 1 + 1/2, and 3/8 = 1/8 + 1/8 + 1/8. Adders in practice contain a large amount of combinational logic and should be minimized to increase performance. Multiplying by 3/4 or 3/8 in this form requires two adders, so it is desirable to optimize 3/4 and 3/8 to reduce the number of adders. Xilinx's synthesis tool contains primitives for adders and subtractors. If subtractor primitives are used, then the following expressions can be used: 3/4 = 1 − 1/4 and 3/8 = 1/2 − 1/8. Instead of using two adders, one subtractor is used. To multiply by 3/8, a given input Aij is multiplied by 1/2 and by 1/8, and the two results are then subtracted. The same concept is applied to 3/4 and 3/2. Therefore, these multiplication modules also contain two steps: a shifting step and an addition/subtraction step.

The last two elements of Matrix B, B33 = 7/15 and B42 = 140/123, must be estimated as discussed previously. The initial design used 4 bits to represent the fractional portion. Therefore, from Table I, these numbers are estimated by truncation with B33 = 0.0111 and B42 = 1.0010. With these estimations, B33 = 1/4 + 1/8 + 1/16 and B42 = 1 + 1/8. Multiplying by B42 is easily implemented with a shifting step and an addition step, as was done for the previous values. However, multiplying by B33 in this form would involve two adders. Using subtraction, it can be rewritten as B33 = 1/2 − 1/16, which uses one subtractor instead of two adders and requires only two shifted numbers rather than three. Therefore, B33 is implemented with a subtractor for improved optimization.
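The paper does not list its Verilog source; the sketch below only illustrates the two-step structure described above (a shift into the fixed point format followed by a single subtraction) for the constant 3/4 = 1 − 1/4. The module, signal, and parameter names are assumptions, not the design's actual code.

// Illustrative sketch only: names and widths are assumptions.
module mult_const_3_4 #(
    parameter IN_W  = 5,    // wide enough for the integers 1..16
    parameter FRAC  = 4,    // fractional bit width (4 in the initial design)
    parameter OUT_W = 11    // 7 integer bits + FRAC fractional bits
) (
    input  wire [IN_W-1:0]  a,  // element of Matrix A (integer)
    output wire [OUT_W-1:0] p   // fixed point result a * 3/4
);
    // Step 1 (shifting): convert the integer input to fixed point format
    // by shifting left by the fractional bit width.
    wire [OUT_W-1:0] a_fix = a << FRAC;   // a * 1 in fixed point

    // Step 2 (subtraction): 3/4 = 1 - 1/4, so one subtractor replaces two adders.
    assign p = a_fix - (a_fix >> 2);      // a - a/4 = (3/4) * a
endmodule

Multiplication by 3 would follow the same pattern with an adder instead, (a_fix << 1) + a_fix, while the power-of-two constants reduce to the single shift of step 1.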
C. Matrix Addition Module

Every element of the result Matrix C is a sum of four products. Equation (1) shows the required products for the elements of Matrix C in a given row (i is the row number). This equation demonstrates that to calculate a row of Matrix C, only the corresponding row of Matrix A is needed. To calculate row 1 of Matrix C, only row 1 of Matrix A is needed; to calculate row 2 of Matrix C, only row 2 of Matrix A is needed, and so on. All 16 elements of Matrix A do not need to be present in order to start computation.

Ci1 = Ai1(B11) + Ai2(B21) + Ai3(B31) + Ai4(B41)
Ci2 = Ai1(B12) + Ai2(B22) + Ai3(B32) + Ai4(B42)
Ci3 = Ai1(B13) + Ai2(B23) + Ai3(B33) + Ai4(B43)
Ci4 = Ai1(B14) + Ai2(B24) + Ai3(B34) + Ai4(B44)     (1)

Taking advantage of the fact that only one row of Matrix A is needed to calculate one row of Matrix C, the matrix addition module only needs a four-element input and a four-element output. If Matrix A row 1 is the input, then Matrix C row 1 is the output; if Matrix A row 2 is the input, then Matrix C row 2 is the output, and so on. This module basically implements (1) with Ai1, Ai2, Ai3, and Ai4 as the inputs and Ci1, Ci2, Ci3, and Ci4 as the outputs.

This approach requires four steps to output a single matrix. Even though this affects the throughput, this approach was used for two main reasons. The first reason is that the area is one fourth of the area that would be needed if the circuit calculated all 16 elements of Matrix C at the same time, which helps the performance of the circuit. The second, and main, reason is that fetching the elements of Matrix A from BRAM is a bottleneck. The BRAM can only output a limited number of Matrix A elements at a time, so accessing all 16 elements of Matrix A from BRAM takes multiple clock cycles. Therefore, it is vital to start calculating Matrix C as soon as the required elements are available. In this case, the first four elements of Matrix C can be calculated when the first four elements of Matrix A are available, the second four elements of Matrix C can be calculated when the second four elements of Matrix A are available, and so on. This strategy just matches the throughput of the BRAM, since the BRAM cannot access all 16 elements of Matrix A at the same time, with the added benefit of reducing area.
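As an illustration of this module's structure, the sketch below computes one element of a result row from four precomputed products; the names and widths are assumptions, and four such adders in parallel would form the full matrix addition module with Ai1..Ai4 products as inputs and Ci1..Ci4 as outputs.

// Illustrative sketch only: names and widths are assumptions.
// One output element of a row of Matrix C is the sum of four products
// from equation (1).
module row_sum #(
    parameter W = 13                       // fixed point word width
) (
    input  wire         clk,
    input  wire [W-1:0] p1, p2, p3, p4,    // Ai1*B1j, Ai2*B2j, Ai3*B3j, Ai4*B4j
    output reg  [W-1:0] cij                // element Cij of the result row
);
    // Combinational adder tree with a single register at the output
    // (the one-stage summation block revisited in section IV).
    always @(posedge clk)
        cij <= (p1 + p2) + (p3 + p4);
endmodule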
D. Fetch Routine

A separate fetch module is designed in order to incorporate the pipeline concept into our system. As described above, the matrix multiplication is performed one row at a time. Each element in each row of Matrix A needs to be fetched, and the multiplication and addition are performed afterwards. Without the pipeline design, this process is performed repetitively in sequence. In other words, the system would perform fetch, calculation, fetch, calculation, and so on. This is not optimal because the fetch module is not doing any work while the calculations are being performed, and vice versa. The pipeline in our design is explained more thoroughly in section III.

We used a dual BRAM to fetch values for Matrix A. Since the matrix multiplication is performed one row at a time, four values need to be fetched from the dual BRAM, while a dual BRAM outputs only two values at a time. In order to have four values available, the fetch module has four sets of intermediate registers and four sets of output registers. Let's assume that the BRAM outputs the following sequence of integer value sets: (16, 15), (14, 13), (12, 11), (10, 9), (8, 7), (6, 5), (4, 3), (2, 1). Two sets of intermediate registers, namely X1 and X2, alternate with the other two sets, namely X3 and X4, to store the outputs of the BRAM. Thus, the first output values 16 and 15 are stored in X1 and X2. After one clock cycle, the next output values 14 and 13 are stored in X3 and X4. Then the values in X1, X2, X3, and X4 are stored in the output register sets, namely R1, R2, R3, and R4, respectively. This alternating storing of the values is controlled by a 1-bit register named count, which flips its value every cycle to indicate which register sets must store the BRAM output. This process is shown in Figure 3.

Figure 3. Fetch Module Testbench Waveform

As shown, whenever count is 1, the BRAM outputs are stored into X1 and X2. Whenever count is 0, the outputs are stored into X3 and X4. However, this does not hold until the first rising edge of count. Because the BRAM output values remain 16 and 15 until the second rising edge of clk after a reset (rst), incorrect values would be stored in X3 and X4. This problem is resolved with a register named erase, which makes the value storing process wait until the second rising edge of clk after the reset. Another problem is the delay of the BRAM, which results in incorrect values being stored. Another register named en resolves this problem.
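The paper does not include the fetch module's source; the simplified sketch below only illustrates the alternating scheme and the count, erase, and en registers described above. The port names, widths, and the exact gating are assumptions, and the BRAM itself is omitted.

// Simplified illustration only: names, widths, and the gating of
// erase/en are assumptions, not the paper's actual implementation.
module fetch_sketch (
    input  wire       clk,
    input  wire       rst,
    input  wire       en,            // qualifies BRAM data after its read delay
    input  wire [4:0] douta, doutb,  // two Matrix A elements from the dual BRAM
    output reg  [4:0] r1, r2, r3, r4 // one full row of Matrix A
);
    reg [4:0] x1, x2, x3, x4;  // intermediate registers
    reg       count;           // selects which intermediate pair is written
    reg       erase;           // blocks storing until the BRAM output is valid

    always @(posedge clk) begin
        if (rst) begin
            count <= 1'b1;
            erase <= 1'b1;
        end else begin
            erase <= 1'b0;                            // skip the cycle after reset
            if (en && !erase) begin
                count <= ~count;                      // flips every cycle
                if (count) begin
                    {x1, x2} <= {douta, doutb};           // first half of a row
                    {r1, r2, r3, r4} <= {x1, x2, x3, x4}; // previous row complete
                end else begin
                    {x3, x4} <= {douta, doutb};           // second half of a row
                end
            end
        end
    end
endmodule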
III. IMPLEMENTATION

Figure 4. System Block Diagram

The separate modules are placed together as shown in Figure 4. The fetch routine cycles through the dual BRAM; after two clock cycles the fetch block outputs the first four elements of Matrix A, after two more clock cycles it outputs the next four elements of Matrix A, and so on. The output of the fetch block feeds into the binary multiplier block, where each element is multiplied by four different constants (Matrix B). The 16 outputs of the multiplier block are fed to a summation block. The output of the summation block is a row of the output Matrix C.

The system forms a pipeline from the BRAM block to the output of the summation block. When the output of the summation block is Matrix C row 1, Matrix C row 2 is in the summation phase, Matrix C row 3 is in the multiplier phase, and Matrix C row 4 is in the fetching phase. Pipelining increases latency but reduces the critical path delay between registers. A lower critical path delay allows for faster clock frequencies. However, adding more in-between registers to reduce the critical path delay increases the circuit area, so a balance is needed with pipelining between clock speed and area. This is explored during optimization.
IV. OPTIMIZATION

A. Performance

Performance = throughput / area     (2)

The first step to improved performance is to improve the bottleneck, which is accessing the BRAM. Equation (2) defines performance, where throughput is the number of output matrices per second. A single-output BRAM requires 16 clock cycles to access all 16 elements of Matrix A. However, using a dual-output BRAM allows two elements of Matrix A to be read per clock cycle. This doubles the throughput, as now only 8 clock cycles are needed to access all 16 elements of Matrix A. Implementing a dual-output BRAM rather than a single-output BRAM adds minimal area, as the dual BRAM still counts as 1 BRAM. Thus, the dual-output BRAM is greatly preferred over the single-output BRAM to improve throughput.

Throughput can also be improved by decreasing the minimum clock period. Additional pipeline registers can be used to reduce the critical path delay and hence the minimum clock period. Figure 5 shows a proposal to add additional pipelining to a four-input adder using three stages. In the first stage, inputs A1 and A2 are added. In the second stage, input A3 is added to the result of stage one. In the third stage, A4 is added to the result of stage two. Each stage is separated by sequential logic registers. This differs from the initial design of a one-stage pipeline summation block with only registers at the output.

Figure 5. Three Stage Summation Pipeline
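The paper does not include source for this proposal; the following is a minimal Verilog sketch of the three-stage structure of Figure 5, with illustrative names and widths.

// Illustrative sketch only: names and widths are assumptions.
module sum4_pipe3 #(
    parameter W = 13
) (
    input  wire         clk,
    input  wire [W-1:0] a1, a2, a3, a4,
    output reg  [W-1:0] sum
);
    reg [W-1:0] s1, s2;             // pipeline registers between stages
    reg [W-1:0] a3_d, a4_d, a4_dd;  // delay registers to keep inputs aligned

    always @(posedge clk) begin
        // Stage 1: add the first two inputs.
        s1    <= a1 + a2;
        a3_d  <= a3;
        a4_d  <= a4;
        // Stage 2: add the third input to the stage-one result.
        s2    <= s1 + a3_d;
        a4_dd <= a4_d;
        // Stage 3: add the fourth input to the stage-two result.
        sum   <= s2 + a4_dd;
    end
endmodule

Compared with the one-stage version (a single registered a1 + a2 + a3 + a4), this shortens the critical path, but the extra registers increase the area, as the results below show.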
The initial 11-bit output design with only a one-stage summation block had a minimum clock period of 3.46 ns. This allows a throughput of 3.612 × 10^7 matrices per second. The area is calculated by adding all the registers, LUTs, and BRAM. Two types of LUTs were counted: Slice LUTs and LUT flip-flop pairs. This area was calculated to be 551. The performance with the one-stage summation block is 6.5567 × 10^4.

Replacing the one-stage summation block with a three-stage summation block lowered the minimum period to 3.078 ns. This increases the throughput to 4.061 × 10^7 matrices per second. However, the area increased to 894, and the performance drops to 4.542 × 10^4 for the three-stage summation block. Even though increasing the pipeline registers speeds up the system, the increase in area makes the three-stage summation block unattractive. Performance also goes down using a two-stage summation block. Therefore, the initial one-stage summation block was kept. After this exercise, it was determined that increasing the pipelining within the system blocks would not improve performance, as the area increases more than the throughput does. Therefore, additional performance improvements would have to reduce area instead of increasing throughput.

In order to reduce the area, the number of registers, LUTs, and LUT flip-flop pairs needs to be reduced. The most noticeable type of device among these in the Verilog code is registers, because registers are usually declared explicitly in the code, in spite of Xilinx's optimization of the number of registers during the synthesis process. The initial design of the fetch routine had a larger area because it had extra intermediate registers to deal with the BRAM delay and the resulting incorrect values explained in section II-D above.

The algorithm used in the fetch module also changed. The initial design used a data stream concept: three sets of registers are placed to hold the output values from each output port of the dual BRAM. Figure 6 shows this process. The output of port a is first stored in XA3. During the next clock cycle, the value in XA3 is stored in XA2 while the new output of port a overwrites the value in XA3. Similarly, the value in XA2 is stored in XA1 during the next clock cycle, the value in XA3 overwrites the value in XA2, and the new output of port a overwrites the value in XA3. XA1 and XA2 are connected to the output registers. The port b output values are fetched in the identical way.

Figure 6. The Initial Design of Fetch Routine
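For comparison, that initial data-stream arrangement for port a can be sketched as a simple shift chain (again with assumed names; this is an illustration, not the original code):

// Illustrative sketch of the initial data-stream fetch for port a only.
module fetch_stream_a (
    input  wire       clk,
    input  wire [4:0] douta,          // port a output of the dual BRAM
    output reg  [4:0] xa1, xa2, xa3   // xa1 and xa2 feed the output registers
);
    // Every register is written every clock cycle, which is what the
    // revised fetch routine avoids.
    always @(posedge clk) begin
        xa3 <= douta;
        xa2 <= xa3;
        xa1 <= xa2;
    end
endmodule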
The new fetch algorithm described in section II-D not only reduces the area but also increases the speed. While the initial fetch design assigns a value to each of the intermediate registers every clock cycle, the new design alternates the value assignment between two sets of registers. Thus, a new value is assigned to each of the intermediate registers only once every two clock cycles. Nonetheless, the number of clock cycles taken to generate the output of the fetch routine stays the same. The smaller number of value assignments increases the speed of the fetch routine.
B. Accuracy

[C] =
  47.25  109.29675   63.408333  85.75
  34.25   78.243902  47.041667  61.75
  21.25   47.191057  30.675     37.75
   8.25   16.138211  14.308333  13.75

Figure 7. Matrix C Result in Double Format

% error = 100% × (1/16) × Σ_{i=1}^{16} |Val_fix − Val_double| / Val_double     (3)
The last optimization effort was regarding the binary multiplication. As described in section II-A, more bits can be chosen to obtain a better estimate of 7/15 and 140/123. With a 13-bit output, 6 bits are used to represent the fractional portion. Therefore, 7/15 ≈ 0.011101 and 140/123 ≈ 1.001000. The percent error for the 13-bit solution is 0.0982%, calculated using (3), and would receive an accuracy grade of 0.3. The ideal Matrix C is shown in Figure 7 and was calculated using a Matrix A of 16 down to 1. 13 bits was chosen over the initial 11 bits because the 11-bit error of 0.173% would have received a grade of zero for accuracy. Increasing the output size to 14 bits would increase the accuracy to grade 0.7 (error of 0.0411%), but would decrease performance by 12% because of the increase in area. Since performance is weighted more, the 13-bit solution was chosen.
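As a worked instance of (3) for a single element (C13, whose 13-bit fixed point value of 63.21875 appears in Figure 9 in section V), the contribution before averaging is:

\[
\frac{|63.21875 - 63.408333|}{63.408333} \approx 0.00299 = 0.299\%
\]

The 0.0982% figure quoted above is the average of the sixteen such terms.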
To represent the values of the resulting matrix elements with fairly good accuracy, 13 bits is more than what is needed; using 13 bits everywhere just to obtain a better estimate of these two numbers does not seem reasonable. Instead of declaring all values as 13-bit binary, intermediate registers can be declared to estimate these numbers and multiply with better accuracy. For instance, the module that multiplies the input by 7/15 can have a 19-bit intermediate register for the multiplication. The least significant six bits are truncated from the 19-bit multiplication result, which is then stored in the output registers. Therefore, there is an improvement in accuracy while maintaining the same number of bits used to represent the binary values. However, this slightly more complicated multiplication algorithm requires more registers, LUTs, and LUT flip-flop pairs.
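A sketch of this idea is shown below. The 19-bit intermediate register and the truncation of the six least significant bits follow the description above; the module name, the 12-bit encoding of 7/15, and the registering are illustrative assumptions.

// Illustrative sketch only: names, widths, and the rounding-by-truncation
// details are assumptions based on the description above.
module mult_7_15_wide (
    input  wire        clk,
    input  wire [4:0]  a,      // element of Matrix A (1..16)
    output reg  [12:0] p       // 13-bit result, 7 integer + 6 fraction bits
);
    // 7/15 encoded with 12 fractional bits (0.011101110111 in binary);
    // synthesis can reduce this constant multiply to shifts and adds.
    localparam [11:0] K_7_15 = 12'b0111_0111_0111;

    // 19-bit intermediate register in a 7.12 fixed point format.
    reg [18:0] wide;

    always @(posedge clk) begin
        wide <= a * K_7_15;    // product with extra fractional precision
        p    <= wide[18:6];    // truncate the six least significant bits
    end                        // (two-cycle latency: product, then truncation)
endmodule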
In order to reduce the area, the number of bits can also be reduced. With the new binary multiplication algorithm, it is possible to have fairly good accuracy with fewer bits. 10-bit and 11-bit outputs were chosen for testing: 10 bits results in a 0.0705% error and 11 bits results in a 0.06% error. Since any error between 0.05% and 0.1% gives the same accuracy score, 10 bits seemed to be the better design. After the synthesis process, however, Xilinx inserts more registers and LUTs than expected. The resulting area was not reduced and hence the performance did not improve. Therefore, the design was reverted to the one before this optimization, without the intermediate 19-bit registers. All multiplication registers will have 13 bits.

V. RESULTS

[B'] =
  1     0.125  3         0.25
  0.75  1.5    0.375     2
  0.5   5      0.453125  3
  1     1.125  0.25      0.75

Figure 8. Matrix B Estimation with 13 Bit Output

[C'] =
  47.25  109.125  63.21875  85.75
  34.25   78.125  46.90625  61.75
  21.25   47.125  30.59375  37.75
   8.25   16.125  14.28125  13.75

Figure 9. Matrix C Estimation with 13 Bit Output

64 × [C'] =
  3024  6984  4046  5488
  2192  5000  3002  3952
  1360  3016  1958  2416
   528  1032   914   880

Figure 10. Shifted Matrix C Estimation with 13 Bit Output

The 13-bit output design has 7 bits to represent integer values and 6 bits to represent fractional values. The estimation matrix B' is shown in Figure 8. Only the values B'33 and B'42 needed to be estimated. However, these two estimated values in Matrix B' propagate throughout the resulting Matrix C' shown in Figure 9. The simulation tool only displays the binary number seen at the output, as it is oblivious to the fixed point notation. Since the fractional bit width is 6 bits, shifting Matrix C' 6 bits to the left (or multiplying by 64) returns the unsigned decimal value. Figure 10 shows the resulting shifted matrix for simulation verification.

The 13-bit output design has a minimum clock period of 3.323 ns as reported by the post-place-and-route timing report. See Figure 11 for the post-place-and-route static timing screenshot. One matrix is output every 8 clock cycles, therefore the throughput is 3.7617 × 10^7 matrices per second. Figure 12 shows the device utilization summary for this design. This design uses 158 Slice Registers, 145 Slice LUTs, 199 LUT

Figure 11. PAR Timing