FPGA
Abstract—This paper describes an FPGA design that performs 4x4 matrix multiplication. The goal of the design is to optimize throughput, area, and accuracy. The design of our matrix multiplier consists of four main parts: fractional binary numbers (fixed point notation), binary multiplication, matrix addition, and fetch routine. Each part is designed and optimized to find the optimal balance among the throughput, the area, and the accuracy. According to the test results, the design with the optimal result used a 3-stage pipeline from the BRAM block to the output of the summation block, a 13-bit representation of binary values, shifting and addition to replace multipliers, and an inexpensive fetch module.

I. INTRODUCTION

This paper describes an FPGA design that performs 4x4 matrix multiplication. The design is implemented on a Virtex-5 using Xilinx ISE. A matrix with input integer values as its elements is multiplied with another matrix whose elements have constant values, as shown in Figure 1. For fetching the input values, a good candidate is Xilinx Block RAM (BRAM). Since the second matrix contains fractional values, the binary values must be able to represent fractional values. Although IEEE floating point is the standard representation, the design uses fixed point notation for enhanced performance, which is explained in more detail in section II-A.

Figure 1. Matrix A x Matrix B

The number of bits chosen for the fixed point notation directly affects the accuracy, the area, and the throughput of the design. Because the fixed point notation only estimates the fractional values, the accuracy may not be perfect. The throughput is given as the number of matrices calculated per second. The area is given as the number of used devices such as registers. The goal of the design is to optimize throughput, area, and accuracy, and there is a trade-off between the three criteria. For instance, increasing the number of bits used to represent the matrix element values will improve the accuracy, but it will increase the area because more registers are required to store more bits. Performance, defined as throughput divided by area, has a cost weight of 0.9, and accuracy has a cost weight of 0.1. Using this evaluation method, the performance, the area, and the accuracy need to be balanced in order to achieve the best result.

II. SYSTEM DESIGN

The design of our matrix multiplier consists of four main parts: fractional binary numbers (fixed point notation), binary multiplication, matrix addition, and fetch routine. These are explained in the subsections below.

A. Fractional Binary Numbers

Binary numbers in general represent integer values. Since Matrix B contains fractions and mixed numbers, the output can be fractions and mixed numbers. Binary strings can represent fractions and mixed numerals if the interpretation is explicitly defined beforehand. The notation used for this project is fixed point notation.

For fixed point notation, a fixed point must be chosen to split the binary string. The bits to the left of the fixed point represent integer values, while the bits to the right of the fixed point represent fractional values. Figure 2 shows a 7-bit example with the fixed point before the 3rd LSB. As shown, the weights of the bits are still powers of two; to the right of the fixed point the weights are simply inverse powers of two.

Figure 2. 7 Bit Fixed Point Example

It is important to choose an appropriate fixed point location, as it affects the accuracy of the system. The elements of Matrix A are integer values from 1 to 16. Therefore, the maximum value in Matrix C occurs when Matrix A is all 16's. The largest element from this matrix multiplication is 16 × (1/8 + 3/2 + 5 + 140/123) ≈ 124.21138. The integer portion, 124, is represented in binary by 1111100. Thus, there must be 7 bits to the left of the fixed point to represent this maximum integer value.
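As a concrete illustration of the 7-bit format of Figure 2 (the bit pattern below is a hypothetical value chosen only for illustration, not one used in the design), the four bits to the left of the point carry weights 2^3 down to 2^0 and the three bits to the right carry weights 2^-1 down to 2^-3:

\[
0101.101_2 = 0\cdot 2^{3} + 1\cdot 2^{2} + 0\cdot 2^{1} + 1\cdot 2^{0} + 1\cdot 2^{-1} + 0\cdot 2^{-2} + 1\cdot 2^{-3} = 4 + 1 + 0.5 + 0.125 = 5.625
\]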
Each of the elements in Matrix B can be represented exactly by a 16-bit fixed point binary string except 7/15 and 140/123. These two numbers must be estimated, which affects the accuracy of the system. Table I shows fixed point estimations for 7/15 and 140/123.

Table I
MIXED NUMBER ESTIMATION

Mixed Number    Binary Fixed Point
7/15            0.011101110...
140/123         1.001000110...

In fact, the entire 16-bit allowance is not needed to represent the other numbers in Matrix B exactly. Only 3 bits to the right of the fixed point are needed for the elements of Matrix B other than 7/15 and 140/123. This allows the accuracy to depend only on 7/15 and 140/123. Thus, the minimum bound for the output number size is 10 bits: 7 bits for the integer and 3 bits for the fraction. To increase accuracy, the fraction portion can be increased to 9 bits. However, this will affect performance, as the area will increase. We chose 11 bits for the initial design, estimating 7/15 by 0.0111 and 140/123 by 1.0010. There is a trade-off between performance and accuracy, and the number of bits used for the fractional portion is revisited during the optimization phase.
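These truncated estimates can be checked directly:

\[
0.0111_2 = \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} = 0.4375 \approx \tfrac{7}{15} = 0.4\overline{6},
\qquad
1.0010_2 = 1 + \tfrac{1}{8} = 1.125 \approx \tfrac{140}{123} \approx 1.13821
\]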
B. Binary Multiplication

Binary multiplication is performed with one number from Matrix A and one number from Matrix B. The value from Matrix A is an integer from 1 to 16, while the value from Matrix B is a fixed point binary number. The multiplication function must also return a fixed point binary number of the same size as the number from Matrix B.

A general binary multiplication circuit that takes two unknown binary numbers and multiplies them together is very complex. This circuit complexity increases area and delay, and both of these issues degrade performance. However, Matrix B is essentially a constant and can be hardcoded into the design. Therefore, a general binary multiplier circuit is not needed for this design. A circuit that multiplies by a constant can be optimized because there is only one unknown input. This approach was used for the binary multiplication of an element of Matrix A with an element of Matrix B.

To optimize the multiplication circuit, 16 different multiplication modules were created. Each module was optimized to multiply one of the elements of Matrix B with an input number. The easiest elements to implement were B11, B41 = 1. However, these modules cannot be simple wires, because the input is a binary integer while the output is a fixed point binary number. Therefore, the input needs to be shifted to the left by the fractional bit width, which is the number of bits to the right of the fixed point. This shift converts the integer input to the fixed point format.

The elements of Matrix B that are purely integer powers of two only involve bit shifting. These numbers are B12 = 1/8, B14 = 1/4, B24 = 2, B31 = 1/2, and B43 = 1/4. Dividing by 8 shifts the input three bits to the right, dividing by 4 shifts it two bits to the right, and dividing by 2 shifts it one bit to the right. Multiplying by 2 shifts the input one bit to the left. However, as with multiplying by unity, the input is first shifted by the fractional bit width. Multiplying an input by an integer power of two therefore requires only one shifting step.

Multiplying by integers other than powers of two adds additional complexity. These elements are B13, B34 = 3 and B32 = 5. These integers can be rewritten as 3 = (2 + 1) and 5 = (4 + 1); each contains a power of 2 plus 1. To multiply a given input Aij by 3, Aij is first multiplied by 2 by shifting, and then Aij is added to the shifted result. The same concept is applied to multiplying by 5. These multiplication modules require two steps to perform the operation: a shifting step and an addition step.

The next step up in complexity from multiplying by integers is multiplying by fractions with powers of two in the denominator. These numbers are B21, B44 = 3/4, B22 = 3/2, and B23 = 3/8. These numbers can be expressed as a sum of inverse powers of two: 3/4 = 1/4 + 1/4 + 1/4, 3/2 = 1 + 1/2, and 3/8 = 1/8 + 1/8 + 1/8. Adders in practice contain a large amount of combinational logic and should be minimized to increase performance. Multiplying by 3/4 or 3/8 in this form requires two adders, so it is desirable to optimize 3/4 and 3/8 to reduce the number of adders. Xilinx's synthesis tool contains primitives for adders and subtractors. If subtractor primitives are used, then the following expressions can be used: 3/4 = 1 − 1/4 and 3/8 = 1/2 − 1/8. Instead of using two adders, one subtractor is used. To multiply by 3/8, a given input Aij is multiplied by 1/2 and by 1/8, and the two results are then subtracted. The same concept is applied to 3/4 and 3/2. Therefore, these multiplication modules also contain two steps: a shifting step and an addition/subtraction step.

The last two elements of Matrix B, B33 = 7/15 and B42 = 140/123, must be estimated as discussed previously. The initial design used 4 bits to represent the fractional portion. Therefore, from Table I, these numbers are estimated by truncation with B33 = 0.0111 and B42 = 1.0010. With these estimations, B33 = 1/4 + 1/8 + 1/16 and B42 = 1 + 1/8. Multiplying by B42 is easily implemented with a shifting step and an addition step, as was done for the previous values. However, multiplying by B33 in this form would involve two adders. Using subtraction, it can be rewritten as B33 = 1/2 − 1/16, which uses one subtractor instead of two adders and requires only two shifted numbers rather than three. Therefore, B33 is implemented with a subtractor for improved optimization.
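The paper does not list its Verilog source; the sketch below only illustrates the two-step structure described above (a shift into the fixed point format followed by a single subtraction) for the constant 3/4 = 1 − 1/4. The module, signal, and parameter names are assumptions, not the design's actual code.

// Illustrative sketch only: names and widths are assumptions.
module mult_const_3_4 #(
    parameter IN_W  = 5,    // wide enough for the integers 1..16
    parameter FRAC  = 4,    // fractional bit width (4 in the initial design)
    parameter OUT_W = 11    // 7 integer bits + FRAC fractional bits
) (
    input  wire [IN_W-1:0]  a,  // element of Matrix A (integer)
    output wire [OUT_W-1:0] p   // fixed point result a * 3/4
);
    // Step 1 (shifting): convert the integer input to fixed point format
    // by shifting left by the fractional bit width.
    wire [OUT_W-1:0] a_fix = a << FRAC;   // a * 1 in fixed point

    // Step 2 (subtraction): 3/4 = 1 - 1/4, so one subtractor replaces two adders.
    assign p = a_fix - (a_fix >> 2);      // a - a/4 = (3/4) * a
endmodule

Multiplication by 3 would follow the same pattern with an adder instead, (a_fix << 1) + a_fix, while the power-of-two constants reduce to the single shift of step 1.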
C. Matrix Addition Module

Every element of the result Matrix C is a sum of four products. Equation (1) shows the required products for the elements of Matrix C in a given row (i is the row number). This equation demonstrates that to calculate a row of Matrix C, only the corresponding row of Matrix A is needed. To calculate row 1 of Matrix C, only row 1 of Matrix A is needed; to calculate row 2 of Matrix C, only row 2 of Matrix A is needed, and so on. All 16 elements of Matrix A do not need to be present in order to start computation.

Ci1 = Ai1(B11) + Ai2(B21) + Ai3(B31) + Ai4(B41)
Ci2 = Ai1(B12) + Ai2(B22) + Ai3(B32) + Ai4(B42)
Ci3 = Ai1(B13) + Ai2(B23) + Ai3(B33) + Ai4(B43)
Ci4 = Ai1(B14) + Ai2(B24) + Ai3(B34) + Ai4(B44)     (1)

Taking advantage of the fact that only one row of Matrix A is needed to calculate one row of Matrix C, the matrix addition module only needs a four-element input and a four-element output. If Matrix A row 1 is the input, then Matrix C row 1 is the output; if Matrix A row 2 is the input, then Matrix C row 2 is the output, and so on. This module basically implements (1) with Ai1, Ai2, Ai3, and Ai4 as the inputs and Ci1, Ci2, Ci3, and Ci4 as the outputs.

This approach requires four steps to output a single matrix. Even though this affects the throughput, this approach was used for two main reasons. The first reason is that the area is one fourth of the area that would be needed if the circuit calculated all 16 elements of Matrix C at the same time, which helps the performance of the circuit. The second, and main, reason is that fetching the elements of Matrix A from BRAM is a bottleneck. The BRAM can only output a limited number of Matrix A elements at a time, so accessing all 16 elements of Matrix A from BRAM takes multiple clock cycles. Therefore, it is vital to start calculating Matrix C as soon as the required elements are available. In this case, the first four elements of Matrix C can be calculated when the first four elements of Matrix A are available, the second four elements of Matrix C can be calculated when the second four elements of Matrix A are available, and so on. This strategy just matches the throughput of the BRAM, since the BRAM cannot access all 16 elements of Matrix A at the same time, with the added benefit of reducing area.
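As an illustration of this module's structure, the sketch below computes one element of a result row from four precomputed products; the names and widths are assumptions, and four such adders in parallel would form the full matrix addition module with Ai1..Ai4 products as inputs and Ci1..Ci4 as outputs.

// Illustrative sketch only: names and widths are assumptions.
// One output element of a row of Matrix C is the sum of four products
// from equation (1).
module row_sum #(
    parameter W = 13                       // fixed point word width
) (
    input  wire         clk,
    input  wire [W-1:0] p1, p2, p3, p4,    // Ai1*B1j, Ai2*B2j, Ai3*B3j, Ai4*B4j
    output reg  [W-1:0] cij                // element Cij of the result row
);
    // Combinational adder tree with a single register at the output
    // (the one-stage summation block revisited in section IV).
    always @(posedge clk)
        cij <= (p1 + p2) + (p3 + p4);
endmodule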
D. Fetch Routine

A separate fetch module is designed in order to incorporate the pipeline concept into our system. As described above, the matrix multiplication is performed one row at a time. Each element in each row of Matrix A needs to be fetched, and the multiplication and addition are performed afterwards. Without the pipeline design, this process is performed repetitively in sequence. In other words, the system would perform fetch, calculation, fetch, calculation, and so on. This is not optimal because the fetch module is not doing any work while the calculations are being performed, and vice versa. The pipeline in our design is explained more thoroughly in section III.

We used a dual BRAM to fetch values for Matrix A. Since the matrix multiplication is performed one row at a time, four values need to be fetched from the dual BRAM, while a dual BRAM outputs only two values at a time. In order to have four values available, the fetch module has four sets of intermediate registers and four sets of output registers. Let's assume that the BRAM outputs the following sequence of integer value sets: (16, 15), (14, 13), (12, 11), (10, 9), (8, 7), (6, 5), (4, 3), (2, 1). Two sets of intermediate registers, namely X1 and X2, alternate with the other two sets, namely X3 and X4, to store the outputs of the BRAM. Thus, the first output values 16 and 15 are stored in X1 and X2. After one clock cycle, the next output values 14 and 13 are stored in X3 and X4. Then the values in X1, X2, X3, and X4 are stored in the output register sets, namely R1, R2, R3, and R4, respectively. This alternating storing of the values is controlled by a 1-bit register named count, which flips its value every cycle to indicate which register sets must store the BRAM output. This process is shown in Figure 3.

Figure 3. Fetch Module Testbench Waveform

As shown, whenever count is 1, the BRAM outputs are stored into X1 and X2. Whenever count is 0, the outputs are stored into X3 and X4. However, this does not hold until the first rising edge of count. Because the BRAM output values remain 16 and 15 until the second rising edge of clk after a reset (rst), incorrect values would be stored in X3 and X4. This problem is resolved with a register named erase, which makes the value storing process wait until the second rising edge of clk after the reset. Another problem is the delay of the BRAM, which results in incorrect values being stored. Another register named en resolves this problem.
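The paper does not include the fetch module's source; the simplified sketch below only illustrates the alternating scheme and the count, erase, and en registers described above. The port names, widths, and the exact gating are assumptions, and the BRAM itself is omitted.

// Simplified illustration only: names, widths, and the gating of
// erase/en are assumptions, not the paper's actual implementation.
module fetch_sketch (
    input  wire       clk,
    input  wire       rst,
    input  wire       en,            // qualifies BRAM data after its read delay
    input  wire [4:0] douta, doutb,  // two Matrix A elements from the dual BRAM
    output reg  [4:0] r1, r2, r3, r4 // one full row of Matrix A
);
    reg [4:0] x1, x2, x3, x4;  // intermediate registers
    reg       count;           // selects which intermediate pair is written
    reg       erase;           // blocks storing until the BRAM output is valid

    always @(posedge clk) begin
        if (rst) begin
            count <= 1'b1;
            erase <= 1'b1;
        end else begin
            erase <= 1'b0;                            // skip the cycle after reset
            if (en && !erase) begin
                count <= ~count;                      // flips every cycle
                if (count) begin
                    {x1, x2} <= {douta, doutb};           // first half of a row
                    {r1, r2, r3, r4} <= {x1, x2, x3, x4}; // previous row complete
                end else begin
                    {x3, x4} <= {douta, doutb};           // second half of a row
                end
            end
        end
    end
endmodule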
III. IMPLEMENTATION

Figure 4. System Block Diagram

The separate modules are placed together as shown in Figure 4. The fetch routine cycles through the dual BRAM; after two clock cycles the fetch block outputs the first four elements of Matrix A, after two more clock cycles it outputs the next four elements of Matrix A, and so on. The output of the fetch block feeds into the binary multiplier block, where each element is multiplied by four different constants (Matrix B). The 16 outputs of the multiplier block are fed to a summation block. The output of the summation block is a row of the output Matrix C.

The system forms a pipeline from the BRAM block to the output of the summation block. When the output of the summation block is Matrix C row 1, Matrix C row 2 is in the summation phase, Matrix C row 3 is in the multiplier phase, and Matrix C row 4 is in the fetching phase. Pipelining increases latency but reduces the critical path delay between registers. A lower critical path delay allows for faster clock frequencies. However, adding more in-between registers to reduce the critical path delay increases the circuit area, so a balance is needed with pipelining between clock speed and area. This is explored during optimization.
IV. OPTIMIZATION

A. Performance

Performance = throughput / area     (2)

The first step to improved performance is to improve the bottleneck, which is accessing the BRAM. Equation (2) defines performance, where throughput is the number of output matrices per second. A single-output BRAM requires 16 clock cycles to access all 16 elements of Matrix A. However, using a dual-output BRAM allows two elements of Matrix A to be read per clock cycle. This doubles the throughput, as now only 8 clock cycles are needed to access all 16 elements of Matrix A. Implementing a dual-output BRAM rather than a single-output BRAM adds minimal area, as the dual BRAM still counts as 1 BRAM. Thus, the dual-output BRAM is greatly preferred over the single-output BRAM to improve throughput.

Throughput can also be improved by decreasing the minimum clock period. Additional pipeline registers can be used to reduce the critical path delay and hence the minimum clock period. Figure 5 shows a proposal to add additional pipelining to a four-input adder using three stages. In the first stage, inputs A1 and A2 are added. In the second stage, input A3 is added to the result of stage one. In the third stage, A4 is added to the result of stage two. Each stage is separated by sequential logic registers. This differs from the initial design of a one-stage pipeline summation block with only registers at the output.

Figure 5. Three Stage Summation Pipeline
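The paper does not include source for this proposal; the following is a minimal Verilog sketch of the three-stage structure of Figure 5, with illustrative names and widths.

// Illustrative sketch only: names and widths are assumptions.
module sum4_pipe3 #(
    parameter W = 13
) (
    input  wire         clk,
    input  wire [W-1:0] a1, a2, a3, a4,
    output reg  [W-1:0] sum
);
    reg [W-1:0] s1, s2;             // pipeline registers between stages
    reg [W-1:0] a3_d, a4_d, a4_dd;  // delay registers to keep inputs aligned

    always @(posedge clk) begin
        // Stage 1: add the first two inputs.
        s1    <= a1 + a2;
        a3_d  <= a3;
        a4_d  <= a4;
        // Stage 2: add the third input to the stage-one result.
        s2    <= s1 + a3_d;
        a4_dd <= a4_d;
        // Stage 3: add the fourth input to the stage-two result.
        sum   <= s2 + a4_dd;
    end
endmodule

Compared with the one-stage version (a single registered a1 + a2 + a3 + a4), this shortens the critical path, but the extra registers increase the area, as the results below show.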
The initial 11-bit output design with only a one-stage summation block had a minimum clock period of 3.46 ns. This allows a throughput of 3.612 × 10^7 matrices per second. The area is calculated by adding all the registers, LUTs, and BRAM. Two types of LUTs were counted: Slice LUTs and LUT flip-flop pairs. This area was calculated to be 551. The performance with the one-stage summation block is 6.5567 × 10^4.

Replacing the one-stage summation block with a three-stage summation block lowered the minimum period to 3.078 ns. This increases the throughput to 4.061 × 10^7 matrices per second. However, the area increased to 894, and the performance drops to 4.542 × 10^4 for the three-stage summation block. Even though increasing the pipeline registers speeds up the system, the increase in area makes the three-stage summation block unattractive. Performance also goes down using a two-stage summation block. Therefore, the initial one-stage summation block was kept. After this exercise, it was determined that increasing the pipelining within the system blocks would not improve performance, as the area increases more than the throughput does. Therefore, additional performance improvements would have to reduce area instead of increasing throughput.

In order to reduce the area, the number of registers, LUTs, and LUT flip-flop pairs needs to be reduced. The most noticeable type of device among these in the Verilog code is registers, because registers are usually declared explicitly in the code, in spite of Xilinx's optimization of the number of registers during the synthesis process. The initial design of the fetch routine had a larger area because it had extra intermediate registers to deal with the BRAM delay and the resulting incorrect values explained in section II-D above.

The algorithm used in the fetch module also changed. The initial design used a data stream concept: three sets of registers are placed to hold the output values from each output port of the dual BRAM. Figure 6 shows this process. The output of port a is first stored in XA3. During the next clock cycle, the value in XA3 is stored in XA2 while the new output of port a overwrites the value in XA3. Similarly, the value in XA2 is stored in XA1 during the next clock cycle, the value in XA3 overwrites the value in XA2, and the new output of port a overwrites the value in XA3. XA1 and XA2 are connected to the output registers. The port b output values are fetched in the identical way.

Figure 6. The Initial Design of Fetch Routine
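For comparison, that initial data-stream arrangement for port a can be sketched as a simple shift chain (again with assumed names; this is an illustration, not the original code):

// Illustrative sketch of the initial data-stream fetch for port a only.
module fetch_stream_a (
    input  wire       clk,
    input  wire [4:0] douta,          // port a output of the dual BRAM
    output reg  [4:0] xa1, xa2, xa3   // xa1 and xa2 feed the output registers
);
    // Every register is written every clock cycle, which is what the
    // revised fetch routine avoids.
    always @(posedge clk) begin
        xa3 <= douta;
        xa2 <= xa3;
        xa1 <= xa2;
    end
endmodule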
The new fetch algorithm described in section II-D not only reduces the area but also increases the speed. While the initial fetch design assigns a value to each of the intermediate registers every clock cycle, the new design alternates the value assignment between two sets of registers. Thus, a new value is assigned to each of the intermediate registers only once every two clock cycles. Nonetheless, the number of clock cycles taken to generate the output of the fetch routine stays the same. The smaller number of value assignments increases the speed of the fetch routine.
B. Accuracy

[C] =
  47.25  109.29675   63.408333  85.75
  34.25   78.243902  47.041667  61.75
  21.25   47.191057  30.675     37.75
   8.25   16.138211  14.308333  13.75

Figure 7. Matrix C Result in Double Format

% error = 100% × (1/16) × Σ_{i=1}^{16} |Val_fix − Val_double| / Val_double     (3)
The last optimization effort was regarding the binary multiplication. As described in section II-A, more bits can be chosen to obtain a better estimate of 7/15 and 140/123. With a 13-bit output, 6 bits are used to represent the fractional portion. Therefore, 7/15 ≈ 0.011101 and 140/123 ≈ 1.001000. The percent error for the 13-bit solution is 0.0982%, calculated using (3), and would receive an accuracy grade of 0.3. The ideal Matrix C is shown in Figure 7 and was calculated using a Matrix A of 16 down to 1. 13 bits was chosen over the initial 11 bits because the 11-bit error of 0.173% would have received a grade of zero for accuracy. Increasing the output size to 14 bits would increase the accuracy to grade 0.7 (error of 0.0411%), but would decrease performance by 12% because of the increase in area. Since performance is weighted more, the 13-bit solution was chosen.
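As a worked instance of (3) for a single element (C13, whose 13-bit fixed point value of 63.21875 appears in Figure 9 in section V), the contribution before averaging is:

\[
\frac{|63.21875 - 63.408333|}{63.408333} \approx 0.00299 = 0.299\%
\]

The 0.0982% figure quoted above is the average of the sixteen such terms.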
To represent the values of the resulting matrix elements with fairly good accuracy, 13 bits is more than what is needed; using 13 bits everywhere just to obtain a better estimate of these two numbers does not seem reasonable. Instead of declaring all values as 13-bit binary, intermediate registers can be declared to estimate these numbers and multiply with better accuracy. For instance, the module that multiplies the input by 7/15 can have a 19-bit intermediate register for the multiplication. The least significant six bits are truncated from the 19-bit multiplication result, which is then stored in the output registers. Therefore, there is an improvement in accuracy while maintaining the same number of bits used to represent the binary values. However, this slightly more complicated multiplication algorithm requires more registers, LUTs, and LUT flip-flop pairs.
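A sketch of this idea is shown below. The 19-bit intermediate register and the truncation of the six least significant bits follow the description above; the module name, the 12-bit encoding of 7/15, and the registering are illustrative assumptions.

// Illustrative sketch only: names, widths, and the rounding-by-truncation
// details are assumptions based on the description above.
module mult_7_15_wide (
    input  wire        clk,
    input  wire [4:0]  a,      // element of Matrix A (1..16)
    output reg  [12:0] p       // 13-bit result, 7 integer + 6 fraction bits
);
    // 7/15 encoded with 12 fractional bits (0.011101110111 in binary);
    // synthesis can reduce this constant multiply to shifts and adds.
    localparam [11:0] K_7_15 = 12'b0111_0111_0111;

    // 19-bit intermediate register in a 7.12 fixed point format.
    reg [18:0] wide;

    always @(posedge clk) begin
        wide <= a * K_7_15;    // product with extra fractional precision
        p    <= wide[18:6];    // truncate the six least significant bits
    end                        // (two-cycle latency: product, then truncation)
endmodule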
In order to reduce the area, the number of bits can also be reduced. With the new binary multiplication algorithm, it is possible to have fairly good accuracy with fewer bits. 10-bit and 11-bit outputs were chosen for testing: 10 bits results in a 0.0705% error and 11 bits results in a 0.06% error. Since any error between 0.05% and 0.1% gives the same accuracy score, 10 bits seemed to be the better design. After the synthesis process, however, Xilinx inserts more registers and LUTs than expected. The resulting area was not reduced and hence the performance did not improve. Therefore, the design was reverted to the one before this optimization, without the intermediate 19-bit registers. All multiplication registers will have 13 bits.

V. RESULTS

[B'] =
  1     0.125  3         0.25
  0.75  1.5    0.375     2
  0.5   5      0.453125  3
  1     1.125  0.25      0.75

Figure 8. Matrix B Estimation with 13 Bit Output

[C'] =
  47.25  109.125  63.21875  85.75
  34.25   78.125  46.90625  61.75
  21.25   47.125  30.59375  37.75
   8.25   16.125  14.28125  13.75

Figure 9. Matrix C Estimation with 13 Bit Output

64 × [C'] =
  3024  6984  4046  5488
  2192  5000  3002  3952
  1360  3016  1958  2416
   528  1032   914   880

Figure 10. Shifted Matrix C Estimation with 13 Bit Output

The 13-bit output design has 7 bits to represent integer values and 6 bits to represent fractional values. The estimation matrix B' is shown in Figure 8. Only the values B'33 and B'42 needed to be estimated. However, these two estimated values in Matrix B' propagate throughout the resulting Matrix C' shown in Figure 9. The simulation tool only displays the binary number seen at the output, as it is oblivious to the fixed point notation. Since the fractional bit width is 6 bits, shifting Matrix C' 6 bits to the left (or multiplying by 64) returns the unsigned decimal value. Figure 10 shows the resulting shifted matrix for simulation verification.

The 13-bit output design has a minimum clock period of 3.323 ns as reported by the post-place-and-route timing report. See Figure 11 for the post-place-and-route static timing screenshot. One matrix is output every 8 clock cycles, therefore the throughput is 3.7617 × 10^7 matrices per second. Figure 12 shows the device utilization summary for this design. This design uses 158 Slice Registers, 145 Slice LUTs, 199 LUT

Figure 11. PAR Timing