Matrix Vector Multiplier UART System
EKANAYAKE E.M.M.U.B.
981011031V
INTRODUCTION
Here an r x c matrix is multiplied with a c x 1 vector to obtain an r x 1 vector.
Figure 1
Consider the dot product of the rth row of matrix K with the vector X. The result
gives the rth element of the output vector Y.
After multiplying each element in the row with the corresponding element of the
vector, the products must be added together to get that element.
All the products cannot be added at once, so they are added two at a time until
the final sum is reached. For that, we implement an Adder Tree as shown in
figure 2.
Figure 2
Here DEPTH is the number of levels in the adder tree. Within each level of the tree,
all the additions are done simultaneously. The sums produced by the adders in one
level are passed to the next level, where they are added again, and so on.
The output of the root adder is the final output.
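The level-by-level reduction described above can be sketched in Python. This is only a behavioural model under stated assumptions, not the RTL: the helper names `clog2`, `adder_tree_sum` and `matvec` are hypothetical, and the input is padded with zeros up to a power of two so that every level adds complete pairs.

```python
def clog2(n):
    """Ceiling of log2(n) for n >= 1 (same role as $clog2 in the RTL)."""
    return (n - 1).bit_length()

def adder_tree_sum(products):
    """Add the products two at a time, one tree level per loop iteration."""
    depth = clog2(len(products))
    c_pad = 2 ** depth
    # Pad with zeros so each level has an even number of inputs.
    level = list(products) + [0] * (c_pad - len(products))
    for _ in range(depth):
        # a is the index of an adder within the current level.
        level = [level[2 * a] + level[2 * a + 1] for a in range(len(level) // 2)]
    return level[0], depth

def matvec(k, x):
    """One adder tree per row: y[r] is the dot product of row r of k with x."""
    return [adder_tree_sum([kv * xv for kv, xv in zip(row, x)])[0] for row in k]
```

For example, `adder_tree_sum([1, 2, 3, 4, 5])` pads five products to eight inputs and reduces them in three levels.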
Figure 3
Here, a is the index of each adder in a single level of the adder tree.
Figure 4
BLOCK DESIGN
Figure 5
First, the UART RX data coming from the PC is converted to AXI Stream data by a
converter. That data is then input to the matrix vector multiplier. After
processing, the output goes through the skid buffer, which holds it if there is a
delay on the output side. Finally, the output is converted back to UART TX data
and sent back to the PC.
RTL EXPLANATION
The RTL comprises 6 code files.
Figure 6
module matvec_mul #(
parameter R=8, // R - Number of rows in the matrix and size of the output vector
C=8, // C - Number of columns in the matrix and size of the input vector
W_X=8, // Width of an element in the input vector
W_K=8, // Width of an element in the matrix
localparam DEPTH = $clog2(C), // Number of levels in the adder tree
W_M = W_X + W_K, // Width of the product of a matrix element and a vector element
W_Y = W_M + DEPTH // Width of an element in the output vector
// (one extra bit per adder tree level)
)(
input logic clk, cen, // clock and clock enable
input logic signed [R-1:0][C-1:0][W_K-1:0] k, // input matrix, R x C
input logic signed [C-1:0][W_X-1:0] x, // input vector, C x 1
output logic signed [R-1:0][W_Y-1:0] y // output vector, R x 1
);
// Padding
localparam C_PAD = 2**$clog2(C); // number of columns after padding to a power of 2
logic signed [W_Y-1:0] tree [R][DEPTH+1][C_PAD]; // R adder trees, one per row
wire signed [C_PAD-1:0][W_X-1:0] x_pad = {'0, x}; // input vector padded with zeros
wire signed [R-1:0][C_PAD-1:0][W_K-1:0] k_pad; // padded matrix
genvar r, c, d, a;
for (r=0; r<R; r=r+1) begin
assign k_pad[r] = {'0, k[r]}; // pad each row of the matrix with zeros
Figure 7
The skid buffer has 2 states: the Empty state and the Full state.
In the Empty state, the skid buffer is free to store data. In the Full state, the
AXI Stream must first take the buffered data before new data is accepted.
Figure 8
// on reset, the state is EMPTY, so the buffer is free to store data
else if (m_ready || s_ready) state <= state_next;
// otherwise, when m_ready or s_ready is asserted, the state advances to state_next
AXI Stream Matrix Vector Multiplier (axis_matvec_mul.v)
This is the wrapper used to integrate matvec_mul.sv and skid_buffer.sv. It is the
top module over the matrix vector multiplier and the skid buffer.
These top modules (or wrappers) are written in old Verilog in order to avoid
multi-dimensional ports.
Figure 9
Figure 9 shows the block diagram of this module. Here, the clock enable is the
same as the slave ready signal from the skid buffer. This means that if the skid
buffer is not ready to take data, the matrix vector multiplier stalls.
This integrated module has AXI Stream signals.
LATENCY plays a significant role here. LATENCY is the number of clock cycles
Matvec Mul takes to produce its output (LATENCY = DEPTH + 1).
s_valid must be delayed by the same number of cycles so that it lines up with
the output data, so it is shifted by LATENCY cycles.
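The delayed valid signal can be modelled as a shift register that advances only while the clock enable is high. The sketch below is a cycle-by-cycle Python model under that assumption; the shift register itself is not shown in this report, and the function name `delay_valid` is hypothetical.

```python
def delay_valid(s_valid_seq, cen_seq, latency):
    """Return the delayed valid seen on each cycle.

    s_valid_seq: input valid per clock cycle (0 or 1)
    cen_seq:     clock enable per clock cycle (0 stalls the pipeline)
    latency:     number of register stages (DEPTH + 1 in the text above)
    """
    shift = [0] * latency          # shift[-1] is the stage driving the output
    m_valid_seq = []
    for s_valid, cen in zip(s_valid_seq, cen_seq):
        m_valid_seq.append(shift[-1])   # value visible during this cycle
        if cen:                          # registers update only when enabled
            shift = [s_valid] + shift[:-1]
    return m_valid_seq
```

With C = 8, DEPTH is 3 and LATENCY is 4, so a valid pulse entering on cycle 0 appears on cycle 4 when there are no stalls. Each stalled cycle pushes it one cycle later, which keeps it aligned with the data held in the multiplier.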
module axis_matvec_mul #(
parameter R=8, // R - Number of rows in the matrix and size of the output vector
C=8, // C - Number of columns in the matrix and size of the input vector
W_X=8, // Width of an element in the input vector
W_K=8, // Width of an element in the matrix
LATENCY = $clog2(C)+1, // Number of clock cycles from input to output
W_Y = W_X + W_K + $clog2(C) // Width of an element in the output vector
)(
input clk, rstn,
output s_axis_kx_tready, // slave ready
input s_axis_kx_tvalid, // slave valid
input [R*C*W_K + C*W_X -1:0] s_axis_kx_tdata, // whole input in 1 flat array
// This contains input vector and
// input matrix
input m_axis_y_tready, // master ready
output m_axis_y_tvalid, // master valid
output [R*W_Y -1:0] m_axis_y_tdata // output data
);
wire [R*C*W_K-1:0] k; // matrix
wire [C*W_X -1:0] x; // input vector
assign {k, x} = s_axis_kx_tdata; // split the flat input into the matrix (upper bits)
// and the vector (lower bits)
wire [R*W_Y-1:0] i_data; // data bus from matrix vector multiplier to skid buffer
wire i_ready; // ready signal from skid buffer to matrix vector multiplier
assign s_axis_kx_tready = i_ready; // slave ready is the skid buffer's slave ready,
// as in figure 9
endmodule
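The bus widths in the module header and the split done by `assign {k, x} = s_axis_kx_tdata` can be checked with a little integer arithmetic. The Python sketch below is illustrative only; the helper names `clog2`, `bus_widths` and `split_kx` are hypothetical, and the flat bus is modelled as a plain Python integer.

```python
def clog2(n):
    """Ceiling of log2(n) for n >= 1 (same role as $clog2)."""
    return (n - 1).bit_length()

def bus_widths(r, c, w_k, w_x):
    """Return (input bus width, W_Y, output bus width) in bits."""
    w_in = r * c * w_k + c * w_x        # width of s_axis_kx_tdata
    w_y = w_x + w_k + clog2(c)          # product width plus one bit per tree level
    return w_in, w_y, r * w_y           # r * w_y is the width of m_axis_y_tdata

def split_kx(tdata, r, c, w_k, w_x):
    """Mirror {k, x} = tdata: x takes the low bits, k the bits above it."""
    x_bits = c * w_x
    k_bits = r * c * w_k
    x = tdata & ((1 << x_bits) - 1)
    k = (tdata >> x_bits) & ((1 << k_bits) - 1)
    return k, x
```

With the default parameters (R = C = 8, W_X = W_K = 8), the input bus is 576 bits (72 bytes), each output element is 19 bits wide, and the output bus is 152 bits.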
Figure 10
Before the transfer starts, UART RX is held at 1.
To start the transfer, RX is pulled to 0. The first data bit is then read 1.5 times
the clocks per pulse later (clocks per pulse = clock frequency of the FPGA board /
baud rate, where the baud rate is in bits per second), which is the middle of the
first data bit.
After that, one bit is read every clocks per pulse.
After all 8 bits, the parity bit, if used, is sent in the same way.
To stop the transfer, RX is set back to 1.
Figure 10 shows the state machine.
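The sampling schedule above can be illustrated with a small Python model that works on a list of RX levels, one entry per FPGA clock. It is a sketch under assumptions, not the RTL: the names `uart_encode` and `uart_decode` are hypothetical, bits are taken least significant bit first (the usual UART convention, which this report does not state explicitly), and no parity bit is used.

```python
def uart_encode(byte, clocks_per_pulse):
    """Build the RX waveform for one byte: start bit, 8 data bits, stop bit."""
    bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    # Each bit is held on the line for clocks_per_pulse clock cycles.
    return [b for b in bits for _ in range(clocks_per_pulse)]

def uart_decode(samples, clocks_per_pulse, bits_per_word=8):
    """Recover one word by sampling in the middle of each data bit."""
    start = samples.index(0)                       # first clock where RX is low
    # 1.5 pulses after the start edge lands mid-way through data bit 0.
    first = start + clocks_per_pulse + clocks_per_pulse // 2
    word = 0
    for i in range(bits_per_word):
        word |= samples[first + i * clocks_per_pulse] << i
    return word
```

Sampling in the middle of each bit leaves half a pulse of margin on either side, so small differences between the sender's and receiver's clocks do not corrupt the word.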
Figure 11
module uart_rx #(
parameter CLOCKS_PER_PULSE = 4, // clock frequency of the FPGA / baud rate,
// e.g. 200_000_000/9600
BITS_PER_WORD = 8, // bits in a word
W_OUT = 24 // R*C*W_K + C*W_X in the full system: the input width of axis_matvec_mul.v
)(
input logic clk, rstn, rx,
output logic m_valid,
output logic [W_OUT-1:0] m_data
);
localparam NUM_WORDS = W_OUT/BITS_PER_WORD; // number of words
// Counters
// State Machine
if (!rstn) begin
// reset the word, bit and clock counters, master valid and output data to 0
{c_words, c_bits, c_clocks, m_valid, m_data} <= '0;
state <= IDLE; // when reset, state is IDLE
end else begin
m_valid <= 0; // on every other positive clock edge, master valid defaults to 0
case (state)
c_bits <= 0; // bit count is set to 0
Figure 12
module uart_tx #(
genvar n;
for (n=0; n<NUM_WORDS; n=n+1)
assign s_packets[n] = { ~(END_BITS'(0)), s_data[n], 1'b0}; //encapsulation
assign tx = m_packets[0]; //assign the first bit in flat array into tx, the output
// Counters
logic [$clog2(NUM_WORDS*PACKET_SIZE)-1:0] c_pulses; // number of pulses
logic [$clog2(CLOCKS_PER_PULSE) -1:0] c_clocks; // number of clocks
// State Machine
if (!rstn) begin
state <= IDLE; // when reset, state is IDLE
m_packets <= '1; // initiate all the data to be sent to 1
{c_pulses, c_clocks} <= 0; // set number of clocks and pulses to 0
end else
case (state)
IDLE : if (s_valid) begin // if the output of the matrix vector multiplier is valid
state <= SEND; // switch to SEND state
m_packets <= s_packets; // load the encapsulated words as one flat array
end
end
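The encapsulation `{ ~(END_BITS'(0)), s_data[n], 1'b0 }` places a 0 start bit in bit 0, the data word above it, and END_BITS ones at the top, and `tx` always drives bit 0 of the flat array. The Python sketch below models that layout. It assumes END_BITS = PACKET_SIZE - BITS_PER_WORD - 1, which is not shown in the listing, and the names `make_packet` and `serialize` are hypothetical.

```python
def make_packet(word, bits_per_word=8, packet_size=13):
    """Wrap one word as {end bits (all 1), data, start bit (0)}."""
    end_bits = packet_size - bits_per_word - 1     # assumed relation, see lead-in
    ones = (1 << end_bits) - 1
    data = word & ((1 << bits_per_word) - 1)
    return (ones << (bits_per_word + 1)) | (data << 1)   # bit 0 stays 0

def serialize(words, bits_per_word=8, packet_size=13):
    """Return the bits in the order they leave tx (bit 0 of each packet first)."""
    bits = []
    for word in words:
        packet = make_packet(word, bits_per_word, packet_size)
        bits += [(packet >> i) & 1 for i in range(packet_size)]
    return bits
```

With PACKET_SIZE = BITS_PER_WORD + 5 = 13, each byte is followed by four 1 bits, which act as the stop bit plus idle time before the next start bit.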
module mvm_uart_system #(
parameter CLOCKS_PER_PULSE = 200_000_000/9600, //200_000_000/9600
BITS_PER_WORD = 8,
PACKET_SIZE_TX = BITS_PER_WORD + 5,
W_Y_OUT = 32,
R=8, C=8, W_X=8, W_K=8
)(
input clk, rstn, rx,
output tx
);
uart_rx #(
.CLOCKS_PER_PULSE (CLOCKS_PER_PULSE),
.BITS_PER_WORD (BITS_PER_WORD),
.W_OUT (W_BUS_KX)
) UART_RX (
.clk (clk),
.rstn (rstn),
.rx (rx),
.m_valid(s_valid),
.m_data (s_data_kx)
);
genvar r;
for (r=0; r<R; r=r+1) begin
assign y_up [r] = m_data_y[W_Y*(r+1)-1 : W_Y*r];
assign o_up [r] = $signed(y_up[r]); // sign extend to 32b
assign o_flat[W_Y_OUT*(r+1)-1 : W_Y_OUT*r] = o_up[r];
// assign o_flat[W_Y_OUT*(r+1)-1 : W_Y_OUT*r] = $signed(m_data_y[W_Y*(r+1)-1 : W_Y*r]);
end
uart_tx #(
.CLOCKS_PER_PULSE (CLOCKS_PER_PULSE),
.BITS_PER_WORD (BITS_PER_WORD),
.PACKET_SIZE (PACKET_SIZE_TX),
.W_OUT (W_BUS_Y)
) UART_TX (
.clk (clk ),
.rstn (rstn ),
.s_ready (m_ready ),
.s_valid (m_valid ),
.s_data (o_flat ),
.tx (tx )
);
endmodule
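The generate loop above widens each W_Y-bit result to W_Y_OUT = 32 bits before it is sent over UART. What sign extension does to the bit pattern can be sketched in Python. The function name `sign_extend` is hypothetical, and values are modelled as unsigned integers holding two's complement bit patterns.

```python
def sign_extend(value, w_in, w_out=32):
    """Sign-extend a w_in-bit two's complement pattern to w_out bits."""
    value &= (1 << w_in) - 1            # keep only the w_in input bits
    if value >> (w_in - 1):             # sign bit set: the value is negative
        value -= 1 << w_in
    return value & ((1 << w_out) - 1)   # re-encode as a w_out-bit pattern
```

For the default parameters W_Y is 19, so the pattern `0x7FFFF` (all ones, which is -1) becomes `0xFFFFFFFF`, while a positive value such as 5 is unchanged.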
FPGA Module (fpga_module.sv)
This instantiates mvm_uart_system.v and sets its parameters.
module fpga_module(
input logic clk, rstn, rx,
//input logic [NUM_WORDS-1:0][BITS_PER_WORD-1:0] s_data,
output logic tx//, s_ready
);
mvm_uart_system #(
.CLOCKS_PER_PULSE(1085), // clock frequency / baud rate; 1085 matches 125_000_000/115200, not 200_000_000/9600
.BITS_PER_WORD(8),
.W_Y_OUT(32),
.R(8),.C(8),.W_X(8),.W_K(8)
) mvm_uart_system_0 (.*);
endmodule
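CLOCKS_PER_PULSE is the number of FPGA clock cycles per UART bit, so it follows from the clock frequency and the baud rate. The small Python sketch below shows the arithmetic; the function name `clocks_per_pulse` is hypothetical. The 125 MHz figure comes from the 8 ns clock period in the constraint file, while the 115200 baud rate is an assumption inferred from the value 1085 and is not stated in this report.

```python
def clocks_per_pulse(clk_hz, baud):
    """Number of clock cycles that one UART bit lasts, rounded to the nearest cycle."""
    return round(clk_hz / baud)

# Clock frequency implied by an 8.00 ns period in the constraint file.
CLK_HZ_FROM_XDC = round(1e9 / 8.00)
```

With these inputs, `clocks_per_pulse(CLK_HZ_FROM_XDC, 115200)` gives 1085, whereas `clocks_per_pulse(200_000_000, 9600)` gives 20833. The value passed here therefore corresponds to the former pair.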
PYTHON CODE EXPLANATION
mvm_uart_system.py
This Python script acts as a test bench that checks whether the implemented RTL
works correctly on the FPGA.
A random 8 x 8 matrix and an 8 x 1 vector are generated with the numpy library and
multiplied together, and the result is stored as the expected output.
The generated matrix and vector are then sent to the FPGA over UART using the
pyserial library. After the FPGA finishes the calculation, its output is read
back and compared with the output computed by the Python script itself.
If both outputs are equal, the FPGA works properly. Otherwise, there is an
error in the process.
'''
Send k & x
'''
kx = np.concatenate([x, k.flatten()])
# concatenate x and the flattened k into one array, x first
kx_bytes = kx.tobytes()
# convert the concatenated array to raw bytes
'''
Receive y
'''
# receive the output from the FPGA: R elements, 4 bytes each
y_bytes = ser.read(R*4)
y = np.frombuffer(y_bytes, dtype=np.int32)
#print(y_exp.tobytes())
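The byte-level side of this exchange can be sketched without numpy or a serial port, using only the standard library. This is an illustration under assumptions, not the script itself: the helper names `pack_kx`, `unpack_y` and `expected_y` are hypothetical, the elements of k and x are assumed to be signed 8-bit values (matching W_K = W_X = 8), and each received element is assumed to be a little-endian signed 32-bit integer.

```python
import struct

def pack_kx(k, x):
    """Pack x followed by the row-major elements of k, one signed byte each."""
    flat = list(x) + [v for row in k for v in row]
    return struct.pack(f"{len(flat)}b", *flat)

def unpack_y(y_bytes, r):
    """Decode r little-endian signed 32-bit integers from the received bytes."""
    return list(struct.unpack(f"<{r}i", y_bytes))

def expected_y(k, x):
    """Reference matrix-vector product to compare against the FPGA output."""
    return [sum(kv * xv for kv, xv in zip(row, x)) for row in k]
```

In the script, the packed bytes would be written to the serial port and the reply read back with `ser.read(R*4)`. The check then reduces to comparing `unpack_y(y_bytes, R)` with `expected_y(k, x)`.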
PROCESS OF PROGRAMMING FPGA
Figure 13
Figure 14
3. Select source RTL files
Figure 15(a)
Figure 15(b)
4. Select the Constraint file (Zybo-Master.xdc)
Figure 16(a)
Figure 16(b)
##Clock signal
create_clock -add -name sys_clk_pin -period 8.00 -waveform {0 4} [get_ports { clk }];
##Switches
set_property -dict { PACKAGE_PIN G15 IOSTANDARD LVCMOS33 } [get_ports { rstn }];
#rstn to PACKAGE_PIN G15
##Pmod Header JC
set_property -dict { PACKAGE_PIN V15 IOSTANDARD LVCMOS33 } [get_ports { tx }];
#tx to PACKAGE_PIN V15
set_property -dict { PACKAGE_PIN T11 IOSTANDARD LVCMOS33 } [get_ports { rx }];
#rx to PACKAGE_PIN T11
This is the constraint file. It maps the ports on the FPGA board to the inputs
and outputs of the RTL design.
5. Select the board type
Figure 17
Figure 18
Now you are in the project window.
Figure 19
7. Make sure that the FPGA module is selected as the top module.
Figure 20
8. Generate the bitstream (synthesis and implementation)
Figure 21
Figure 22
10. Then connect the ZYBO FPGA board to the computer, turn it on, open the
hardware manager, select Open Target and then Auto Connect.
Figure 23
Figure 24
12. Finish programming the device
Figure 25
14. Then reset the FPGA by turning sw[0] off and on
16. Observe the python shell
Figure 26
If the result computed in the Python shell matches the result received from the
FPGA, the implementation is successful. Otherwise, there is an error.
UTILIZATION REPORT
After synthesis, before implementation, an estimated utilization report is
generated.
Figure 27(a)
Figure 27(b)
After implementation, the actual utilization report is generated.
Figure 28(a)
Figure 28(b)
POWER REPORT
Figure 29
TIMING REPORT
Figure 30
AREA REPORT
Figure 31