Matrix Vector Multiplier Uart System

This document describes a matrix vector multiplier implemented using an adder tree. It multiplies an r x c matrix with a c x 1 vector to produce an r x 1 output vector. The adder tree adds pairs of partial products in each level to efficiently calculate the dot product of each row of the matrix with the vector. The document includes block diagrams of the adder tree structure, inputs and outputs of the matrix vector multiplier, and high-level block diagrams of the top-level design and RTL modules.

Uploaded by

Mahela Ekanayake

MATRIX VECTOR MULTIPLIER

UART SYSTEM

EKANAYAKE E.M.M.U.B.
981011031V
INTRODUCTION
Here an r x c matrix is multiplied by a c x 1 vector to obtain an r x 1 vector.

Figure 1

Consider the dot product of the rth row of matrix K with the vector X. The result
gives the rth element of the output vector Y.
To compute it, each element of the row is multiplied by the corresponding
element of the vector, and the products are then added together to get the element.
All the products cannot be added in a single step, so they are added two at a
time until a single sum remains. For that we have to implement an Adder Tree,
as shown in figure 2.

Figure 2

Here DEPTH is the number of levels in the adder tree. Within each level of the
tree, all additions are done simultaneously. The sums produced by the adders in one
level are passed to the next level, where they are added again, and so on.
The output of the root adder is the final result.
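
The level-by-level reduction described above can be sketched in software. This is only an illustrative model, not the RTL: the function name adder_tree_sum and the sample numbers are made up for the example, and the list is padded with zeros to a power of two in the same spirit as the padding used later in matvec_mul.sv.

```python
def adder_tree_sum(products):
    """Sum a list pairwise, level by level, as a binary adder tree would."""
    # Pad with zeros up to the next power of two so every level pairs up evenly.
    size = 1
    while size < len(products):
        size *= 2
    level = list(products) + [0] * (size - len(products))

    depth = 0
    while len(level) > 1:
        # The additions within one level are independent of each other,
        # which is why hardware can perform them simultaneously.
        level = [level[2 * a] + level[2 * a + 1] for a in range(len(level) // 2)]
        depth += 1
    return level[0], depth


# Hypothetical row of K and vector X, used only for illustration.
row = [3, -1, 4, 1, -5, 9, 2, -6]
x = [2, 7, 1, -8, 2, 8, 1, 8]
products = [k * xi for k, xi in zip(row, x)]
total, depth = adder_tree_sum(products)
print(total, depth)
```

For 8 products the tree needs 3 levels, which matches DEPTH = log2(8).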

The general form of an adder can be shown as in figure 3.

Figure 3

Here, a is the index of each adder in a single level of the adder tree.

The inputs and outputs of the matrix vector multiplier are as follows.

Figure 4

BLOCK DESIGN

Figure 5

First, the UART RX data coming from the PC is converted to AXI Stream data by a
converter. That data is then fed to the matrix vector multiplier. After processing,
the output passes through a skid buffer, which holds it if the output side is
delayed. Finally, the output is converted back to UART TX data and sent to the PC.

RTL EXPLANATION
The RTL consists of 6 code files.

Figure 6

Matrix Vector Multiplier (matvec_mul.sv)

This is the main module, which takes an r x c matrix and a c x 1 vector as
input and outputs an r x 1 vector.
This matrix vector multiplier has no AXI signals.

module matvec_mul #(
parameter R=8, // R - Number of rows in matrix and size of output vector
C=8, // C - Number of columns in matrix and size of input vector
W_X=8, // Width of an element in the input vector
W_K=8, // Width of an element in the matrix
localparam DEPTH = $clog2(C), // Number of levels in the adder tree
W_M = W_X + W_K, // Width of the product of a matrix element and a vector element
W_Y = W_M + DEPTH // Width of an element in the output vector
// (one extra bit per level of the adder tree)
)(
input logic clk, cen, // clock and clock enable
input logic signed [R-1:0][C-1:0][W_K-1:0] k, // input matrix R x C
input logic signed [C-1:0][W_X-1:0] x, // input vector C x 1
output logic signed [R-1:0][W_Y-1:0] y // output vector R x 1
);

// Padding
localparam C_PAD = 2**$clog2(C); // number of columns after padding to a power of 2
logic signed [W_Y-1:0] tree [R][DEPTH+1][C_PAD]; // R adder trees, one per row

wire signed [C_PAD-1:0][W_X-1:0] x_pad = {'0, x}; // padding the input vector by
//adding extra 0s
wire signed [R-1:0][C_PAD-1:0][W_K-1:0] k_pad; // padded matrix

genvar r, c, d, a;
for (r=0; r<R; r=r+1) begin
assign k_pad[r] = {'0, k[r]}; // adding extra 0s to pad the matrix in each tree

for (c=0; c<C_PAD; c=c+1)

always_ff @(posedge clk) // register after multiplication
if (cen) tree[r][0][c] <= $signed(k_pad[r][c]) * $signed(x_pad[c]);
// multiplying elements in row of matrix with elements of vector and store
//them in tree

for (d=0; d<DEPTH; d=d+1)

for (a=0; a<C_PAD/2**(d+1); a=a+1)
always_ff @(posedge clk) // register after each add
if (cen) tree [r][d+1][a] <= tree [r][d][2*a] + tree [r][d][2*a+1];
// adding the products with each other

assign y[r] = tree [r][DEPTH][0]; // final output is at root

end
endmodule
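
As a quick check of the parameter arithmetic in matvec_mul.sv, the derived values can be reproduced in Python. This is a sketch only: the helper names clog2 and matvec_widths are invented for the example, and clog2 is assumed to behave like $clog2 for positive integers.

```python
def clog2(n):
    """Ceiling of log2(n) for n >= 1, mirroring SystemVerilog's $clog2."""
    return (n - 1).bit_length()


def matvec_widths(c, w_x, w_k):
    """Return (DEPTH, C_PAD, W_M, W_Y) as derived in matvec_mul.sv."""
    depth = clog2(c)      # levels in the adder tree
    c_pad = 2 ** depth    # columns after padding to a power of two
    w_m = w_x + w_k       # width of one product
    w_y = w_m + depth     # each tree level can add one bit
    return depth, c_pad, w_m, w_y


depth, c_pad, w_m, w_y = matvec_widths(c=8, w_x=8, w_k=8)

# Largest magnitude an 8-term dot product of signed 8-bit values can reach.
worst_product = (-2 ** 7) * (-2 ** 7)
worst_sum = 8 * worst_product

print(depth, c_pad, w_m, w_y)
print(worst_sum <= 2 ** (w_y - 1) - 1)  # does it fit in a signed W_Y-bit value?
```

With the default parameters each output element is 19 bits wide, which is why the top-level wrapper later sign extends it to 32 bits before sending it to the PC.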

Skid Buffer (skid_buffer.sv)

This is the skid buffer, which handles backpressure by holding one data beat
when the receiving side is not ready.

Figure 7

The skid buffer has 2 states: the Empty state and the Full state.
In the Empty state the skid buffer is free to store data, while in the Full state
the AXI Stream output must take the buffered data first.
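
The two-state behaviour can be sketched as a small next-state function. This is a simplified model that mirrors only the combinational next-state logic of skid_buffer.sv; the function name skid_next_state and the use of strings for the states are choices made for this example, and the registered outputs and the enable on the state register are not modelled.

```python
def skid_next_state(state, s_valid, s_ready, m_ready):
    """Next state of the skid buffer state machine ("EMPTY" or "FULL")."""
    if state == "EMPTY" and s_valid and s_ready and not m_ready:
        # Data was accepted on the input but the output is stalled:
        # it has to be parked in the buffer.
        return "FULL"
    if state == "FULL" and m_ready:
        # The output side is ready again, so the parked data can leave.
        return "EMPTY"
    return state  # in all other cases the state is unchanged


# Walk through one stall: the output stalls for a cycle, then recovers.
state = "EMPTY"
state = skid_next_state(state, s_valid=1, s_ready=1, m_ready=0)
print(state)
state = skid_next_state(state, s_valid=0, s_ready=0, m_ready=1)
print(state)
```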

Figure 8

module skid_buffer #( parameter WIDTH = 8)(

input logic clk, rstn, s_valid, m_ready,

input logic [WIDTH-1:0] s_data,
output logic [WIDTH-1:0] m_data,
output logic m_valid, s_ready
);
enum {FULL, EMPTY} state, state_next; // 2 states in the skid buffer state machine.
always_comb begin
state_next = state; // All the other cases, next state is the current state
case (state)
EMPTY : if(!m_ready && s_ready && s_valid) state_next = FULL;
// but if m_ready = 0 and s_ready and
//s_valid are 1,next state is FULL since data cannot be sent
// to the slave and skid buffer should be prepared to store input
FULL : if(m_ready) state_next = EMPTY;
// if m_ready = 1, next state is EMPTY since releasing unsent data
// and backpressure is released so no need of storing data in skid buffer
endcase
end
always_ff @(posedge clk)
if (!rstn) state <= EMPTY;

// if reset is done, state is EMPTY, open to store data
// in buffer
else if (m_ready || s_ready) state <= state_next;
// else if m_ready or s_ready is triggered, state equals next state.

logic b_valid; // valid signal from buffer

logic [WIDTH-1:0] b_data; // data output from buffer
wire [WIDTH-1:0] m_data_next = (state == FULL) ? b_data : s_data;
// multiplexer - if state is in FULL, next m_data should be the buffer output
//data,
// otherwise s_data should be the next m_data
wire m_valid_next = (state == FULL) ? b_valid : s_valid;
// multiplexer - if state is FULL, next m_valid should be buffer_valid,
// otherwise it should be s_valid
wire buffer_en = (state_next == FULL) && (state==EMPTY);
// buffer is enabled if next state is FULL and current state is EMPTY
wire m_en = m_valid_next & m_ready;
// master side is enabled if next m_valid and m_ready are true.

always_ff @(posedge clk)

if (!rstn) begin
s_ready <= 1;
// if reset is done, slave is ready to take data
{m_valid, b_valid} <= '0;
// first, master valid and buffer valid should be turned off
end else begin
s_ready <= state_next == EMPTY;
// if next is EMPTY, then there will be no data in next state, so slave is ready
// to take data
if (buffer_en) b_valid <= s_valid;
// if buffer is enabled, buffer data validity is equal to slave data validity
if (m_ready ) m_valid <= m_valid_next;
// if master data is ready to output, validity of master data is equal to the
// next master data validity
end

always_ff @(posedge clk) begin

if (m_en) m_data <= m_data_next;
// if master is enabled, master data should be next master data
if (buffer_en && s_valid) b_data <= s_data;
// if buffer is enabled and slave data is valid, store buffer with slave data
end
endmodule

AXI Stream Matrix Vector Multiplier (axis_matvec_mul.v)
This is the wrapper that integrates matvec_mul.sv and skid_buffer.sv.
It is written in old Verilog. It is the top module over the matrix vector
multiplier and the skid buffer.
These top modules (or wrappers) are written in old Verilog in order to avoid
multi-dimensional ports.

Figure 9

Figure 9 shows the block diagram of this module. Here, the clock enable is
the same as the slave ready signal from the skid buffer. This means that if the
skid buffer is not ready to take data, the matrix vector multiplier stalls.
This integrated module has AXI Stream signals.
Here LATENCY plays a significant role. LATENCY is the number of clock cycles
Matvec Mul takes to produce its output (LATENCY = DEPTH + 1).
s_valid must be delayed by the same number of cycles, so it is passed through
a shift register.
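
The effect of shifting the valid signal can be sketched as follows. This is an illustrative model under two assumptions: the clock enable (i_ready) is high on every cycle, and each list entry stands for one clock cycle. The function name delay_valid is made up for this example.

```python
def delay_valid(valid_in, latency):
    """Delay a one-bit valid stream by `latency` clock cycles."""
    # One-bit registers; regs[-1] plays the role of i_valid and the rest
    # play the role of the `shift` register in axis_matvec_mul.v.
    regs = [0] * latency
    valid_out = []
    for v in valid_in:
        valid_out.append(regs[-1])   # value visible during this cycle
        regs = [v] + regs[:-1]       # clock edge: shift by one position
    return valid_out


# C = 8 gives DEPTH = 3 and LATENCY = DEPTH + 1 = 4.
print(delay_valid([1, 0, 0, 0, 0, 0], latency=4))
```

A valid pulse applied in cycle 0 shows up at the skid buffer in cycle 4, the same cycle in which the corresponding result leaves the last adder register.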

module axis_matvec_mul #(
parameter R=8, // R - Number of rows in matrix and size of output vector
C=8, // C - Number of columns in matrix and size of input vector
W_X=8, // Width of an element in the input vector
W_K=8, // Width of an element in the matrix
LATENCY = $clog2(C)+1, // Number of clock cycles between giving an input
// and getting its output
W_Y = W_X + W_K + $clog2(C) // Width of an element in the output vector
)(
input clk, rstn,
output s_axis_kx_tready, // slave ready
input s_axis_kx_tvalid, // slave valid
input [R*C*W_K + C*W_X -1:0] s_axis_kx_tdata, // whole input in 1 flat array
// This contains input vector and
// input matrix
input m_axis_y_tready, // master ready
output m_axis_y_tvalid, // master valid
output [R*W_Y -1:0] m_axis_y_tdata // output data
);
wire [R*C*W_K-1:0] k; // matrix
wire [C*W_X -1:0] x; // input vector
assign {k, x} = s_axis_kx_tdata; // assigning input matrix and vector from input
// flat array.

wire [R*W_Y-1:0] i_data; // data bus to skid buffer from matrix vector multiplier
wire i_ready; // data ready bus from skid buffer to matrix vector multiplier

matvec_mul #( // connecting the matrix vector multiplier as in block diagram in figure 9

.R(R), .C(C), .W_X(W_X), .W_K(W_K)
) MATVEC (
.clk(clk ),
.cen(i_ready),
.k (k ),
.x (x ),
.y (i_data )
);

reg [LATENCY-2:0] shift; // shift register that delays valid by LATENCY cycles

reg i_valid;

always @(posedge clk or negedge rstn)
if (!rstn) {i_valid, shift} <= 0; // reset data valid bus to skid buffer
// and shift to 0
else if(i_ready) {i_valid, shift} <= {shift, s_axis_kx_tvalid}; // shift only while
// the skid buffer is ready to take data

skid_buffer #(.WIDTH(R*W_Y)) SKID ( // assigning wires to skid buffer as in block

// diagram in figure 9
.clk (clk ),
.rstn (rstn ),
.s_ready (i_ready),
.s_valid (i_valid),
.s_data (i_data ),
.m_ready (m_axis_y_tready),
.m_valid (m_axis_y_tvalid),
.m_data (m_axis_y_tdata )
);

assign s_axis_kx_tready = i_ready; // slave ready is the same slave ready of skid
// buffer as in figure 9

endmodule

UART RX to AXI Stream (uart_rx.sv)

Converts UART RX data coming from the PC to AXI Stream data.

Figure 10

Before a transfer starts, UART RX is held at 1.
To start a transfer, RX is pulled to 0 (the start bit). The first data bit is
then sampled 1.5 times the clocks per pulse (= clock frequency of the FPGA
board / baud rate, a.k.a. bits per second) after that, i.e. in the middle of
the first data bit.
Thereafter, one bit is sampled after every clocks per pulse.
After all 8 bits, a parity bit can be sent in the same way if wanted.
Then, to end the transfer, RX returns to 1 (the stop bit).
Figure 10 shows the state machine.
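
The framing and sampling rule can be sketched in Python. This is an idealised model, not a cycle-accurate copy of uart_rx.sv: it assumes no parity bit, least significant bit first (as the shift direction in uart_rx.sv implies), and a line that is sampled once per FPGA clock. The function names are invented for the example.

```python
def uart_frame(byte):
    """Start bit (0), 8 data bits with the least significant bit first, stop bit (1)."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]


def to_line(frame_bits, clocks_per_pulse):
    """Hold every frame bit on the line for clocks_per_pulse clock cycles."""
    return [bit for bit in frame_bits for _ in range(clocks_per_pulse)]


def receive_byte(line, clocks_per_pulse):
    """Sample 1.5 pulses after the start edge, then once every pulse."""
    first = clocks_per_pulse + clocks_per_pulse // 2
    samples = [line[first + i * clocks_per_pulse] for i in range(8)]
    return sum(bit << i for i, bit in enumerate(samples))


line = to_line(uart_frame(0xA5), clocks_per_pulse=4)
print(hex(receive_byte(line, clocks_per_pulse=4)))
```

Sampling in the middle of each bit leaves the largest margin against small differences between the sender's and the receiver's clocks.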

Figure 11

module uart_rx #(
parameter CLOCKS_PER_PULSE = 4, //200_000_000/9600 baud rate => 9600, clock
//frequency of fpga => 200_000_000
BITS_PER_WORD = 8, // bits in a word
W_OUT = 24 //R*C*W_K + C*W_X, // input to axis_matvec_mul.v
)(
input logic clk, rstn, rx,
output logic m_valid,
output logic [W_OUT-1:0] m_data
);
localparam NUM_WORDS = W_OUT/BITS_PER_WORD; // number of words

// Counters

logic [$clog2(CLOCKS_PER_PULSE) -1:0] c_clocks; // clock count between 2 bits

logic [$clog2(BITS_PER_WORD) -1:0] c_bits ; // count bits in a word
logic [$clog2(NUM_WORDS) -1:0] c_words ; // count number of words

// State Machine

enum {IDLE, START, DATA, END} state; // states in state machine

always_ff @(posedge clk or negedge rstn) begin

if (!rstn) begin
{c_words, c_bits, c_clocks, m_valid, m_data} <= '0; // reset clock count, bits
//count, word count and
// master valid and output
//data to 0s
state <= IDLE; // when reset, state is IDLE
end else begin
m_valid <= 0; // in other clock positive edges, first make master valid 0

case (state)

IDLE : if (rx == 0) // if Idle, check whether rx is 0

state <= START; // if so, state is START

START: if (c_clocks == CLOCKS_PER_PULSE/2-1) begin // after spending half of

//the clock per pulse
state <= DATA; // start data sending (state = DATA)
c_clocks <= 0; // make again clock count to 0
end else
c_clocks <= c_clocks + 1; // else increment

DATA : if (c_clocks == CLOCKS_PER_PULSE-1) begin // in DATA state, after

//spending clocks for 1 pulse
c_clocks <= 0; // set clock count to 0
m_data <= {rx, m_data[W_OUT-1:1]}; // shift the output data such
//that each rx values set
//into order

if (c_bits == BITS_PER_WORD-1) begin // after each clock per pulse,

//check whether if
// bit count is equals to bits
//per word (which is 8)
state <= END; // if so, state is END

c_bits <= 0; // bit count is set to 0

if (c_words == NUM_WORDS-1) begin // check whether word count
//reached the number of words,
m_valid <= 1; // if so, mark the output data
//as valid
c_words <= 0; // set word count to 0

end else c_words <= c_words + 1; // if word count didn't reach
//the number of words, increment
end else c_bits <= c_bits + 1; // if bits count didn't reach
//the bits per word, increment
end else c_clocks <= c_clocks + 1; // if clock count didn't reach
//the clocks per pulse,
//increment

END : if (c_clocks == CLOCKS_PER_PULSE-1) begin // if clock count == clocks

//per pulse,
state <= IDLE; // set state to IDLE again
c_clocks <= 0; // set clock count to 0
end else
c_clocks <= c_clocks + 1; // increment the clock count
endcase
end
end
endmodule

AXI Stream to UART TX (uart_tx.sv)

This converts the AXI Stream output of the matrix vector multiplier back to
UART TX output.
The AXI Stream output data is handled word by word. Each word is preceded by
a start bit (0) and followed by a stop bit (1) and 3 extra bits (1s) kept for
further usage, which converts the signal into UART form.
The newly formed bit group is called a packet (start bit 0 + word + stop bit 1
+ 3 extra bits (1s) => Packet).
The packets are then arranged in order into one flat vector. The TX output is
connected to bit 0 of that flat vector, and the flat vector is shifted right
once per pulse so that the data is sent bit by bit on the TX output.
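
The packet layout produced by the encapsulation expression in uart_tx.sv can be sketched as below. This is a sketch only: the function name tx_packet is hypothetical, the default sizes follow PACKET_SIZE = BITS_PER_WORD + 5 from the listing, and the list is ordered the way the bits appear on the wire (index 0 is sent first).

```python
def tx_packet(word, bits_per_word=8, packet_size=13):
    """Bits of one packet in the order they leave the tx pin."""
    end_bits = packet_size - bits_per_word - 1   # END_BITS in uart_tx.sv
    packet = [0]                                 # start bit
    # Data bits, least significant bit first.
    packet += [(word >> i) & 1 for i in range(bits_per_word)]
    packet += [1] * end_bits                     # stop bit and extra high bits
    return packet


def tx_stream(words):
    """Flat bit stream for several words, lowest-indexed word first."""
    stream = []
    for w in words:
        stream += tx_packet(w)
    return stream


print(tx_packet(0x5A))
print(len(tx_stream([0x11, 0x22, 0x33, 0x44])))
```

Because the trailing bits are all 1, the line simply looks idle between words, so a standard UART receiver on the PC side still sees ordinary 8-bit frames.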
The state machine is as follows.

Figure 12

module uart_tx #(

parameter CLOCKS_PER_PULSE = 4, //200_000_000/9600 same as in uart_rx.sv

BITS_PER_WORD = 8,
PACKET_SIZE = BITS_PER_WORD+5, //packet size after encapsulating
W_OUT = 24, // total width of the output vector (R*W_Y_OUT in the top wrapper)
// after multiplication and addition in the
// matrix multiplication process
localparam NUM_WORDS = W_OUT/BITS_PER_WORD // number of words
)(
input logic clk, rstn, s_valid,
input logic [NUM_WORDS-1:0][BITS_PER_WORD-1:0] s_data, // data output from matrix
//vector multiplier
output logic tx, s_ready
);
localparam END_BITS = PACKET_SIZE-BITS_PER_WORD-1; //stop bit + 3 extra bits
logic [NUM_WORDS-1:0][PACKET_SIZE-1:0] s_packets; //for separated packets for each
//words in packed array
logic [NUM_WORDS*PACKET_SIZE -1:0] m_packets; //to rearrange those seprated
//words in to 1 single flat array
// in order to send

genvar n;
for (n=0; n<NUM_WORDS; n=n+1)
assign s_packets[n] = { ~(END_BITS'(0)), s_data[n], 1'b0}; //encapsulation

assign tx = m_packets[0]; //assign the first bit in flat array into tx, the output

// Counters
logic [$clog2(NUM_WORDS*PACKET_SIZE)-1:0] c_pulses; // number of pulses
logic [$clog2(CLOCKS_PER_PULSE) -1:0] c_clocks; // number of clocks

// State Machine

enum {IDLE, SEND} state;

always_ff @(posedge clk or negedge rstn) begin

if (!rstn) begin
state <= IDLE; // when reset, state is IDLE
m_packets <= '1; // initiate all the data to be sent to 1
{c_pulses, c_clocks} <= 0; // set number of clocks and pulses to 0
end else
case (state)
IDLE : if (s_valid) begin // if data output from matrix vector multiplier is
//valid
state <= SEND; // switch to SEND state
m_packets <= s_packets; // rearrange the separated words into 1
//single flat array
end

SEND : if (c_clocks == CLOCKS_PER_PULSE-1) begin //if number of clocks reached

//clocks per pulse
c_clocks <= 0; //set number of clocks to 0

if (c_pulses == NUM_WORDS*PACKET_SIZE-1) begin // if number of

//pulses reached total number of
// bits to be sent
c_pulses <= 0; // number of pulses to 0
m_packets <= '1; // reset the whole flat array
state <= IDLE; // switch IDLE state

end else begin

c_pulses <= c_pulses + 1; // number of pulses increment
m_packets <= (m_packets >> 1); //shifting the whole flat array in
//order to send data via tx output
end

end else c_clocks <= c_clocks + 1; // number of clocks increment

endcase

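end

// s_ready is not driven anywhere in the listing above. Accepting new data only
// while idle is an assumption added here so that the output is defined.
assign s_ready = (state == IDLE);

endmodule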

Final MVM Wrapper (mvm_uart_system.v)

This connects the UART RX, UART TX and MVM AXI Stream modules together. This is
also an old Verilog file, for the same reason as axis_matvec_mul.v.

module mvm_uart_system #(
parameter CLOCKS_PER_PULSE = 200_000_000/9600, //200_000_000/9600
BITS_PER_WORD = 8,
PACKET_SIZE_TX = BITS_PER_WORD + 5,
W_Y_OUT = 32,
R=8, C=8, W_X=8, W_K=8
)(
input clk, rstn, rx,
output tx
);

localparam W_BUS_KX = R*C*W_K + C*W_X,

W_BUS_Y = R*W_Y_OUT,
W_Y = W_X + W_K + $clog2(C);

wire s_valid, m_valid, s_ready, m_ready;

wire [W_BUS_KX-1:0] s_data_kx;

uart_rx #(
.CLOCKS_PER_PULSE (CLOCKS_PER_PULSE),
.BITS_PER_WORD (BITS_PER_WORD),
.W_OUT (W_BUS_KX)
) UART_RX (
.clk (clk),
.rstn (rstn),
.rx (rx),
.m_valid(s_valid),
.m_data (s_data_kx)
);

wire [R*W_Y -1:0] m_data_y;

axis_matvec_mul #(
.R(R), .C(C), .W_X(W_X), .W_K(W_K)
) AXIS_MVM (
.clk (clk ),
.rstn (rstn ),
.s_axis_kx_tready(s_ready ), // assume always valid
.s_axis_kx_tvalid(s_valid ),
.s_axis_kx_tdata (s_data_kx),
.m_axis_y_tready (m_ready ),
.m_axis_y_tvalid (m_valid ),
.m_axis_y_tdata (m_data_y )
);

// Padding to 32 bits to be read in computer

wire [W_Y -1:0] y_up [R-1:0];
wire [W_Y_OUT -1:0] o_up [R-1:0];
wire [R*W_Y_OUT-1:0] o_flat;

genvar r;
for (r=0; r<R; r=r+1) begin
assign y_up [r] = m_data_y[W_Y*(r+1)-1 : W_Y*r];
assign o_up [r] = $signed(y_up[r]); // sign extend to 32b
assign o_flat[W_Y_OUT*(r+1)-1 : W_Y_OUT*r] = o_up[r];
// assign o_flat[W_Y_OUT*(r+1)-1 : W_Y_OUT*r] = $signed(m_data_y[W_Y*(r+1)-1 : W_Y*r]);
end

uart_tx #(
.CLOCKS_PER_PULSE (CLOCKS_PER_PULSE),
.BITS_PER_WORD (BITS_PER_WORD),
.PACKET_SIZE (PACKET_SIZE_TX),
.W_OUT (W_BUS_Y)
) UART_TX (
.clk (clk ),
.rstn (rstn ),
.s_ready (m_ready ),
.s_valid (m_valid ),
.s_data (o_flat ),
.tx (tx )
);

endmodule
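
The padding of each output element to 32 bits is a sign extension. A small Python sketch of the same operation is given below; the function name sign_extend is made up for the example, and the 19-bit input width assumes the default parameters (W_Y = 8 + 8 + 3).

```python
def sign_extend(value, from_bits, to_bits=32):
    """Sign extend a from_bits-wide two's complement value to to_bits bits."""
    value &= (1 << from_bits) - 1            # keep only the low from_bits bits
    if value & (1 << (from_bits - 1)):       # sign bit set: the value is negative
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)      # two's complement bit pattern


# 19-bit pattern of -1 becomes the 32-bit pattern of -1.
print(hex(sign_extend(0x7FFFF, from_bits=19)))
# A positive value is unchanged.
print(hex(sign_extend(5, from_bits=19)))
```

Extending to 32 bits lets the PC read each element directly as a standard 4-byte signed integer.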

FPGA Module (fpga_module.sv)
This instantiates mvm_uart_system.v and sets its parameters.

`timescale 1ns / 1ps

module fpga_module(
input logic clk, rstn, rx,
//input logic [NUM_WORDS-1:0][BITS_PER_WORD-1:0] s_data,
output logic tx//, s_ready
);

mvm_uart_system #(
.CLOCKS_PER_PULSE(1085), //125_000_000/115200 (8 ns clock, 115200 baud)
.BITS_PER_WORD(8),
.W_Y_OUT(32),
.R(8),.C(8),.W_X(8),.W_K(8)
) mvm_uart_system_0 (.*);

endmodule

PYTHON CODE EXPLANATION
mvm_uart_system.py
This Python code acts as a test bench to check whether the implemented RTL
works correctly on the FPGA.
A random 8 x 8 matrix and a random 8 x 1 vector are generated and multiplied
together, and the expected output is stored. The numpy library is used for this.
The generated matrix and vector are then sent to the FPGA over UART using the
pyserial library. After the calculation in the FPGA, its output is read back
and compared with the output computed by the Python script itself.
If both outputs are equal, the FPGA works properly. Otherwise, there is an
error in the process.

import numpy as np  # import numpy
import serial       # import pyserial

R = 8  # number of rows
C = 8  # number of columns

# serial.Serial(NAME_OF_UART_PORT, BAUD_RATE, READ_TIME_OUT)
ser = serial.Serial('/dev/ttyUSB0', 115200, timeout=0.050)

for i in range(20):
    print(f"****************** TEST {i+1} ************************\n\n")

    # random 8 x 8 matrix of signed 8-bit values
    k = np.random.randint(low=-2**7, high=2**7, size=(R,C), dtype=np.int8)
    # random 8 x 1 vector of signed 8-bit values
    x = np.random.randint(low=-2**7, high=2**7, size=(C), dtype=np.int8)
    # expected result: multiplication of matrix and vector in 32 bits
    y_exp = k.astype(np.int32) @ x.astype(np.int32)

    '''
    Send k & x
    '''
    # concatenate x and k (x is sent first)
    kx = np.concatenate([x, k.flatten()])
    # convert the concatenated x and k to bytes
    kx_bytes = kx.tobytes()

    print(f"\n\n {k=} \n\n{x=} \n\n{kx=} \n\nSent: {kx_bytes= } \n\n")

    # sending inputs to fpga
    no_of_bytes_sent = ser.write(kx_bytes)

    '''
    Receive y
    '''
    # receiving outputs from fpga: 'R' elements, each of size 4 bytes
    y_bytes = ser.read(R*4)
    y = np.frombuffer(y_bytes, dtype=np.int32)
    #print(y_exp.tobytes())

    print(f"Received: \n\n{y=} \n\n {y_exp=} \n\n\n")

    # check whether output from FPGA and python script are same
    assert (y == y_exp).all(), f"Output doesn't match: y:{y} != y_exp:{y_exp}"

PROCESS OF PROGRAMMING FPGA

1. Start a new project

Figure 13

2. Give a project name

Figure 14
3. Select source RTL files

Figure 15(a)

Figure 15(b)

4. Select the Constraint file (Zybo-Master.xdc)

Figure 16(a)

Figure 16(b)

##Clock signal

set_property -dict { PACKAGE_PIN L16 IOSTANDARD LVCMOS33 } [get_ports { clk }];

#IO_L11P_T1_SRCC_35 Sch=sysclk

create_clock -add -name sys_clk_pin -period 8.00 -waveform {0 4} [get_ports { clk }];

##Switches
set_property -dict { PACKAGE_PIN G15 IOSTANDARD LVCMOS33 } [get_ports { rstn }];
#rstn to PACKAGE_PIN G15

##Pmod Header JC
set_property -dict { PACKAGE_PIN V15 IOSTANDARD LVCMOS33 } [get_ports { tx }];
#tx to PACKAGE_PIN V15
set_property -dict { PACKAGE_PIN T11 IOSTANDARD LVCMOS33 } [get_ports { rx }];
#rx to PACKAGE_PIN T11

This is the constraint file. It maps the inputs and outputs of the RTL design
to the physical pins of the FPGA board and defines the clock.
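
The clock constraint also explains the CLOCKS_PER_PULSE value of 1085 used in fpga_module.sv. The arithmetic can be sketched as follows, assuming the 8.00 ns period from the constraint file and the 115200 baud rate used by the Python script; the function name clocks_per_pulse is made up for this example.

```python
def clocks_per_pulse(clock_period_ns, baud_rate):
    """Number of FPGA clock cycles that one UART bit lasts."""
    clock_hz = 1e9 / clock_period_ns     # 8 ns period -> 125 MHz
    return round(clock_hz / baud_rate)


print(clocks_per_pulse(8.00, 115200))
```

The result has to match the value passed to mvm_uart_system, otherwise the FPGA and the PC would disagree on the bit timing.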

5. Select the board type

Figure 17

6. Finish creating the project

Figure 18
Now you are in the project window.

Figure 19

7. Make sure that the FPGA module is selected as the top module.

Figure 20

8. Generate Bitstream (Synthesis and Implementation)

Figure 21

9. Then wait until synthesis and implementation complete

Figure 22

10. Then connect the ZYBO FPGA board to the computer, turn it on, open
Hardware Manager, select Open Target and then Auto Connect.

Figure 23

11. Select program device

Figure 24

12. Finish programming the device

Figure 25

13. Connect the USB-tty adapter to both the FPGA and the laptop. Check whether
it is connected in Device Manager.

14. Then reset the FPGA by turning sw[0] off and on

15. Then run the python script

16. Observe the python shell

Figure 26

If the result computed in the Python shell is the same as the result received
from the FPGA, the implementation is successful. Otherwise there is an error.

UTILIZATION REPORT
After synthesis, before implementation, an estimated utilization report is
generated.

Figure 27(a)

Figure 27(b)

After implementation, the real utilization report is generated.

Figure 28(a)

Figure 28(b)

POWER REPORT

Figure 29

TIMING REPORT

Figure 30

AREA REPORT

Figure 31
