0% found this document useful (0 votes)
22 views18 pages

Most Resource Efficient Matrix Vector Multiplication On FPGAs

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views18 pages

Most Resource Efficient Matrix Vector Multiplication On FPGAs

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 18

Received 7 December 2022, accepted 26 December 2022, date of publication 5 January 2023, date of current version 12 January 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3234622

Most Resource Efficient Matrix Vector


Multiplication on FPGAs
ALEXANDER LEHNERT 1 , PHILIPP HOLZINGER 2 , SIMON PFENNING 2 ,
RALF MÜLLER 3 , (Fellow, IEEE), AND MARC REICHENBACH 1 , (Member, IEEE)
1 Chair of Computer Engineering, Brandenburg University of Technology Cottbus-Senftenberg, 03046 Cottbus, Germany
2 Chair of Computer Architecture, Friedrich-Alexander University Erlangen-Nürnberg, 91058 Erlangen, Germany
3 Institute for Digital Communications, Friedrich-Alexander University Erlangen-Nürnberg, 91058 Erlangen, Germany

Corresponding author: Alexander Lehnert ([email protected])


This work was supported in part by the German Research Foundation Deutsche Forschungsgesellschaft (DFG) through the Project
Berechnungscodierung under Grant RE 4182/4-1 and Grant MU 3735/8-1.

ABSTRACT Fast and resource-efficient inference in artificial neural networks (ANNs) is of utmost
importance and drives many new developments in the area of new hardware architectures, e.g., by means
of systolic arrays or algorithmic optimization such as pruning. In this paper, we present a novel method for
lowering the computation effort for ANN inference utilizing ideas from information theory. Weight matrices
are sliced into submatrices of logarithmic aspect ratios. These slices are then factorized. This reduces the
number of required computations without compromising on fully parallel processing. We create a new
hardware architecture for this dedicated purpose. We also provide a tool to map these sliced and factorized
matrices efficiently to reconfigurable hardware. By comparing to the state of the art FPGA implementations,
we can prove our claim by lowering hardware resources measured in look-up-tables (LUTs) by a factor of
three to six. Our method does not rely on any particular property of the weight matrices of the ANN. It works
for the general task of multiplying an input vector with a constant matrix and is also suitable for digital signal
processing beyond ANNs.

INDEX TERMS Constant matrix multiplication, neural networks, computer architecture, reconfigurable
architectures, computational efficiency.

I. INTRODUCTION In the past, there were efforts to improve the computational


Artificial Neural Networks (ANNs) are widely used today complexity of these operations [12], and also approximate
in different application fields such as image processing [1], methods were explored [13].
[2], [3], [4], speech recognition [5], [6], [7] or predictive With the goal of area, as well as power efficient archi-
maintenance [8], [9]. Compared to classical signal process- tectures implementing CMMs of ANNs or DSP algorithms,
ing algorithms, they can achieve a very high classification several approaches in the past were researched. Then can
quality without manual design of handcrafted algorithms. roughly be divided into the three following domains.
While these outstanding features will enable to solve even 1) Algorithm optimization such as quantization (use num-
more and more complex problems, computational effort of bers with a limited bit width) and pruning (e.g. the
such ANNs could become very large and energy intensive. complete removal of neurons)
This is especially true for inference in ANNs which mainly 2) Specialized dataflow architectures such as systolic
relies on constant matrix vector multiplications (CMVMs) arrays or coarse-grained reconfigurable arrays
and is the focus of this paper. Throughout the paper the terms 3) Advances in technology such as crossbar arrays or
CMVM and constant matrix multiplication (CMM) are used memristive memory cells
interchangeably. Next to ANNs also many digital signal pro- These approaches can be combined in a smart way, e.g.
cessing (DSP) algorithms rely heavily on CMMs [10], [11]. with 1) the ANN is modified in a pre-defined way, which
The associate editor coordinating the review of this manuscript and an architecture 2) can utilize as a priori knowledge to build
approving it for publication was Alireza Sadeghian. very fast accelerator architectures. This is also true for DSP

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 11, 2023 3881
A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

algorithms which are mainly using fixed matrices and thus per row. Furthermore, all those additions can be executed,
provide even more application-independent a priori knowl- in parallel. From now on we will refer to such well-behaved
edge. We present 1) a decomposition algorithm to restructure matrices as CC-matrices, referring to the computation coding
matrices based on a priori knowledge to then 2) provide an (CC) decomposition algorithm which they originate from.
architecture that makes use of the restructured information to As firstly described in [16] and [15], we can transform
lower hardware cost while offering a high throughput. With a matrix into a set of CC-matrices. All these new matri-
the goal of high throughput in mind, we refer to an efficient ces are well-behaved meaning each row of them features a
design as one that is 1) resource-aware, i.e. minimizes hard- fixed number of non-zero entries per row which are signed
ware cost, while maximizing 2) throughput and 3) energy- powers of two. When implementing a matrix-vector product
awareness. Architectures are designed fully rolled-out to architecture the well-behaved property of the underlying CC
preserve high throughput and are compared as such. Compet- matrices leads to a lower computational effort because no
ing designs that exist for ANNs are, e.g., FINN [14], a design multiplications are needed anymore as they can be replaced
framework for quantized neural networks, and for general by shifts. Moreover, the prior knowledge of the structure of
CMVM, e.g., approaches based on Canonically Signed Digit F1,1 to FS,P will enable the creation of dedicated hardware
(CSD) representation [13]. circuits, which perfectly utilize this approach. Nevertheless,
To optimize computation effort in ANNs, a close look to as shown in (3), this transformation will introduce a small
their internal structure is necessary: The architecture of an error. Fortunately, similar to any fixed point arithmetic the
ANN consists of several layers. For the inference of an ANN, error can be determined and lowered arbitrarily well which is
the equation important for the overall accuracy of the ANN inference.
While in [16] and [15], the basic idea of the matrix factor-
a = φ(Wv + b) (1) ization approach was already described, a hardware realiza-
has to be solved for each layer. Here and in the following, tion to prove the idea with real numbers was not given yet.
W denotes the weight matrix, v the input vector, a the output Furthermore, [16] and [15] suggested a horizontal decompo-
vector, b the bias vector, and φ the so-called activation func- sition of the matrix W. With respect to hardware realization,
tion. While in current ANNs, the scalar functions φ involve the vertical decomposition proposed in (2), is much better
low computation effort (e.g. rectified linear unit (ReLU)), suited as we will show later on. Therefore, with this paper
as they operate element-wise, the matrix-vector multipli- we first introduce a hardware realization, based on reconfig-
cation Wv remains computationally intensive. Therefore, urable logic (FPGAs), which was dedicatedly designed for
we will present in this paper a novel approach to optimize this approach. We will show, that using this new approach and
exactly this calculation which also can be directly applied to our hardware architecture, we can save up to 80% hardware
DSP algorithms based on CMMs. For this purpose, we solve resources compared to a standard design flow on FPGAs.
this problem also on the above-mentioned two levels, i.e. 1) The underlying matrices of many DSP algorithms, e.g.
the algorithmic level and 2) provide a dedicated hardware Fourier transforms, are fixed and application-independent.
architecture. Thus, it is simple to design an architecture implementing
The basic idea we propose is to vertically slice the unre- them. For the case of ANNs, these matrices differ from appli-
stricted matrix into S submatrices cation to application and layer to layer. But due to the fact,
that weight matrices are created only once for an application,
W = [W1 |W2 | . . . |WS ] (2) but are reused for every inference, we can utilize the reconfig-
uration ability of FPGAs to address any ANN. Moreover, the
which are subsequently factorized into P matrix factors as
internal structure of the CC matrices can be perfectly utilized
proposed in [15]
by FPGAs, since shift-operations are just wiring on an FPGA,
Ws ≈ Fs,P · · · Fs,1 Fs,0 . (3) which will cost neither additional hardware resources nor
energy. Implementations on application-specific integrated
In the application of ANNs, the matrices that are decomposed circuits (ASICs) also benefit from the latter point but lack the
are the weight matrices. As we will discuss in Section III aspect of reconfigurability. This means, FPGAs will be the
in full details, this decomposition (so-called computation perfect candidate for this kind of algorithm. In this paper we
coding) will bring the following advantages: show the combination of matrix decomposition and recon-
• The matrices F1,1 to FS,1 do not require computations at figurable logic for the first time. In Figure 1, this concept is
all. shown graphically and compared to the state-of-the-art (SoA)
• The matrices F1,1 to FS,P will be sparse with a well solution using an ANN as example: At the left, an example
defined structure. of a weight matrix is shown. For general CMMs, as they are
• The matrices F1,1 to FS,P will only contain numbers used in, e.g., DSP algorithms, the approach stays the same,
related to a power of two. only the matrix differs. The traditional approach to a matrix-
The matrices F1,1 to FS,P are sparse and contain only vector-product architecture requires many multipliers and
signed powers of two such that the multiplication of any of adders. In contrast, our approach presented later in Section IV
them with a vector requires only a fixed number of additions benefits from the well-behaved structure of the CC-matrices

3882 VOLUME 11, 2023


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

non-bold letters denote indices running from 0 or 1 to the


respective upper case letter.

II. RELATED WORK


A. HARDWARE ARCHITECTURES
One of the main drivers of deep learning was the vast amount
of computational resources graphic processing units (GPUs)
could provide to train and evaluate sufficiently powerful neu-
ral networks. However, with the widespread usage of deep
learning and the expansion to further domains like automo-
tive, mobile, and edge devices, additional factors like energy
efficiency, latency, and runtime predictability became more
urgent. For this reason, a substantial amount of research has
focused on the acceleration of neural networks with spe-
cialized hardware in the last years [17]. Hereby, three main
directions of optimization can be found in literature, which
are not mutually exclusive, but are often combined for even
greater benefits.
The first category is the design of data-driven digital
circuits and its automation. While GPUs with their single-
instruction multiple-threads-(SIMT)-style architecture offer
many computational units with less control logic than central
processing units (CPUs), they are still fully programmable.
Hence, they inherently have a considerable amount of over-
FIGURE 1. Comparison between state-of-the art mapping of ANN (at the
top) and linear computation coding (at the bottom) onto reconfigurable head, which is not needed for the smaller subset of operations
hardware. in deep learning. Therefore, specialized dataflow architec-
tures came in the focus of interest. One of the first candidates
for this purpose were systolic arrays, which were already
and does only require shifters and a fixed small amount of concisely described in 1978 [18]. Their locally connected
adders. Additionally, the linear computation coding approach structure of processing elements not only reduces the con-
decomposes the original matrix into multiple CC-matrices. trol hardware, but also increases the amount of local data
Due to their unique structure, a resource-aware hardware movement. As a consequence of the fewer slow external
mapping is possible, which results in limited usage of adders memory accesses, this approach also mitigates the widening
and a short critical path. processor-memory gap, which has the potential to consider-
This paper is structured as follows. In the introduction, ably improve performance and energy consumption. Due to
we show the importance of this topic and explain the basic these benefits, the concept has been used in many current
idea. Section II discusses previous related work in two designs and most prominently in Google’s Tensor Processing
domains, first hardware architecture approaches and second Unit (TPU) [19], [20], [21]. For the same reasons, dataflow
developments from an algorithmic point of view. Further, processing schemes have been similarly applied in varying
in Section III we present our computation coding approach scales to other architectures [22]. On a small scale, GPUs
of decomposition of matrices in a detailed way. Afterwards, nowadays also incorporate specialized cores that efficiently
Section IV explains the architecture and hardware realization process 4 × 4 matrix-matrix multiplications [23]. Further-
of our approach on reconfigurable hardware, first for general more, coarse-grained reconfigurable arrays (CGRAs) have
CMMs and later for ANNs. In Section V, we prove the been employed as a trade-off between programmability and
working principle of our architecture by explaining our exper- efficiency [24], [25]. Hereby, the programmable processing
iments and evaluating their results. Additionally, Section VI cores directly source data from and provide data to other
provides a further implementation of a multi layer percep- nearby cores via a routing fabric to keep data as local as
tron (MLP) with further efficiency comparisons to other possible. In the other extreme, several approaches propose to
implementation methods. Finally, Section VII concludes the entirely forgo control flow and generate dedicated accelera-
paper. tors for specific networks [14], [26]. These architectures usu-
Throughout the paper, matrices and vectors are denoted by ally map layers or the complete model to own hardware for
boldface upper case and boldface lower case letters, respec- the highest efficiency at the cost of flexibility. While automa-
tively. Non-bold indexed letters denote the entries of the tion frameworks for all kinds of deep learning accelerators
respective matrices and vectors in boldface. Design vari- are nowadays indispensable, in particular these latter types
ables are denoted by non-bold upper case letters. Lower case make heavy use of network metadata like the number ranges

VOLUME 11, 2023 3883


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

of input, intermediate, and output values or the composition used in traditional circuits and are therefore more reliable.
of the weights matrices [27], [28]. Regarding novel components, memristive memory cells have
Due to the direct influence of the network structure on the become a field of active research for deep learning [34], [35],
efficiency of the accelerator circuits, optimizations usually [36], [39]. As non-volatile, electrically alterable resistances,
already begin at the network itself. In this second direc- they enable storage and in-memory computing in the same
tion of optimization two main approaches have emerged in device. Furthermore, they promise a high cell density and
literature. First, the quantization of weights and data from simpler fabrication in conjunction with digital logic cells
32 bit floating point to a fixed point representation with a due to the full complementary metal-oxide-semiconductor
smaller bit width [29], [30]. This method has two benefits. (CMOS) compatibility [40]. Aside from the classical data
It reduces the complexity of arithmetic operations while at processing with electric circuits, silicon photonics also has
the same time decreasing the amount of memory needed been presented as an approach for deep learning [41], [42].
for weights. Therefore, a single operation is not only more Due to its unprecedented possible bandwidth, photonic com-
memory efficient, but more can be calculated at once with the puting systems promise high performance and energy effi-
same memory bandwidth. As smaller bitwidths can also be ciency. However, there is still a long way until these systems
found in other application domains, traditional architectures are industrially viable outside of the network communication
of CPUs and GPUs already incorporate vector processing sector [43]. Although, our approach presented in this paper is
capabilities. However, these are usually limited to fixed sizes based on classical electrical circuits, it can be combined with
of 8 bit, 16 bit, 32 bit and 64 bit. Despite the recent support of these technology-driven optimizations in the future.
further operand types like int4 and bfloat16, the optimal
values can heavily vary between neural networks and do B. ALGORITHMIC FUNDAMENTALS
often not coincide with these fixed widths. Therefore, several From the pioneering work of Strassen [44] and improvements
approaches use hardware that is specifically adapted for the of the same [45], we know that matrix multiplication can
applications by quantizing the network as far as ternary or be performed more efficiently than by the standard method
binary weights [14], [26], [28]. Adjacent to the quantization, of calculating inner products of rows and columns. How-
pruning has been established as the second way to prepare ever, the Strassen algorithm brings only benefits for matrix
a network for optimized hardware [31]. Here, weights are ranks in the thousands and beyond. Furthermore, applying
successively set to zero and then stored in compressed for- Strassen’s ideas to ANNs requires buffering the input vectors
mats. Although this method makes the control flow logic until an input matrix with sufficiently large rank has been
more complex to parse the weight storage, the overall amount accumulated. Thus, the Strassen algorithm and its further
of arithmetic operations is drastically reduced as multiplica- improvements have remained a well-studied subject in the-
tions and additions with 0 can be completely stripped away. oretical computer science, but not entered algorithm design
This leads to a sparse matrix multiplication, which can be for matrix-vector multiplication in ANNs. In this work, we
calculated faster and with less energy than the original [32], follow a very different line of ideas, instead.
[33]. Further research has explored the optimization of con- Higher accuracy of computation, in general, results in
stant matrix multiplication by converting entries to the CSD higher computational load. Any improvement in the former
representation and then optimizing the resulting adder tree is thus equivalent to a reduction of the latter. Both are two
for the matrix-vector multiplication [12], [13]. While some sides of the same tapestry, which is utilized in the sequel.
approaches discuss finding an optimal exact solution to the The common way to represent matrices is to element-wise
matrix vector multiplication [12], there are also efforts to quantize their entries. The more accurate the quantization
reduce accuracy for further reduction in hardware cost of the of each entry, the more accurate is the whole matrix. The
resulting designs [13]. entries are typically quantized by the common signed inte-
While such dataflow architectures and their network opti- ger representation. Each additional binary digit halves the
mizations are also the main focus of this paper, they can average quantization error. This can be improved by Booth’s
be further combined with technology driven designs. This CSD representation [46]. Each CSD reduces√the average root
third main direction of research extensively utilizes uncon- mean-square quantization error by a factor 28 [16].
ventional or novel circuitry and memory cells. As such, one When implementing small CMMs, as they appear in, e.g.,
of the central structures are crossbar arrays, which usually DSP algorithms, the CSD representation brings further bene-
follow the general principle of dataflow architectures. They fits. Instead of implementing full multiplication units, we can
internally store the network weights and perform analog convert the sum of products (SOP), that represents the compu-
multiplications and additions as the information medium tation of one line of the CMM, into a directed acyclic graph
propagates through them [34], [35], [36]. Hereby, a num- (DAG) of adders which then can be minimized by reusing
ber of different technologies with their own benefits and intermediate results where possible [12]. As this problem
drawbacks have been investigated. On the still rather con- is NP-hard [12], finding good solutions for large matrices,
ventional side are designs based on capacitors [37] and as they appear, e.g., in ANNs, is not viable. In more recent
common non-volatile memory cells like flash and silicon- work, inaccurate implementations are considered, trading
oxide-nitride-oxide-silicon (SONOS) [38], which are already accuracy for even lower hardware costs [13].

3884 VOLUME 11, 2023


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

The element-wise CSD representation is simple, but leaves chosen such that Fs,0 and Ws have the same size. This ini-
much room for improvement. The coordinate rotation digital tialization works well for most weight matrices occurring,
computer (CORDIC) algorithm [47] represents 2×2 matrices in practice. However, it may perform poor in some excep-
as products of 2 × 2 matrix factors that only contain signed tional cases, e.g., for matrices that contain only positive or
powers of two and is used to improve the calculation of, e.g., only negative entries. In that case other initializations should
trigonometric functions. Recent work on linear computation be taken, see [15] for details.
coding in [15] shows that rectangular matrices are much bet- We calculate the matrix factor Fs,p given the previous
ter suited to be decomposed into matrix products than square matrix factors Fs,p−1 and the sub-matrix Ws . With M denot-
matrices. Furthermore, the savings grow unboundedly with ing the number of rows in Ws , p > 0, and some parameter E,
matrix size. This behavior was first observed for the particular we solve
example of the mailman algorithm [48]. While the latter is
f s,p,m = argmin ws,m − ϕFs,p−1 · · · Fs,0 (5)
too inflexible for practical applications, modern methods of 2
ϕ∈{0,±2Z }M :kϕk0 =E
linear computation coding work well for matrices of almost
any size and aimed accuracy of computation. The particular row-wise for all rows f s,p,m of Fs,p . There ws,m and
algorithm utilized in this work is detailed in the sequel. kϕk0 denote the m-th row of Ws and the number of non-zero
components in ϕ, respectively. We stop the recursion if the
III. OUR METHOD: COMPUTATION CODING - desired accuracy is reached, i.e. the Frobenius norm of the
DECOMPOSITION OF MATRICES difference between the approximation and the exact weight
Our objective is to decompose a matrix W in such a way that matrix is small enough. Thus, the desired accuracy deter-
the product Wv can be computed with minimum effort on an mines the number of non-trivial factors P. While the initial
FGPA. and trivial factor Fs,0 is rectangular having the same size as
The multiplicative decomposition algorithm in [15] works Ws , all subsequent factors Fs,1 to Fs,P are square.
much better for rectangular than for square matrices. There- The optimization problem (5) is NP-hard. Therefore,
fore, we first slice the matrix W into S tall sub-matrices Ws we resort to an approximate solution based on a quantized
as in (2). Similarly, the vector v is cut into S sub-vectors vs version of matching pursuit [50]. First, we find the first
† † †
such that v† = [v1 |v2 | . . . |vS ]. Thus, we have non-zero entry of the vector ϕ. For that purpose, we cal-
culate all matchings and quantize their scale factors to the
S
X most suitable signed powers of two. Then, we pick the best
Wv = Ws vs . (4) matching with respect to the Euclidean distance to the vector
s=1 ws,m . Given this first entry of ϕ, we find the second entry of ϕ.
Note that reference [15] slices the matrix W into wide, not We repeat that, until E non-zero entries are found.
tall sub-matrices. This requires the subsequent factorization An example of such a decomposition is given in the sequel.
algorithm to operate on the transposed matrices. Although Consider the matrix1
 
horizontal slicing results in a similar number of required 0.5377 0.3188
computations, it is less suited for pipelining: Vertical slicing  1.8339 −1.3077 
W1 =   −2.2588 −0.4336  . (6)

ensures that all computation paths have exactly the same
lengths, cf. equal number of nonzero entries in the rows in 0.8622 0.3426
(7). Horizontal slicing, however, results in varying lengths of
For P = 2 and E = 2, we approximate it as
computation paths, cf. equal number of nonzero entries in the  
columns in (7). With vertical slicing we ensure minimal clock 1
0 1 1 

skew in hardware implementation. All paths have the same  1 0 −
32
 1  2 4 
lengths and thus a minimal clock skew is guaranteed. Hori- − 1 0 0 2
 −1 
  −2 − 1  . (7)
2
 
zontal slicing on the other hand leads to varying paths lengths W1 ≈   1
 
and thus the clock skew increases. It is well established  0 − 1 0  2
 8 
1

that an optimized clock skew is key to designing efficient  1 1 
1
0 − − 0
hardware [49]. 16{z 2 }| {z 4 }
Each tall sub-matrix Ws is decomposed into P nontrivial |
F1,1 F1,0
F1,2
matrix factors Fs,p as denoted in (3). For this purpose, we use
a recursive approach to be detailed in the sequel. The recur- In order to approximate the matrix-vector product W1 v1 ,
sive approach is not optimal and more sophisticated decom- we first calculate the vector F1,1 F1,0 v1 which requires four
positions may yield even better results. However, it performs additions, and subsequently multiply this vector by Fs,2
well and allows for a matrix decomposition with reasonable which also requires four additions, so eight additions in total.
complexity.
We initialize the recursion with the trivial factor Fs,0 = 1 This matrix is found when executing the command A = randn(4,2)
[I|0]† with I and 0 denoting the identity and the all-zero right after starting Matlab. It is chosen to demonstrate that this example is not
matrix, respectively. The sizes of the matrices I and 0 are made up particularly to promote this algorithm, but it is a generic example.

VOLUME 11, 2023 3885


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

FIGURE 2. Comparison of data dependencies of the standard implementation and the computation coded computation of a CMM
using a 4 × 4 matrix. The colored data paths F1,0 to F1,2 of the CC implementation are also presented in Equation 7.

The signal-to-quantization noise ratio the edges connecting them depict the data dependencies. Each
entry of the resulting output vector is the sum of products
kW1 k2F
SQNR = (8) of input vector entries and the matrix entries in the corre-
kW1 − F1,2 F1,1 F1,0 k2F sponding column. Therefore, all entries of the result vector
for this example is given by 24 dB which corresponds to directly depend on all entries of the input vector. The structure
the accuracy of 4-bit signed-integer arithmetic. The SQNR in Figure 2a is generic for all 4 × 4 matrices that do not
measured in decibels was found in [15] to scale linearly with contain zeros. The particular properties of the constant matrix
the number of factors P, so any desired accuracy can be are encoded in the coefficients of the linear combinations at
reached. Note that a direct computation of W1 v1 , irrespec- the output nodes and are not visible in Figure 2a.
tive of the accuracy, would require four additions and eight The computation coded decomposition of the same matrix
multiplications. is visualized in Figure 2b for P = 3 factors. No data depen-
By design, any matrix factor Fs,p , p > 0 contains exactly dencies are lost by applying the proposed decomposition.
E nonzero elements per row. These E non-zero elements are Instead, previously direct data dependencies are exchanged
signed powers of two. Multiplying such a matrix to a vector, with indirect dependencies. The result of the CMM is not
thus, requires at most E shifts and exactly E − 1 additions computed directly from sums of products, but by repeated
(or subtractions) per row. For an M × N weight matrix, CMMs with CC-matrices for each slice of the original matrix,
these are M (E − 1) additions (or subtractions) for any matrix followed by accumulation of all slice approximations. The
factor Fs,p . In total, there are PS of these matrix factors. 4 × 4 matrix W of the CMM is sliced into S = 2 slices of
Moreover, we have (S −1)M additions for calculating the sum size 4 × 2. Decomposition of the first slice W1 is presented
in (4). Thus, the total number of additions and subtractions to in Equation 7 for P = 2 factors. Each slice computation is
compute Wv is now only dependent on the respective two elements of the
input vector. After P = 3 factors with one addition each
(E − 1)MPS + (S − 1)M . (9) (E = 2), the slice-wise computation is finished and the final
accumulation takes place.
The choices of the three parameters P, S, and E determine
Between the first and second matrix factor, a node appears
both the computational effort and the accuracy of the approx-
to be missing. Instead of four nodes, there are only three.
imation (3). Setting
A one-to-one translation of (7) would actually make this
S ≈ N / log2 M (10) fourth node to show up. However, due to the all-zero column
in F1,2 , this node has not any outgoing edges. Thus, it does not
is typically not a bad choice. The optimum value of S often have any influence on the final result and can be eliminated.
deviates from (10) by at most a factor of two in one or Optimizing VDHL compilers remove such nodes automati-
the other direction. For given parameter S, the parameters P cally. See also [51] for details.
and E are chosen such as to reach the desired accuracy of The two data dependency graphs mainly differ in two
computation. points:
In the standard approach of computing CMM, every output
is a direct linear combination of any input. This is visualized • For CC, the structure of the graph depends on constant
in Figure 2a, where every node represents a vector entry and matrix W, while for the standard approach, it does not.

3886 VOLUME 11, 2023


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

CC encodes the information about the matrix W pre- The standard implementation of a CMM consists of two
dominantly in the structure of the graph and only to a steps, the multiplication itself and the column-wise accumu-
minor extent in the weights of the linear combinations lation per entry of the result vector. Consider the product
at the nodes.
z = Wv (11)
• For CC, all nodes have a fixed number of incoming
edges which can be freely chosen by the design variables where W ∈ RM ×N , v ∈ RN and z ∈ RM . When implement-
E and S. For the standard approach, however, the number ing the product (11) in a naive architecture, the computation
of incoming edges is equal to the number of rows of the can be separated into two distinct steps, 1) the multiplications
matrix W. themselves and 2) row-wise accumulations. Thus, we can
Our approach allows, due to choice of the parameters define the intermediate matrix W0 ∈ RM ×N whose rows
E and S, to design the number of incoming edges to computa- w0m = wm v are the element-wise (Hadamard) products
tion nodes freely. Thus, we reduce the node activity from four of the rows of W with the input vector v. As explained, now
edges to two in our example. This reduction in node activity we need to row-wise accumulate thePmatrix W0 to compute
is much more pronounced for larger matrices. For the purpose the resulting vector z with zm = N 0
n=1 Wm,n . As already
of readability we chose to present this small 4 × 4 example. alluded to, we want to modify the product (11) to simplify
The matrix decomposition described above is not the only the hardware required to implement it. Instead of using the
sensible method of linear computation coding. A recent original matrix W, we make use of the approximate matrix
alternative requiring even less additions is reported in [51]. decomposition algorithm presented in Section III. This results
Whether the method in [51] is also well suited for implemen- in the approximation of W such that
tation on FPGAs is to be explored in future work.
S Y
X P
Wv ≈ Fs,p v (12)
IV. OUR METHOD: ARCHITECTURE AND
s=1 p=0
HARDWARE-REALIZATION
In this section, we propose an architecture for implementing where Fs,p ∈ RM ×M for p > 0. There are a few parameters
CC-matrix-vector products. It utilizes the particular proper- that determine the number of matrix-vector products needed
ties of the CC-matrices. Results of several experiments on to implement this decomposition. The algorithm decomposes
the scalability of our architecture are presented and further W into slices of width W , as shown in (13).
aspects needed for the real-world implementation are elabo- N
rated upon. W = (13)
Our objective is to design an optimized architecture imple- S
menting MLPs, as a general form of ANNs, that can be real- Thus, with increasing width W the number of slices
ized on FPGAs. A MLP is a sequence of neural layers, each decreases. The parameters P and E are used to control the
layer consisting of a set of neurons with activation functions. accuracy of the approximate decomposition which increases
The resulting activations of a layer can be computed element- with P and E meaning that more factors and less sparsity in
wise or, when represented as a vector, using a matrix-vector these factors yield a more precise result. Typically we want to
product concatenated with a non-linear activation function set P and E such that we perform (at least) equally accurate
as shown in (1). There, a is the resulting activation of the as the integer-arithmetic used by the naive implementation.
current layer with weight matrix W, input v, bias b, and Each of the matrices Fs,p is a CC-matrix with the following
activation function φ. The inputs to a layer are the activations properties that can be controlled by the algorithm:
of the previous layer or, in the case of the first layer, the • There is a fixed number of elements that are unequal to
input to the MLP itself. Disregarding the activation function, zero in each row of the matrix.
it is immediately obvious that the matrix-vector product is • The domain of values that matrix entries can be is fixed
the most computationally expensive component of (1). Thus, to a finite set.
when designing an optimized MLP architecture, it is crucial The proposed architecture which is depicted in Fig. 3 benefits
to focus on said multiplication. This coincides with the imple- from both points mentioned and the following paragraph
mentation for general CMMs, as the ANN design consists, explains how both constraints are exploited.
next to other parts, of CMM units. Our approach replaces We restrict each row of the matrix to consist of exactly two
the original CMM with multiple CC-matrix-vector products, non-zero elements, i.e. E = 2. As each element of the output
or in other words CMMs where the underlying matrices are vector zm is calculated as the inner product of two vectors with
CC-matrices and are created using the approximate matrix one of them containing only two non-zero entries, we only
decomposition algorithm presented in Section III. need one addition to compute zm . This holds for any of the M
components of z, so there are M additions needed in total for
A. ARCHITECTURE this step. When implementing a general matrix vector product
This section will present our architecture starting with one needs to choose between a linear adder and a tree adder
explaining the components of a CMM and discussing benefits effectively choosing between minimizing hardware cost and
stemming from certain restrictions to them. critical path length. To implement a matrix vector product

VOLUME 11, 2023 3887


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

FIGURE 4. Architecture of approximate matrix-vector product


(CMVM-Block) Wv = z where W is decomposed into the CC-matrices F1,1
to FS,P . The input for Fs,1 is the s-th part of v separated into S slices and
zeros such that the vector has N elements. After parallel computation, the
partial results zs are accumulated to z.

beginning and choosing between the inverted and the original


input vector at the time of shifting. From an overall per-
FIGURE 3. Architecture of CC-matrix-vector product (F-Block) Fv = z with
input v (left) and output z (right). Additional multiplications with −1 are spective an implementation of a CC-matrix-vector product
taken care of by the inverter-module denoted by inv. The blue shifters can compared to a naive implementation of a general product has
have varying implementations while the black part is fixed. Depending on
the type of implementation of the blue part, the red network is needed or
a significantly lower hardware cost and critical path length.
not. In our implementation the blue and red parts are replaced by As was pointed out, architecture in Fig. 3 only implements
hard-wired shifts. the CMM for CC-matrices. To implement a full product we
need to assemble multiple instances of the mentioned archi-
tecture as shown in Fig. 4. The architecture can be divided
with the described restriction we only need one adder per into three sections, construction of input vectors, multiplica-
matrix row optimizing both hardware cost and critical path tion with CC-matrices and accumulation of partial results.
length at the same time. The construction of the input vectors is needed as a first
As a side note, with an increase in E the number of adders step, because each row of CC-matrix-vector products only
required to accumulate the intermediate results per row may approximates a slice of the original matrix. Thus, we only
increase. The optimization problem here is between mini- need the corresponding section of the input vector v. To match
mizing hardware cost by choosing a linear adder structure or the dimensions of the matrices Fs,p ∈ RM ×M for p > 0,
minimizing the critical path by choosing tree adders. While the partial input vector gets multiplied with an identity matrix
E drives hardware cost per CC-matrix product, the total augmented by zeros. This is formally done in (3) by the initial
hardware cost is balanced out by the need of less sequen- matrix factor Fs,0 . This can be shortened to filling up the
tial products. Due to more information being stored in each remaining bits with zeros. This is done in the leftmost section
CC-matrix the number P of CC-matrices required to reach a of Fig. 4.
certain precision decreases. After having assembled the partial input vectors, an array
The main benefit of our approach compared to a naive of CC-matrix-vector products follows. Each of these imple-
implementation results from the second bullet point men- ments the architecture presented previously. Each row is
tioned above. By restricting all non-zero matrix entries to implemented as a chain of products running in parallel to
be signed powers of two, we need not any multiplication other rows.
elements to implement the matrix-vector product. As num- As each row of products only represents a subset of
bers are encoded binary, a multiplication with a power of columns of the original weight matrix, or the underlying
two is nothing but a shift. There are various possibilities to matrix of general CMMs, the results of a row of CC-matrix-
implement these shifts. Barrel shifters enable shifting in both vector-products is only a partial result. To get the final output
directions and thus are one way of implementing the required vector all partial results zs need to be accumulated which is
computation. The main benefit of this approach is that the best done in a binary tree structure. This approach minimizes
implementation is independent of matrix values as matrix the critical path length at the cost of more hardware to imple-
elements are the controlling input of the shifters and can be ment it when compared to a linear addition.
read from memory. When assuming the matrices as fixed, As was explained in Section III, the decomposition of a
we can skip the shifters and hard-wire the shifts using simple matrix via the CC-algorithm is only approximate. The more
connections between the input vector and the adders. consecutive factors there are per slice, the higher the accuracy
At last, as we do not restrict the matrices to consist of of the approximation [15]. To achieve viability compared
positive values only, we need a way to handle negative matrix to other, competing implementations of CMMs we simply
entries. This is done by inverting the input vector at the use as many factors in the decomposition to reach the same

3888 VOLUME 11, 2023


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

or better accuracy which the fixed-point arithmetic of the approach to pipelining the architecture demonstrating how
competing implementation would provide. When we want to we can make use of the repetitive architecture and optimize
approximate, e.g., 8 bit signed integer arithmetic in this way, critical paths.
we need to set the amount of computations per slice such that A particular problem is the well-known memory bottle-
the quantization error is 48 dB below the entries of the weight neck, i.e. to enable our architecture to compute fast we
matrix [52]. require high data throughput. A matrix with dimensions 64 ×
Our designs are implemented using the Very-High-Speed 64 already requires as input-output-(IO)-ports two vectors
Integrated Circuit Hardware Description Language (VHDL) with 64 entries resulting, when encoded in 8 bit, 1024 bit
which is generated from the output of the decomposition transferred every clock cycle. At a frequency of 400 MHz
algorithm using a hardware generator we implemented in we need a memory bandwidth of 400 Gbit/s. To solve this
python. This choice best bridges the semantic gap between requirement we chose to implement our designs for the
the output and corresponding interface of the decomposition XCVU37-ES1 chip by Xilinx on the ADM-PCIE-9H7 board
algorithm and the hardware descriptions required for syn- by Alpha Delta. This setup is consistent for all the following
thesis while only relying on basic assertions and operations results.
supported by most VHDL standards and also comes with Our design is a fully rolled-out implementation and thus
the benefit of providing an interface for common neural net executes the corresponding CMMs in one clock-cycle. There-
frameworks such as Tensorflow or PyTorch. fore, we compare our architecture to a fully rolled-out version
The goal of our implementation is a tool that generates the of the naive implementation. As a basis for comparison of
description of the instantiated designs as shown above while hardware cost we implement both the standard approach and
sustaining compatibility with most synthesis tools, not only our CC-approach using look-up-tables (LUTs) and do not use
for FPGA implementations but for ASIC implementations as any DSPs present on this specific FPGA. By doing this we
well. Due to this, we implement our designs in the VHDL- can guarantee a fair comparison in terms of the validity of the
93 standard [53] which is supported by most synthesis tools. results as well as the applicability to other FPGA boards.
This choice comes with the drawback that the VHDL-93 Floating-point computations introduce a level of accuracy
standard lacks features of more modern versions which makes which can be used as a termination criterion for the decom-
implementations based on it unnecessarily complex and hard position of matrices, if desired. This way, similar to how
to read. Therefore we have chosen to implement a hardware we achieve the accuracy of fixed point computations, the
generator instead of relying on static VHDL implementations approximative decomposition provided by the CC-algorithm
of our designs. As mentioned our generator is implemented can then be as accurate as floating point arithmetic.
in python and consists of a generic VHDL generator back- Our designs are individual dataflow architectures that
end and several functions using the backend to generate the implement an entire CMM, ANN layer, or ANN, respectively.
descriptions of the designs for specified input matrices and There is no need for mapping algorithms for processing
parameters. Thus, the resulting interface consists of python elements. Results concerning speedup can be found in the
functions which can either be called individually but can also experiment concerning pipelining in Section IV-B3, the main
be connected to neural net interfaces as mentioned previously. experiment which discusses timing results, as well as in the
In-between the decomposition algorithm is executed, this also last implementation realizing a MLP in Section VI.
takes place on the python-level.
This architecture implements an approximate CMM, with 1) MATRIX DIMENSIONS
the approximation being at least as accurate as comparable One key aspect of the performance of our architecture is its
fixed-point arithmetic. The resource-efficiency we achieve scalability in terms of varying matrix dimensions and the cor-
is not at the cost of a lower throughput. It arises from suit- responding benefit when compared to the naive implementa-
ably quantizing matrices rather than naively quantizing their tion. This facet will be explored in the following experiment.
entries. Therefore, it can replace the naive implementation As we want to represent matrices appearing in ANNs we
of CMMs without hindering accuracy or throughput. In the chose to test our approach on square matrices with dimen-
following the potential of the presented architecture will be sions ranging from 64 × 64 to 256 × 256. To keep generality
explored with the main focus on properties of weight matrices we randomly generated matrices with independent uniformly
of ANNs. distributed entries. An experiment on varying statistics of
entries is presented in Section IV-B2.
B. SCALABILITY The main choices left before running the linear computa-
There are several factors that affect the scalability of our tion coding algorithm is the precision we want to achieve and
architecture for a matrix-vector product. Apart from opti- the size of the matrix slices to approximate. We compare our
mizations to the architecture and the ease of applying them, results to a fixed-integer arithmetic naive implementation of a
we can also explore the effects of variable matrix traits. The matrix-vector product with a bit width of 8 bit. The bitwidth
latter will be explored in the upcoming two experiments of all vector entries between matrices, meaning the in- and
which consider the impact of matrix dimensions as well as outgoing vectors of the corresponding matrices, is set to 8 bit.
the statistics of matrix entries. After that we present our This determines the precision we need to achieve. According

VOLUME 11, 2023 3889


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

TABLE 1. S: Number of vertical slices per matrix, P: number of Note that this factor is not dependent on the number of
consecutive products per matrix slice, W : width of each slice. The
standard approach (STD) implements a naive matrix-vector product with columns of the matrix due to the slicing applied to it. By slic-
fixed-point arithmetic with a bit width of 8. The precision of said bit ing the matrix we achieve that the number of rows dominates
width is achieved by the computation coding (CC) decomposition
resulting in S consecutive products per slice. The column I = STD LUTs the number of columns, e.g. in Table 1 by a factor of 16, 32 or
CC LUTs
represents the improvement of our approach over the naive one. 64, resulting in the observed relation between implementation
cost and matrix rows. Had we sliced the weight matrix hor-
izontally instead of vertically, the factor would be 12 log2 N
depending on the number of columns instead of rows.

2) MATRIX ENTRY STATISTICS


In the previous experiment we considered matrices with vary-
ing dimensions but all of the corresponding matrices have
uniformly distributed entries. The overall goal of our research
is to achieve a fair and well-founded comparison between our
CC-approach and other competing implementation methods.
To be able to achieve such a comparison with our results we
need to point out that in the real world not all matrices feature
a uniform entry distribution.
There are well known optimizations for special kinds of
to (10) and (13), the slice width should be around 5 to 8. multiplications. Consider the product shown in 14 where x is
We chose the three options W = 2, W = 4, and W = 8, supposed to be a fixed value.
as they lead to numbers of slices which are powers of two, z = xv (14)
thus simplifying the adder trees in Figure 4.
The results of this experiment are presented in Table 1. To implement 14 in hardware, generally a multiplication unit
Note that there are no multiplication units and all addition consisting of several adders is needed. The same is not true
units are implemented as LUTs. It is immediately obvious for special values of x.
that our approach outperforms the standard implementation • x = 0: There is no multiplication needed, the result of 14
in every case. The factor by which our implementation is is z = 0.
better in terms of hardware cost measured in LUTs required • x = 1: There is no multiplication needed, the result of
for implementation ranges from 2.3 to 3.4 for the best slice Equation 14 is z = v.
width shown in the table. The amount of adders required to • x = 2y , y ∈ Z: As numbers in hardware are represented
implement our approach depends on the matrix dimension in the binary system, there is no multiplication needed.
and the precision we want to achieve. Counting multipliers The result
Pj can be computed by shifting v by y digits.
• x = 2y , i, j ∈ Z: This is the representation of a
as multiple adders, we can expect a theoretical factor of y=i
1 binary number consisting of 0-bits and one continuous
2 log2 M for the benefit in terms of the number of adders
of our approach compared to a naive implementation for an sequence of 1-bits. Using a general multiplication j − i
M × N matrix [15]. This theoretical factor is also approxi- values have to be accumulated to calculate Equation 14.
mately reflected in our results, where we count LUTs instead The CSD representation provides the optimization x =
of adders. Further we learn that the precise slice width used 2j+1 − 2i . For sequences of more than two 1-bits the
in the decomposition of the original matrix has only minor amount of additions needed can be reduced to just one.
impact. The number of consecutive CC-products P required This comes with no added hardware cost, as subtractions
to achieve the desired precision decreases with the number in hardware can be realized by addition units with no
of slices S. This compensates for the increased hardware cost extra expense. Such a multiplication unit is called a
stemming from a larger number of slices. Booth-Multiplier.
The theoretical improvement factor 21 log2 M grows with Concluding the list of optimizations, the actual matrix entries
matrix size. For slice width W = 8, this is confirmed by the in a CMM have a large influence on the implementation cost
LUTs count in Table 1. The optimum theoretical slice width, of the entire CMM. This observation is the motivation for
however, is not a power of two, in general, but may be of the following experiment. We explore the impact of varying
course, in particular cases. This can explain why one result 0-1-bit ratios on the improvement that the CC-approach pro-
in Table 1 sticks out with particularly excellent performance, vides over the standard implementation of CMMs.
i.e. M = N = 64 and W = 4. Here, the optimum slice width The uniform distribution that was used in the previous
seems to be close to W = 4, while it does not match the grid experiment is now represented by a 50% 0-bit matrix. Such a
W ∈ {2, 4, 8} for larger matrices. uniform distribution is the worst case for multiplication, as it
Our approach outperforms a naive implementation of a features the highest number of additions required for imple-
M -by-N -matrix-vector product by a factor close to 12 log2 M mentation while also providing the least amount of Booth-
mainly depending on M , the number of rows of the matrix. Multipliers. Small 0-bit ratios feature more Booth-Multipliers

3890 VOLUME 11, 2023


A. Lehnert et al.: Most Resource Efficient Matrix Vector Multiplication on FPGAs

TABLE 2. This data compares the hardware complexity of a naive For matrices consisting only of 0-bits or only of 1-bits the
implementation of a matrix-vector product compared to our computation
coding approach for matrices with varying percentage of 0-bits. Each standard approach also does not require multiplication units.
matrix has the dimension 64 × 64 and is encoded in 8-bit to compute the The improvement provided by the CC-approach over the
metric. The factor I = STD LUTs is the improvement of our computation
CC LUTs standard implementation does not drastically decrease with
coding approach over the naive implementation. Each matrix is sliced
with a slice width of W = 4 resulting in S = 16 slices each. The number of deviations from a uniform entry distribution. It is even larger
factors is displayed in column P. for moderately small percentages of 0-bits. Only for extreme
bit ratios, e.g. ≥ 80% or < 5% 0-bits we see a notable
reduction in the improvement factor compared to the uniform
case. This demonstrates that the CC-approach leads to a great
improvement over the standard implementation for a wide
range of entry statistics. Uniformly distributed entries are not
required for a large improvement by our method.
Performance of the CC-approach is not hindered by minor
deviations from the uniform distribution. A sufficiently bal-
anced bit distribution leads to a clear improvement of about
three times over the standard implementation. For a not too
extreme bias towards 1-bits, the improvement can even be
larger.

TABLE 2. This data compares the hardware complexity of a naive implementation of a matrix-vector product with our computation coding approach for matrices with varying percentage of 0-bits. Each matrix has the dimension 64 × 64 and is encoded in 8 bit to compute the metric. The factor I = STD LUTs / CC LUTs is the improvement of our computation coding approach over the naive implementation. Each matrix is sliced with a slice width of W = 4 resulting in S = 16 slices each. The number of factors is displayed in column P.

FIGURE 5. This plot visualizes the results presented in Table 2.

As expected, the implementation of a nonuniform matrix is not as expensive as one for a uniform matrix, which is true both for the naive approach, marked as STD in Table 2, and for our architecture. These results are graphically presented in Figure 5. We can also see that in general our approach is better by a factor of 3 to 4.5 compared to the naive implementation, with some anomalies in the edge cases. In the case of a matrix only consisting of 1-bits we see that the naive implementation is actually better. This is due to the matrix decomposition only being approximate and the naive implementation making use of the static pattern of the matrix.

For matrices consisting only of 0-bits or only of 1-bits the standard approach also does not require multiplication units. The improvement provided by the CC-approach over the standard implementation does not drastically decrease with deviations from a uniform entry distribution. It is even larger for moderately small percentages of 0-bits. Only for extreme bit ratios, e.g. ≥ 80% or < 5% 0-bits, do we see a notable reduction in the improvement factor compared to the uniform case. This demonstrates that the CC-approach leads to a great improvement over the standard implementation for a wide range of entry statistics. Uniformly distributed entries are not required for a large improvement by our method.

Performance of the CC-approach is not hindered by minor deviations from the uniform distribution. A sufficiently balanced bit distribution leads to a clear improvement of about three times over the standard implementation. For a not too extreme bias towards 1-bits, the improvement can even be larger.

3) PIPELINING
There are various approaches to implement a pipeline in the architecture seen above. The traditional approach is to pipeline the architecture top-down. This means to insert pipeline registers between each CC-matrix-vector product, further between each matrix-vector product, eventually between the various computational steps in each layer, and between the layers themselves. An abstract illustration of such a hierarchical pipelining approach is presented in Figure 6. Hierarchical pipelining without further synchronization is possible as each row of multipliers has the same number of elements and thus every path through said multipliers has the same length.

As an alternative to the hierarchical approach it is also possible to implement a bottom-up approach. For bottom-up pipelining we consider the architecture as an entirely rolled-out net with adders as base building blocks and partition it into subnets with equal critical path lengths. It is possible to use adders as the atomic units of this process because our proposed designs for CMMs are made up of adders only and do not rely on multiplication units. As all paths through the architecture have the same length, and thus every path is a critical path, synchronization between the different paths is not necessary.

It is possible to create pipeline steps not only between CC-matrix-vector multiplications but also inside the computation units implementing said products themselves, without the need for additional synchronization. As for the implementation of each individual CC-product, the critical path length remains constant for each multiplication. This is due to each matrix row requiring the same number of multiplications with elements of the input vector and each individual multiplication being realized as a shift only. With these static properties there is little variance in path length over all paths in the implementation of a CC-matrix-vector product.

The only difference between the hierarchical and the net-partitioning approach to pipelining is the amount of


registers required. Depending on the placement of a pipeline step, the number of signals that need to be buffered varies. A set of registers placed as a pre- or postfix to a CMVM unit requires only the corresponding input or output vectors to be stored. When we cut a CMVM unit itself into partitions, the pipeline registers in-between the stages have to store the intermediate vectors of all concerned slice approximation datapaths.

FIGURE 6. This figure shows an abstract approach to pipelining which is being implemented in our architecture. Each register is depicted in blue. A pipeline step spans a CC-matrix-vector product, a bias addition or a nonlinear activation function (e.g. here ReLU). The CMVM units represent a CC-matrix-vector product each making use of our optimized approach.

Our approach to pipelining sees the multiplication as an unfolded net and simply inserts pipeline steps such that the critical path of each step has the same length. In the case of a fixed matrix this benefits highly from the architecture only being made up of adders, as shifts can be hard-wired. Therefore, an optimal pipeline distribution becomes possible and can even be computed beforehand. To explore the effects of pipelining in our architecture we compare randomly generated matrices with uniformly distributed entries and various counts of pipeline steps each.
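Because all paths have the same length, computing such a distribution reduces to cutting the unfolded adder net into stages of (near-)equal depth. The sketch below is a minimal illustration of this idea, assuming the critical path is measured in adder levels; it is not the code of our generator.

# Sketch: place pipeline registers so that each stage covers an equal share
# of the adder levels of the unfolded net (depth differences of at most one).
def stage_cuts(total_adder_levels, num_stages):
    base, extra = divmod(total_adder_levels, num_stages)
    cuts, level = [], 0
    for s in range(num_stages - 1):          # no register after the last stage
        level += base + (1 if s < extra else 0)
        cuts.append(level)                    # register after this adder level
    return cuts

print(stage_cuts(total_adder_levels=28, num_stages=14))  # cut every two levels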
Next to the resulting hardware complexity for each product, the most important results are the corresponding frequencies that the implementations can be run at. Said maximal frequency is determined by the critical path length, the longest run of gates between two registers. To determine the optimal frequency we make use of the bisection method. For each implementation run of our architecture we set a fixed timing goal. After the implementation we determine the difference in timing between the goal and the required time for the critical path. According to the gathered information we adjust the timing goal until the absolute difference passes a termination threshold, giving us the maximal frequency of the corresponding design.
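The search loop can be sketched as follows. The helper run_implementation is a placeholder for a complete synthesis and implementation run that reports the worst slack for a requested clock period; its name, the 100 MHz starting point (as in Table 3) and the termination threshold are illustrative assumptions.

# Sketch of the bisection on the timing goal (clock period in ns).
def find_max_frequency(run_implementation, f_start_mhz=100.0, tol_ns=0.05):
    period = 1000.0 / f_start_mhz      # initial timing goal
    lo, hi = 0.0, None                 # lo: too tight, hi: goal met
    while True:
        slack = run_implementation(period)   # worst slack in ns for this goal
        if slack >= 0.0:
            hi = period                      # goal met, try a tighter one
        else:
            lo = period                      # goal missed, relax the goal
        if hi is not None and (hi - lo) < tol_ns:
            return 1000.0 / hi               # maximal frequency in MHz
        period = (lo + hi) / 2.0 if hi is not None else period * 2.0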
The test procedure was repeated for a set of pipeline step counts for a 64 × 64 matrix with two respective approximate decompositions. For all our results the width of the vector entries is set to 8 bit. Each decomposition requires a different amount of concatenated CC-products per row of computation to reach 8-bit integer calculation precision.

The results of this experiment are presented in Table 3, where several observations can be made. First of all, the hardware cost increases with the increasing number of pipeline steps: the LUT counts required for implementation are about constant, but the number of required registers increases.

TABLE 3. This data compares the number of LUTs, the number of flip-flops (FFs), and the maximal frequencies for various designs for the decompositions of a 64 × 64 matrix with uniformly distributed entries. Each frequency is found using the bisection method starting with 100 MHz. The decomposed matrix approximates the original up to an error similar to fixed-point 8-bit arithmetic of a naive approach.

The amount of additional registers per added pipeline step depends on the positioning of the step. While registers in-between layers, or generally outside of the matrix-vector product, result in a small increase of the register count, having pipeline steps inside the multiplication unit is more expensive. This is due to the parallel rows of computation, which require registers in every row. Still, both types of pipeline steps lead to a linear increase in required registers.

With an increase in pipeline steps the maximal frequencies of the corresponding designs increase, reaching a peak at about 400 MHz. The maximal frequency is the same for implementations requiring more sequential CC-matrix-vector products as for implementations with fewer ones, as the minimal pipeline steps only depend on the greatest atomic units in the chain, which are adders in both cases. The only difference in the resulting implementations for the two cases is the number of pipeline steps.

Note that with an increase in pipeline steps the initiation period of the overall pipeline also increases by the same amount of clock cycles. Table 3 shows increases from one to eight and 14 pipeline steps, respectively. The initiation period of the corresponding implementation also increases to eight and 14 cycles. These clock cycles are shorter than the clock cycles of the design without pipeline steps, reducing the effects of said impact. After the initiation of the pipeline the architecture is back to the single-cycle execution of the corresponding F-Blocks.

As is described in Section IV-A, our designs are generated implementations using our own Python VHDL generator and do not rely on existing high-level synthesis (HLS) tools. Thus existing loop initiation algorithms and procedures are not applicable to our architecture. A high number of pipeline


steps does not harm the efficiency as there are no hazards occurring during computation. Overall, these improvements gained by introducing pipelining to our designs lead to a speedup of 3.7 and 4.2, respectively. With initiation periods being directly related to the number of pipeline steps, pipelining the architecture only leads to an improvement when more than a single calculation is performed. The more vectors are passed through the hardware, the less the initiation cycles impact the total number of cycles required for the overall computation. Reconfiguring the FPGA to accommodate instances of a partitioned implementation can cut this pipelining improvement. Therefore, it is best to implement the fully rolled-out net as a whole or to buffer input data.
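This effect can be quantified with a simple cost model (our own illustration, not taken from the measurements): if the pipelined design needs P + V − 1 cycles for V back-to-back input vectors while the combinatorial design needs V cycles, the speedup approaches the clock-frequency ratio only for large V. The frequencies below are example values.

# Sketch: effective speedup of a pipelined over a combinatorial design for
# V back-to-back input vectors (assumed cost model: P + V - 1 vs. V cycles).
def effective_speedup(f_pipe_mhz, f_comb_mhz, pipeline_steps, num_vectors):
    t_pipe = (pipeline_steps + num_vectors - 1) / f_pipe_mhz
    t_comb = num_vectors / f_comb_mhz
    return t_comb / t_pipe

for v in (1, 10, 1000):
    print(v, round(effective_speedup(400.0, 100.0, 14, v), 2))  # < 1 for v = 1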
As is shown in Section IV-B1, the required amount of sequential factors to achieve a certain accuracy is not only dependent on the slice width but also on the matrix dimensions. The results in Table 1 show that for a fixed slice width said number only varies slightly, while for varying slice widths it changes drastically. Therefore, we can conclude that the results shown in Table 3 are similar for matrices with larger dimensions. Even if more or fewer sequential factors are required to achieve the desired accuracy, only the number of initiation steps for the matrix changes, while the maximal clock frequency varies only marginally. This is also reflected in the similar maximal frequencies of the two explored designs in Table 3.
4) GENERAL COMPARISON WITH OTHER CSD-ALGORITHMS
As was already alluded to earlier, there are algorithms aiming to lower the computational effort, and thereby the hardware cost, of the corresponding implementations for CMMs, in particular [12], [13]. The general approach of said algorithms is to convert matrix entries to CSD and thereby minimize the number of non-zero bits appearing in the matrices [12]. The resulting additions of a SOP in a line of the CMVM unit are then represented as a DAG of adders which can then be minimized.

Said minimization problem is NP-hard [12]. Especially for matrices with large dimensions, as are, e.g., used in our benchmarks earlier, it is not feasible to find the optimal solution to this minimization problem. Thus, greedy searches are used to approach the optimum [12] or inaccuracy is introduced to the calculation [13]. This results in an improvement of the computational complexity of the CMM, which in turn is reflected in more hardware resource-efficient implementations. The results presented by Kumm et al. [12] feature an improvement of up to 34%, while the results presented by Aksoy and Flores [13] similarly reach an improvement of up to 30%.

Our proposed method reaches improvement factors of three to five times, or in other words saves 67% to 80% of hardware cost while not hindering throughput. In summary, our linear computation coding approach produces better results than current state-of-the-art (SoA) algorithms.

V. EVALUATION USING AN EXAMPLE DLRM
For the purpose of analyzing our architecture, we chose to use a recommender system as an example of an ANN. These systems are utilized by various companies, e.g., by streaming services, to give their customers advice about movies they may like based on their consumer behavior. During the last years these systems have become increasingly reliable in their forecasts, not least because of the more frequent use of algorithmic models aided by multilayer perceptron (MLP) concepts. One of these algorithms was implemented recently, in 2019, by the Deep Learning Recommendation Model for Personalization and Recommendation Systems (DLRM) [54].

A. PRINCIPLES OF RECOMMENDATION NETWORKS
In order to better understand the value of this model's individual components, we first give a short introduction to the principles of recommendation networks. Recommendations today are given based on two underlying principles, namely content-based filtering and collaborative filtering. While the former approach bases its prediction on the users' own preferences, collaborative filtering tries to infer a recommendation based on the preferences of similar users. One of the first systems taking advantage of both of these concepts was the factorization machine. Its prediction formula consists of two parts: a regression part and a matrix factorization part. The regression handles both sparse and dense data of the feature vector and can consequently be seen as the content-based filtering part of the system. The matrix factorization, on the other hand, accounts for the interactions between feature blocks, which represents the collaborative filtering approach. Even though both of these models are already integrated in this straightforward implementation, results can be further refined by making use of MLP layers. Due to their non-linearity, MLPs can learn even higher degrees of interaction.

DLRM now brings those ideas together and introduces a new concept by separating the features into dense continuous and sparse categorical features, the latter being represented by embedding vectors of the same size. The dense features are fed into a bottom MLP which transforms them into an intermediate vector of the same size as the embedding vectors of the categorical features. Similar to the factorization machine, in the second stage the dot product between the embedding vectors and the output of the bottom MLP is computed, which represents the computation of second-order interactions of different features. The products are then concatenated to the result from the bottom MLP, fed into another, top MLP and finally to a sigmoid function in order to obtain a probability.

In order to test our approach, we exchanged the weights in the MLP layers of an already trained DLRM network with the ones obtained by the utilization of our matrix decomposition algorithm.
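For clarity, the interaction stage just described can be sketched in a few lines of PyTorch. This is our own simplified rendering, not the reference DLRM implementation of [54]; batch size, embedding dimension and module names are illustrative.

# Sketch: second-order interaction of DLRM-style features.
import torch

def interact(dense_out, embeddings):
    # dense_out: (B, D) bottom-MLP output; embeddings: list of (B, D) tensors.
    feats = torch.stack([dense_out] + embeddings, dim=1)      # (B, F, D)
    dots = torch.bmm(feats, feats.transpose(1, 2))            # pairwise dot products
    i, j = torch.triu_indices(feats.size(1), feats.size(1), offset=1)
    return torch.cat([dense_out, dots[:, i, j]], dim=1)       # input of the top MLP

B, D = 4, 16
z = interact(torch.randn(B, D), [torch.randn(B, D) for _ in range(3)])
p = torch.sigmoid(torch.nn.Linear(z.size(1), 1)(z))           # stand-in top MLP + sigmoid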


TABLE 4. Comparison of the hardware costs of implementing the matrix-vector products of each layer of the DLRM. In this table, Layer denotes the name of the corresponding layer, with an M × N weight-matrix. S is the number of slices with width W = 4 and P the number of consecutive CC-matrix-products. I = STD-LUTs / CC-LUTs is the factor by which our approach improves the standard (naive) implementation.

B. COMPARISON
As a basis for comparison, we chose the same hardware platform as for all other experiments presented above. This means we synthesized our design for the XCVU37-ES1 chip by Xilinx on the ADM-PCIE-9H7 board by Alpha Delta. First we look at a layer-by-layer comparison of our approach and a naive implementation of a trained ANN as described previously. The results are displayed in Table 4. Again it is immediately obvious that our approach performs better than the naive implementation, with the improvement factor varying between 2x and 6x. It is notable that in the Bottom-2 layer said factor is very high compared to other results. This is due to properties of the matrix used in this layer. With the underlying matrix being a 64 × 256 matrix, it is quite big compared to, e.g., the next layer only featuring a 16 × 64 matrix. On top of that the matrix is not at all sparse, leading to an overall high improvement over the naive implementation. The Bottom-1 layer features an even larger 256 × 512 matrix, but it is not as dense as the matrix of the Bottom-2 layer. Thus, the improvement of 1.9x of using our approach compared to a naive implementation is not as high. Overall, both the naive implementation and our approach require an enormous amount of LUTs to be implemented on an FPGA, but summing up all layers our approach saves 60% of the hardware cost. As mentioned before, pipelining the resulting architecture is very efficient for our approach as the registers can be placed in a way that all paths through the pipeline step have the same length. This cannot be said for the naive implementation, as a comparable assurance cannot be made.

VI. COMPARISON WITH AN EXAMPLE MLP
Previous examples explored the layer-by-layer performance of the CC-approach in comparison to a standard implementation of CMMs on the premise of single fixed matrices. This is also true for the previously presented DLRM, which is explored on a layer-by-layer basis as the amount of LUTs required to implement it as a whole is too high for FPGAs. With this final example we present an MLP which can be placed and implemented on an FPGA. Similar to previous research [55], we also choose to design an MLP to classify the Modified National Institute of Standards and Technology (MNIST) dataset.

An abstract representation of our net design can be seen in Figure 7. Our net consists of four layers with the corresponding weight matrices being of size 784×64, 64×64, 64×64, and 64×10, respectively, each followed by a tanh-activation function. To accommodate the 28 × 28 greyscale input images of handwritten digits we reshape the input to a vector of dimension 784. The resulting classification is achieved by sorting the images into ten categories, one for each digit, hence the initial and final dimensions of the weight matrices of the layers of the net. With this setup, a learning rate of 0.001, and a batch size of 32, an average reliability of classification between 90% and 97% can be achieved after 30 training epochs.
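The layer dimensions and activation functions below follow this description; everything else (module style, the choice of optimizer) is a minimal sketch of our own and not the exact training script.

# Sketch: the four-layer MLP described above (784-64-64-64-10, tanh activations).
import torch
import torch.nn as nn

class MnistMlp(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # 28 x 28 greyscale image -> 784-vector
            nn.Linear(784, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 10), nn.Tanh(),    # ten categories, one per digit
        )

    def forward(self, x):
        return self.net(x)

model = MnistMlp()
# The text fixes lr = 0.001 and batch size 32; the optimizer choice is assumed.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scores = model(torch.randn(32, 1, 28, 28))   # one batch of 32 images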
FIGURE 7. This figure shows an abstract representation of a MLP design used to classify the MNIST dataset. It reshapes the input images and passes them through four fully connected layers, the dimensions of which can be found in the image.

TABLE 5. Test accuracy of the net presented in Figure 7 after 50 epochs of training with different methods of computation. Compared are the standard floating point method included with PyTorch, an 8-bit fixed point implementation and the corresponding CC-decomposition. The target accuracy of the decomposition is set to 48 dB, such that the approximation is equally or more accurate than the 8-bit fixed point calculation.

We trained a net with these parameters and achieved an accuracy saturation at 94% on the test dataset after 30 epochs. Fixed point calculation methods, as well as the CC-decomposition approximating them, introduce an additional quantization error. The influence of this quantization error is small, as shown in Table 5, which compares the default floating point computation method provided by the PyTorch framework with a fixed point implementation and a corresponding approximation using the Computation Coding algorithm presented in Section III. The results show only a minor decrease in the reliability of classification by the net when changing the computation method. Note that not only the quantization error, but also the additional inaccuracy introduced by using our decomposition compared to the standard floating point approach is lower than for the fixed point implementation.

Similar to previous examples, we use the weight matrices of the layers as the basis for CMMs which are then implemented using our approach and a standard approach. As target accuracy the bitwidth of 8 bit is chosen, and again the same ADM-PCIE-9H7 board by Alpha Delta with the XCVU37-ES1 chip by Xilinx is used as the platform for implementation. For the decomposition of the matrices we tried various different slice


widths of W = 2, W = 4 and W = 8; the number of factors for each decomposition is chosen to match the desired accuracy of 47 dB.

TABLE 6. This table shows the hardware cost in LUTs for our CC-approach compared to the standard implementation. The decomposition arguments are the slice width W and the corresponding slice count S as well as the amount of factors P used. The last row shows the total cost including implementations of tanh-activation functions, while the other rows present layer-by-layer results of the individual weight matrix CMMs. I = STD LUTs / CC LUTs describes the improvement of our approach over the standard implementation.
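As a point of reference for this 47–48 dB target, the relative error of plain 8-bit fixed-point rounding can be measured as in the sketch below; this is our own illustration, with an assumed per-matrix scaling.

# Sketch: relative error (in dB) of rounding a weight matrix to 8-bit fixed point.
import numpy as np

def quantization_error_db(w, bits=8):
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)   # assumed per-matrix scaling
    w_q = np.round(w / scale) * scale
    return 20.0 * np.log10(np.linalg.norm(w - w_q) / np.linalg.norm(w))

rng = np.random.default_rng(0)
print(quantization_error_db(rng.standard_normal((64, 64))))   # roughly -40 to -50 dB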
The results of this experiment with the optimal configuration of slice widths for the decomposition on a layer-by-layer basis can be seen in Table 6. According to our findings, the optimal slice width for the decomposition of the weight matrices of the first three layers is W = 4, while the last weight matrix is sliced into five slices of width W = 2. Note that the first as well as the last layer have extreme dimensions in the sense that their corresponding horizontal and vertical dimensions differ a lot. Therefore, to achieve better results these matrices are not sliced horizontally, as opposed to the weight matrices of the remaining two layers, yielding more tall and narrow matrix slices. In the layer-by-layer analysis of the implementation costs of the second and third layer we improve by a factor of 2.7x. Previous results, i.e. Table 2, suggest that the improvement is not as high as suggested by Table 1 due to non-uniformly distributed matrix entries. The first layer shows that our approach also decreases hardware cost for matrices with one extreme dimension, while the last layer shows that, when slicing is done correctly, matrices with one small dimension can also be implemented in efficient CMMs.

For the implementation of the entire net, the weight matrices are concatenated with adders for the bias as well as the implementation of the non-linear activation function, in our case a tanh-function. With the chosen configuration we achieve an overall improvement of 2.4x, which can be seen in the last row of Table 6. The numbers are the results of the implementation of the entire net including the application of tanh activation functions in between the multiplications with the corresponding weight matrices. With the final configuration, and thus the hardware designs, fixed, we can now compare power and timing results between the standard implementation, our CC-approach, as well as a CPU and GPU execution of the inference of the neural net. We found the target frequency of the overall net to be at 372 MHz with 39 pipeline stages, coinciding with the results presented in Table 3. Vertical slicing of the weight matrices in the first and last layer leads to varying path lengths. There are paths that would require more than one addition per row, and thus an extra pipeline step is introduced. Paths where this problem does not occur require an additional buffer stage in the corresponding pipeline stages. Additional hardware cost is introduced by this necessary buffering. Still, this decomposition requires fewer LUTs than the implementation of the corresponding vertically sliced decomposition due to the extreme dimension-ratio of the underlying matrices.

TABLE 7. Dynamic power analysis of the CC-approach and the standard implementation. The frequency is fixed at 50 MHz. The analysis is run with a toggle rate of 12.5% and a static probability of 0.5. Column CC displays the dynamic power draw of the CC-approach, column STD the same of the standard implementation. I denotes the improvement.

Finally, we also consider the energy aspect. For a power comparison between the standard implementation and our architecture we set up a layer-wise comparison between the two methods. With a fixed throughput we enable a fair comparison between the two combinatorial designs, i.e. both designs are not pipelined. The last missing point of comparison is that of the power requirements of the two implementations. We ran a power analysis based on subnet switching activity provided by a post-synthesis simulation, the results of which can be seen in Table 7. The frequency is fixed to 50 MHz to generate results for equal throughputs. Further parameters are kept at their defaults, with a toggle rate of 12.5% and a static probability of 0.5. Following this setup, we compared the implementations of layers two and three of our MNIST-MLP presented in Figure 7. Layer two of Figure 7 shows an improvement of 1.82x, while the implementations of layer three feature an improvement of 1.83x. These findings show that our design not only provides a reduction in the required number of LUTs for the implementation of these CMMs, and thus the entire net, but is also more energy efficient than the standard implementation of the same. One key aspect of our future research is further analysis of these results in combination with ASIC implementation comparisons of the respective designs.

To compare the inference of this net against the performance of a CPU and GPU, we ran inference with 2000 classifications. Our test system is equipped with two AMD EPYC ROME 7352 CPUs featuring 48 cores clocked at 3.2 GHz and a NVIDIA A100 (40GB) GPU. The measurement was set up by copying a random image from the MNIST dataset as an input to the corresponding memory; after that, inference of the net described above was repeatedly executed and the execution times were measured. To achieve reliable results the process was repeated ten times and the measured times were averaged. Also, the number of times any individual image was used as an input without any further storage accesses was varied between 10 and 100000 times to guarantee a saturated execution time per inference. In our measurement


our CPU executed an average of 1554 inferences per second, while our GPU achieved 5688 inferences per second. In comparison, our hardware design implemented on the ADM-PCIE-9H7 achieves a target frequency of 378 MHz while not saturating the available memory bandwidth of the HBM memory modules. A design with registers only between the layers can still be run at 50 MHz while keeping the number of required registers to a minimum. Compared to the execution of the net on the mentioned GPU and CPU, the CC-solution still provides a notable speedup of 32175x over the CPU and 8790x over the GPU implementation. Note that this speedup is possible with an overall power budget of under 10 W, while the NVIDIA A100 has a TDP of 300 W and the AMD EPYC ROME 7352 CPUs come with a TDP of 155 W each. The power draw of these devices is thus not comparable to the small power budget of our FPGA solution. With these findings we can infer that our approach to the execution of CMMs on FPGAs does not only outperform standard implementations of this computational operation but also provides an enormous improvement in performance over SoA solutions such as CPU and GPU implementations.

VII. DISCUSSION AND CONCLUSION
In this paper, we presented a new method for lowering the computational effort of CMMs, e.g., for ANN inference, decomposing the constant (weight) matrices by slicing and factorization. The resulting sub-matrices are sparse, with a well-behaved structure, and contain only numbers related to a power of two. Utilizing this a-priori knowledge, an efficient computer architecture is designed, which exploits the structure of the sub-matrices perfectly. Finally, hardware resources can be decreased by a factor of 2 to 6.

While in this work the main focus is set on MLP ANNs, an increasing number of today's applications use convolutional neural nets (CNNs). We already found a method to apply the linear computation coding procedure to this kind of ANNs. Investigations are ongoing. Future work here includes a modified decomposition algorithm as well as a hardware architecture to support CNNs.

Additionally, in this paper we focused on implementing an architecture for CMMs that are equally or more accurate than fixed-point implementations. Inference in ANNs especially does not always require this high level of accuracy. Thus, it is possible to lower the computational accuracy of certain applications while still achieving similar results. Hence, another future point of interest of ours is to explore the benefits of tuning down computational accuracy and thereby improving hardware efficiency even further. In this aspect we expect our architecture to do well, as there are two approaches we can take here. First, it is possible to simply use fewer consecutive matrix factors to approximate each matrix-slice. Second, we can reduce the number of slices, which also reduces the accuracy of computation for a fixed number of matrix factors. Clearly, both approaches can be mixed in order to get a suitable trade-off between hardware-efficiency and accuracy of computation. In this direction of research, comparisons to other computation paradigms, such as floating point arithmetic, are also of importance. As is already alluded to, the variable accuracy of the decomposition by the CC-algorithm can also be used to achieve the same accuracy that floating point arithmetics provide. Future work will explore the hardware cost of such an implementation and provide a comparison to existing floating point implementations, e.g. to dedicated DSPs as well as to LUT-based implementations.

Another point of focus in our future research will be experimenting with varying entry-counts per row of a CC-matrix. Instead of fixing the structure to only allow for two entries per row, it is also possible to use more powers of two. With only one entry there is no addition needed, while four, eight or more entries, similar to the traditional approach, require larger adder implementations. With a higher number of entries not only the number of adders per CC-matrix-vector product increases, but also the number of matrix factors required to approximate the original matrix decreases. This relation will be explored further. Also, different adder implementations like adder-trees and linear adders can be compared in different aspects like hardware cost and critical path length.

Building upon all the mentioned future research, we will explore ways to implement our designs beyond FPGAs. For implementations of DSP algorithms specialized accelerators already exist, and our approach to CMM improves on them. In this aspect, we will explore ASIC implementations of our architecture. As already mentioned in the beginning, the general downside of ASICs when compared to FPGAs is the lack of reconfigurability. In this regard we will explore the performance of our design on CGRAs or even specialized reconfigurable ASIC-like implementations where only the interconnections, i.e. the wiring that replaces the shifters in the CMM units for CC-matrices, are reconfigured.

REFERENCES
[1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1-9.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779-788.
[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137-1149, Jun. 2017.
[4] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142-3155, Jul. 2017.
[5] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798-1828, Aug. 2013.
[6] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp. 6645-6649.
[7] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio, Speech Language Process., vol. 22, no. 10, pp. 1533-1545, Oct. 2014.
[8] P. Bangalore and L. B. Tjernberg, "An artificial neural network approach for early fault detection of gearbox bearings," IEEE Trans. Smart Grid, vol. 6, no. 2, pp. 980-987, Mar. 2015.


[9] Y. Xu, Y. Sun, X. Liu, and Y. Zheng, ‘‘A digital-twin-assisted fault diagno- [30] A. Fan, P. Stock, B. Graham, E. Grave, R. Gribonval, H. Jegou, and
sis using deep transfer learning,’’ IEEE Access, vol. 7, pp. 19990–19999, A. Joulin, ‘‘Training with quantization noise for extreme model compres-
2019. sion,’’ 2020, arXiv:2004.07320, doi: 10.48550/ARXIV.2004.07320.
[10] O. Gustafsson, J. Coleman, A. Dempster, and M. Macleod, ‘‘Low- [31] G. B. Hacene, V. Gripon, M. Arzel, N. Farrugia, and Y. Bengio, ‘‘Quantized
complexity hybrid form fir filters using matrix multiple constant multipli- guided pruning for efficient hardware implementations of deep neural
cation,’’ in Conf. Rec. 38th Asilomar Conf. Signals, Syst. Comput., vol. 1, networks,’’ in Proc. 18th IEEE Int. New Circuits Syst. Conf. (NEWCAS),
2004, pp. 77–80. Jun. 2020, pp. 206–209.
[11] N. Boullis and A. Tisserand, ‘‘Some optimizations of hardware multi- [32] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
plication by constant matrices,’’ IEEE Trans. Comput., vol. 54, no. 10, Y. Chen, ‘‘Cambricon-X: An accelerator for sparse neural networks,’’
pp. 1271–1282, Oct. 2005. in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO),
[12] M. Kumm, M. Hardieck, and P. Zipf, ‘‘Optimization of constant matrix Oct. 2016, pp. 1–12.
multiplication with low power and high throughput,’’ IEEE Trans. Com- [33] T. Posewsky and D. Ziener, ‘‘A flexible FPGA-based inference architecture
put., vol. 66, no. 12, pp. 2072–2080, Dec. 2017. for pruned deep neural networks,’’ in Architecture of Computing Systems.
[13] L. Aksoy, P. Flores, and J. Monteiro, ‘‘A novel method for the approxi- Cham, Switzerland: Springer, 2018, pp. 311–323.
mation of multiplierless constant matrix vector multiplication,’’ EURASIP [34] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams,
J. Embedded Syst., vol. 2016, no. 1, pp. 1–11, Dec. 2016. P. Faraboschi, W.-M. W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic,
‘‘PUMA: A programmable ultra-efficient memristor-based accelerator for
[14] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O’brien,
machine learning inference,’’ in Proc. 24th Int. Conf. Architectural Support
Y. Umuroglu, M. Leeser, and K. Vissers, ‘‘FINN- R: An end-to-end deep-
Program. Lang. Operating Syst., New York, NY, USA, 2019, pp. 715–731,
learning framework for fast exploration of quantized neural networks,’’
doi: 10.1145/3297858.3304049.
ACM Trans. Reconfigurable Technol. Syst., vol. 11, no. 3, pp. 1–23,
[35] R. Mochida, K. Kouno, Y. Hayata, M. Nakayama, T. Ono, H. Suwa,
Dec. 2018, doi: 10.1145/3242897.
R. Yasuhara, K. Katayama, T. Mikawa, and Y. Gohou, ‘‘A 4M synapses
[15] R. R. Müller, B. Gade, and A. Bereyhi, ‘‘Linear computation coding,’’ in integrated analog ReRAM based 66.5 TOPS/W neural-network processor
Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Toronto, with cell current controlled writing and flexible network architecture,’’ in
ON, Canada, Jun. 2021, pp. 5065–5069. Proc. IEEE Symp. VLSI Technol., Jun. 2018, pp. 175–176.
[16] R. R. Müller, B. Gade, and A. Bereyhi, ‘‘Efficient matrix multiplication: [36] O. Krestinskaya and A. P. James, ‘‘Binary weighted memristive analog
The sparse power-of-2 factorization,’’ in Proc. Inf. Theory Appl. Workshop deep neural network for near-sensor edge processing,’’ in Proc. IEEE 18th
(ITA), San Diego, CA, USA, Feb. 2020, pp. 1–6. Int. Conf. Nanotechnol. (IEEE-NANO), Jul. 2018, pp. 1–4.
[17] C. Latotzke and T. Gemmeke, ‘‘Efficiency versus accuracy: A review of [37] Y. Li, S. Kim, X. Sun, P. Solomon, T. Gokmen, H. Tsai, S. Koswatta,
design techniques for DNN hardware accelerators,’’ IEEE Access, vol. 9, Z. Ren, R. Mo, C. C. Yeh, W. Haensch, and E. Leobandung, ‘‘Capacitor-
pp. 9785–9799, 2021. based cross-point array for analog neural network with record symmetry
[18] H. T. Kung and C. E. Leiserson, ‘‘Systolic arrays for (VLSI),’’ Dept. Com- and linearity,’’ in Proc. IEEE Symp. VLSI Technol., Jun. 2018, pp. 25–26.
put. Sci., Carnegie-Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU- [38] L. Fick, D. Blaauw, D. Sylvester, S. Skrzyniarz, M. Parikh, and D. Fick,
CS-79-103, 1978. ‘‘Analog in-memory subthreshold deep neural network accelerator,’’ in
[19] N. P. Jouppi et al., ‘‘In-datacenter performance analysis of a tensor pro- Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2017, pp. 1–4.
cessing unit,’’ in Proc. 44th Annu. Int. Symp. Comput. Archit., New York, [39] E. Rosenthal, S. Greshnikov, D. Soudry, and S. Kvatinsky, ‘‘A fully analog
NY, USA, 2017, pp. 1–12, doi: 10.1145/3079856.3080246. memristor-based neural network with online gradient training,’’ in Proc.
[20] L. Jia, L. Lu, X. Wei, and Y. Liang, ‘‘Generating systolic array accelerators IEEE Int. Symp. Circuits Syst. (ISCAS), May 2016, pp. 1394–1397.
with reusable blocks,’’ IEEE Micro, vol. 40, no. 4, pp. 85–92, Jul. 2020. [40] (Jun. 2021). I. G. L.-I. für innovative Mikroelektronik. IHP Offers Access
[21] L. D. Medus, T. Iakymchuk, J. V. Frances-Villora, M. Bataller-Mompean, to Memristive Technology for Edge AI Computing or Hardware Artificial
and A. Rosado-Munoz, ‘‘A novel systolic parallel hardware architecture Neural Networks Applications. [Online]. Available: https://www.ihp-
for the FPGA acceleration of feedforward neural networks,’’ IEEE Access, microelectronics.com/de/news/news-detailansicht/ihp-off%ers-access-to-
vol. 7, pp. 76084–76103, 2019. memristive-technology-for-edge-ai-computing-or-hardware-artifici%al-
[22] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, ‘‘High-performance CNN neural-networks-applications
accelerator on FPGA using unified winograd-GEMM architecture,’’ IEEE [41] M. A. Nahmias, T. F. de Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, and
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 12, pp. 2816–2828, P. R. Prucnal, ‘‘Photonic multiply-accumulate operations for neural net-
Dec. 2019. works,’’ IEEE J. Sel. Topics Quantum Electron., vol. 26, no. 1, pp. 1–18,
Jan. 2020.
[23] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter,
[42] V. Bangari, B. A. Marquez, H. Miller, A. N. Tait, M. A. Nahmias,
‘‘NVIDIA tensor core programmability, performance & precision,’’ in
T. F. de Lima, H.-T. Peng, P. R. Prucnal, and B. J. Shastri, ‘‘Digital elec-
Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW),
tronics and analog photonics for convolutional neural networks (DEAP-
May 2018, pp. 522–531.
CNNs),’’ IEEE J. Sel. Topics Quantum Electron., vol. 26, no. 1, pp. 1–13,
[24] K. Rocki, D. Van Essendelft, I. Sharapov, R. Schreiber, M. Morrison,
Jan. 2020.
V. Kibardin, A. Portnoy, J. F. Dietiker, M. Syamlal, and M. James, ‘‘Fast
[43] A. Rahim, T. Spuesens, R. Baets, and W. Bogaerts, ‘‘Open-access silicon
stencil-code computation on a wafer-scale processor,’’ in Proc. Int. Conf.
photonics: Current status and emerging initiatives,’’ Proc. IEEE, vol. 106,
High Perform. Comput., Netw., Storage Anal., Nov. 2020, pp. 1–14.
no. 12, pp. 2313–2330, Dec. 2018.
[25] I. Bae, B. Harris, H. Min, and B. Egger, ‘‘Auto-tuning CNNs for coarse- [44] V. Strassen, ‘‘Gaussian elimination is not optimal,’’ Numer. Math., vol. 13,
grained reconfigurable array-based accelerators,’’ IEEE Trans. Comput.- no. 4, pp. 354–356, 1969.
Aided Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2301–2310, [45] A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes,
Nov. 2018. M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz,
[26] E. Wang, J. J. Davis, P. Y. K. Cheung, and G. A. Constantinides, ‘‘LUTNet: D. Silver, D. Hassabis, and P. Kohli, ‘‘Discovering faster matrix multipli-
Learning FPGA configurations for highly efficient neural network infer- cation algorithms with reinforcement learning,’’ Nature, vol. 610, no. 7930,
ence,’’ IEEE Trans. Comput., vol. 69, no. 12, pp. 1795–1808, Dec. 2020. pp. 47–53, Oct. 2022.
[27] H. Ye, X. Zhang, Z. Huang, G. Chen, and D. Chen, ‘‘HybridDNN: [46] A. D. Booth, ‘‘A signed binary multiplication technique,’’ Quart. J. Mech.
A framework for high-performance hybrid DNN accelerator design and Appl. Math., vol. 4, pp. 236–240, Jan. 1951.
implementation,’’ in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), [47] J. E. Volder, ‘‘The CORDIC trigonometric computing technique,’’ IRE
Jul. 2020, pp. 1–6. Trans. Electron. Comput., vol. EC-8, no. 3, pp. 330–334, Sep. 1959.
[28] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-M. Hwu, and [48] E. Liberty and S. W. Zucker, ‘‘The mailman algorithm: A note on matrix-
D. Chen, ‘‘DNNBuilder: An automated tool for building high-performance vector multiplication,’’ Inf. Process. Lett., vol. 109, pp. 179–182, Jan. 2009.
DNN hardware accelerators for FPGAs,’’ in Proc. IEEE/ACM Int. Conf. [49] N. Maheshwari and S. S. Sapatnekar, Clock Skew Optimization. Boston,
Comput.-Aided Design (ICCAD), Nov. 2018, pp. 1–8. MA, USA: Springer, 1999, pp. 33–64, doi: 10.1007/978-1-4615-5637-4_3.
[29] A. Demidovskij and E. Smirnov, ‘‘Effective post-training quantization of [50] S. G. Mallat and Z. Zhang, ‘‘Matching pursuit with time-frequency dic-
neural networks for inference on low power neural accelerator,’’ in Proc. tionaries,’’ IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415,
Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1–7. Dec. 1993.


[51] R. Müller, ‘‘Linear computation coding inspired by the Lempel-Ziv RALF MÜLLER (Fellow, IEEE) received the
algorithm,’’ in Proc. IEEE Inf. Theory Workshop (ITW), Nov. 2022, Dipl.-Ing. and Dr.-Ing. (Hons.) degrees from
pp. 606–611. Friedrich- Alexander-Universität (FAU) Erlangen-
[52] R. M. Gray and D. L. Neuhoff, ‘‘Quantization,’’ IEEE Trans. Inf. Theory, Nürnberg, in 1996 and 1999, respectively.
vol. 44, no. 6, pp. 2325–2383, Oct. 1998. From 2000 to 2004, he has directed a Research
[53] (1993). I. S. C/DA. IEEE 1076–1993. Accessed: Oct. 17, 2022. [Online]. Group at the Telecommunications Research Cen-
Available: https://standards.ieee.org/ieee/1076/1611/ ter, Vienna, Austria, and taught as an Adjunct Pro-
[54] M. Naumov et al., ‘‘Deep learning recommendation model for person-
fessor at TU Wien. In 2005, he was appointment as
alization and recommendation systems,’’ 2019, arxiv:1906.00091, doi:
a Full Professor at the Department of Electronics
10.48550/ARXIV.1906.00091.
[55] K. Khalil, A. Kumar, and M. Bayoumi, ‘‘Reconfigurable hardware design and Telecommunications, Norwegian University
approach for economic neural network,’’ IEEE Trans. Circuits Syst. II, Exp. of Science and Technology, Trondheim, Norway. In 2013, he joined the
Briefs, vol. 69, no. 12, pp. 5094–5098, Dec. 2022. Institute for Digital Communications at FAU in Erlangen, Germany. He was
a co-recipient of the Leonard G. Abraham Prize from the IEEE Communica-
tions Society. He was presented awards for his dissertation by the Vodafone
ALEXANDER LEHNERT received the mas- Foundation for Mobile Communications and the German Information Tech-
ter’s degree in computer science from the nology Society (ITG). He received the ITG Award for the paper ‘‘A Random
Friedrich-Alexander University Erlangen- Matrix Model for Communication via Antenna Arrays.’’ He was also a
Nürnberg (FAU), Germany, in 2022. He is co-recipient of the Philipp-Reis Award. He served as an Associate Editor
currently a Researcher at the Brandenburg Univer- for the IEEE TRANSACTIONS ON INFORMATION THEORY, from 2003 to 2006, and
sity of Technology Cottbus-Senftenberg (BTU), an Executive Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS,
Germany. His main research interest includes from 2014 to 2016.
development and optimization of implementations
of machine learning algorithms, with a focus on
reconfigurable hardware.

PHILIPP HOLZINGER received the master’s


degree in computer science from Friedrich-
Alexander University Erlangen-Nürnberg (FAU),
Germany, in 2017. He is currently a Researcher
at the Chair of computer architecture, FAU. His
research interest includes the design of heteroge-
neous system architectures, with a focus on recon-
MARC REICHENBACH (Member, IEEE) received
figurable and near-memory computing.
the Diploma degree in computer science from
Friedrich-Schiller University Jena, Germany,
in 2010, and the Ph.D. degree from Friedrich-
SIMON PFENNING received the master’s degree Alexander University Erlangen-Nürnberg (FAU),
in information and communication technology Germany, in 2017. From 2017 to 2021, he worked
from Friedrich-Alexander University Erlangen- as a Postdoctoral Researcher at the Chair of com-
Nürnberg (FAU), Germany, in 2019. He currently puter architecture, FAU. Since 2021, he has been
works as a Researcher at the Chair of computer heading the Chair of computer engineering at the
architecture, FAU. His research interest includes Brandenburg University of Technology Cottbus-
the development and optimization of hardware Senftenberg (BTU), Germany, as a Substitute Professor. His research inter-
platforms for machine learning. ests include novel computer architectures, memristive computing, and smart
sensor architectures for varying application fields.
