

Karatsuba Matrix Multiplication and its Efficient Custom Hardware Implementations

Trevor E. Pogue and Nicola Nicolici, Senior Member, IEEE

Abstract—While the Karatsuba algorithm reduces the complexity of large integer multiplication, the extra additions required minimize its benefits for smaller integers of more commonly-used bitwidths. In this work, we propose the extension of the scalar Karatsuba multiplication algorithm to matrix multiplication, showing how this maintains the reduction in multiplication complexity of the original Karatsuba algorithm while reducing the complexity of the extra additions. Furthermore, we propose new matrix multiplication hardware architectures for efficiently exploiting this extension of the Karatsuba algorithm in custom hardware. We show that the proposed algorithm and hardware architectures can provide real area or execution time improvements for integer matrix multiplication compared to scalar Karatsuba or conventional matrix multiplication algorithms, while also supporting implementation through proven systolic array and conventional multiplier architectures at the core. We provide a complexity analysis of the algorithm and architectures and evaluate the proposed designs both in isolation and in an end-to-end deep learning accelerator system compared to baseline designs and prior state-of-the-art works implemented on the same type of compute platform, demonstrating their ability to increase the performance-per-area of matrix multiplication hardware.

Index Terms—Hardware, Computer architecture, Systolic arrays, Performance, Throughput, Computational modeling

I. INTRODUCTION

THE demand for optimized hardware acceleration of general matrix multiplication (GEMM) continues to drive innovation in the field of hardware design for exploiting the inherent parallelism to speed up computation. At a certain point, however, after the known parallelism and system-level optimizations are exhausted and technology scaling slows to a halt, there is an accelerator wall which limits further progress on the implementation side [1]. A less-explored direction for continuing advancement beyond this wall is through reducing the workload at the algebraic level, by computing the same result from a re-arranged compute pattern requiring fewer or cheaper operations to be performed in hardware.

Multiply-accumulate (MAC) units are commonly the area-dominant computational resource in GEMM and deep learning accelerators [2], [3], [4], and due to this, an accelerator's throughput can be directly limited by how many multipliers its hardware budget can afford. As a result, surpassing this performance per multiplier limit has been focused on recently with minimal filtering algorithms applied to convolutional neural networks [2], [5], as well as fast inner-product algorithms for GEMM and machine learning workloads [6]. Along this same direction, the Karatsuba algorithm [7] can also theoretically be used to reduce the complexity of integer multiplication. However, the extra addition operations it introduces can increase its execution time in general-purpose computers or limit its area reduction in custom multiplier circuits for smaller integers of more commonly-used bitwidths [8], [9].

In this work, we show how the scalar Karatsuba multiplication algorithm can be extended to integer matrix multiplication, after which the impact and complexity of the extra additions is reduced. Furthermore, we investigate and present new fixed-precision and precision-scalable hardware architectures for efficiently exploiting the Karatsuba algorithm extended to matrix multiplication (referred to as Karatsuba matrix multiplication or KMM), showing how the proposed algorithm and hardware architectures can provide real area or execution time reductions for integer matrix multiplication compared to scalar Karatsuba or conventional matrix multiplication.

The proposed architectures can also be implemented using proven systolic array and conventional multiplier architectures at their core, maintaining all the implementation benefits of these architectures. Systolic arrays, which we will also refer to as matrix multiplication units (MXUs) for convenience, are an effective choice for use in GEMM accelerators as they significantly reduce the required memory traffic and can reach high clock frequencies due to their short and regular interconnects. Systolic-array architectures have been used in state-of-the-art GEMM and deep learning accelerators such as the Tensor Processing Unit (TPU) [3], [4], [10], among others [6], [11].

In summary, our key contributions are the following:
• We propose the Karatsuba matrix multiplication (KMM) algorithm and carry out a complexity analysis of the algorithm compared to conventional scalar Karatsuba and matrix multiplication algorithms to facilitate further future investigations of potential applications and hardware implementations of KMM. We also identify complexity shortcomings of KMM that restrict its benefits in hardware and show how this is mitigated when KMM is combined with an alternative accumulation algorithm.
• We present a new family of hardware architectures for efficiently exploiting KMM in custom hardware. We then model the area or execution time benefits of the KMM architectures and evaluate the proposed architectures both in isolation and in an end-to-end accelerator system compared to baseline designs and prior state-of-the-art works implemented on the same type of compute platform.

T. E. Pogue and N. Nicolici are with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON, L8S 4L8, Canada. Email: [email protected]; [email protected]

©2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works. Accepted for publication in IEEE Transactions on Computers. DOI: 10.1109/TC.2025.3525606
II. BACKGROUND AND RELATED WORK

A. Notation

We use the following notation throughout this article:

• ALG_n^[w]: An algorithm that operates on w-bit scalars or matrices with w-bit elements, where each scalar or matrix element is divided into n digits. For example, SM_2^[8] represents a scalar multiplication (SM) algorithm for operating on 8-bit 2-digit numbers where each digit is 4 bits wide, such as the multiplication between the hexadecimal values 0x12 × 0x10 = 0x120.
  – ALG_n or ALG: The algorithm acronym may also be specified without the subscript n and/or superscript [w] when the number of digits and/or input bitwidths are not directly relevant for the current context, and it may refer to the use of the algorithm for any value of n or w for each missing subscript and/or superscript.
• OPERATION^[w]: An arithmetic operation that works with w-bit values. For example, MULT^[w], ADD^[w], and ACCUM^[w] represent a multiplication, addition, and accumulation of w-bit values, respectively, and SHIFT^[w] represents a left or right shift by w bits.
• x^[a:b]: The value contained in bits a down to b of a scalar x. For example, the value of bits 7 down to 4 in the hexadecimal number 0xAE is equal to 0xA and is written as 0xAE^[7:4] = 0xA. Similarly, 0xAE^[3:0] = 0xE.
• C(ALG_n^[w]): The complexity of algorithm ALG in number of w-bit multiplications, additions, accumulations, and shift operations.
• C(ALG_n): The complexity of algorithm ALG in number of arithmetic operations.
• r: The number of recursion levels implemented in KSM or KMM, equal to ⌈log2 n⌉.
• d: The height and width of two matrices being multiplied.

B. Conventional n-Digit Scalar Multiplication (SM)

Fig. 1 shows the conventional method for performing 2-digit scalar multiplication, where a w-bit multiplication is split into four smaller-bit scalar multiplications before being summed to form the final product. Algorithm 1 shows the generalization of this, where n-digit multiplication is performed by carrying out the same steps recursively for each smaller-bit multiplication.

Fig. 1. SM2 algorithm illustration: a × b = (a1 ≪ w/2 + a0) × (b1 ≪ w/2 + b0) = a1b1 ≪ w + (a1b0 + a0b1) ≪ w/2 + a0b0.

Algorithm 1 Conventional n-Digit Scalar Multiplication.
1: function SM_n^[w](a, b)
2:   if (n > 1) then
3:     a1 = a^[w−1:⌈w/2⌉]
4:     a0 = a^[⌈w/2⌉−1:0]
5:     b1 = b^[w−1:⌈w/2⌉]
6:     b0 = b^[⌈w/2⌉−1:0]
7:     c1 = SM_{n/2}^[⌊w/2⌋](a1, b1)
8:     c10 = SM_{n/2}^[⌈w/2⌉](a1, b0)
9:     c01 = SM_{n/2}^[⌈w/2⌉](a0, b1)
10:    c0 = SM_{n/2}^[⌈w/2⌉](a0, b0)
11:    c = c1 ≪ w
12:    c += (c01 + c10) ≪ ⌈w/2⌉
13:    c += c0
14:  else
15:    c = a × b
16:  end if
17:  return c
18: end function

C. Karatsuba Scalar Multiplication (KSM)

Fig. 2 shows the Karatsuba algorithm [7] for 2-digit scalar multiplication, where a w-bit multiplication is split this time into three smaller-bit multiplications before being summed to form the final product. Algorithm 2 shows the generalization of this, where n-digit multiplication is performed by carrying out the same steps recursively for each smaller-bit multiplication. KSM-based low-bitwidth accurate integer multiplier circuits in prior works have shown some area benefits for input bitwidths in the range of 64 bits or less, with minimal area improvements in the smallest ranges of 16 bits [8], [9].

Fig. 2. KSM2 algorithm illustration: a × b = a1b1 ≪ w + ((a1 + a0)(b1 + b0) − a1b1 − a0b0) ≪ w/2 + a0b0. Compared to SM2, KSM2 requires only 3 single-digit multiplications; however, it requires 3 more additions, increasing the overall operation count.

D. Conventional n-Digit Matrix Multiplication (MM)

A conventional matrix multiplication algorithm computes C = AB for A of size M × K and B of size K × N, where each element c_{i,j} of C is calculated as follows:

  c_{i,j} = \sum_{k=1}^{K} a_{i,k} b_{k,j} .   (1)
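To make the digit-splitting of Figs. 1 and 2 concrete before moving on to matrices, the following Python sketch implements the SM2 and KSM2 decompositions. It is for illustration only (the function names are ours, and it assumes unsigned operands and an even bitwidth w so that both digits are w/2 bits wide); it is not part of the evaluated hardware designs.

def sm2(a, b, w):
    # Fig. 1: conventional 2-digit multiplication using 4 half-width products.
    h = w // 2                              # digit width, assuming w is even
    a1, a0 = a >> h, a & ((1 << h) - 1)
    b1, b0 = b >> h, b & ((1 << h) - 1)
    return (a1 * b1 << w) + ((a1 * b0 + a0 * b1) << h) + a0 * b0

def ksm2(a, b, w):
    # Fig. 2: Karatsuba 2-digit multiplication using only 3 half-width products.
    h = w // 2
    a1, a0 = a >> h, a & ((1 << h) - 1)
    b1, b0 = b >> h, b & ((1 << h) - 1)
    c1, c0 = a1 * b1, a0 * b0
    cs = (a1 + a0) * (b1 + b0)              # operands are (h+1) bits wide
    return (c1 << w) + ((cs - c1 - c0) << h) + c0

# Matches the notation example: 0x12 x 0x10 = 0x120 with w = 8.
assert sm2(0x12, 0x10, 8) == ksm2(0x12, 0x10, 8) == 0x120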
Algorithm 2 n-Digit Karatsuba Scalar Multiplication.
1: function KSM_n^[w](a, b)
2:   if (n > 1) then
3:     a1 = a^[w−1:⌈w/2⌉]
4:     a0 = a^[⌈w/2⌉−1:0]
5:     b1 = b^[w−1:⌈w/2⌉]
6:     b0 = b^[⌈w/2⌉−1:0]
7:     as = a1 + a0
8:     bs = b1 + b0
9:     c1 = KSM_{n/2}^[⌊w/2⌋](a1, b1)
10:    cs = KSM_{n/2}^[⌈w/2⌉+1](as, bs)
11:    c0 = KSM_{n/2}^[⌈w/2⌉](a0, b0)
12:    c = c1 ≪ w
13:    c += (cs − c1 − c0) ≪ ⌈w/2⌉
14:    c += c0
15:  else
16:    c = a × b
17:  end if
18:  return c
19: end function

Algorithm 3 Conventional n-Digit Matrix Multiplication.
1: function MM_n^[w](A, B)
2:   if (n > 1) then
3:     A1 = [ a_{i,k}^[w−1:⌈w/2⌉] ]  (the M × K matrix of upper bit slices of A)
4:     A0 = [ a_{i,k}^[⌈w/2⌉−1:0] ]  (the M × K matrix of lower bit slices of A)
5:     B1 = [ b_{k,j}^[w−1:⌈w/2⌉] ]  (the K × N matrix of upper bit slices of B)
6:     B0 = [ b_{k,j}^[⌈w/2⌉−1:0] ]  (the K × N matrix of lower bit slices of B)
7:     C1 = MM_{n/2}^[⌊w/2⌋](A1, B1)
8:     C10 = MM_{n/2}^[⌈w/2⌉](A1, B0)
9:     C01 = MM_{n/2}^[⌈w/2⌉](A0, B1)
10:    C0 = MM_{n/2}^[⌈w/2⌉](A0, B0)
11:    C = C1 ≪ w
12:    C += (C10 + C01) ≪ ⌈w/2⌉
13:    C += C0
14:  else
15:    C = MM_1^[w](A, B)
16:  end if
17:  return C
18: end function

The method in Fig. 1 can also be extended to matrix multiplication as illustrated in Fig. 3, where four separate partial-product matrix multiplications are performed between matrices each containing bit slices of every element, and they are later summed together to form the final matrix product. Algorithm 3 shows the generalization of this, where n-digit matrix multiplication is performed by carrying out the same steps recursively for each smaller-bit matrix multiplication. The elements in matrices A0 and B0 contain the lower bits (bits ⌈w/2⌉ − 1 down to 0) of every element in the A and B matrices, while A1 and B1 contain the upper bits (bits w − 1 down to ⌈w/2⌉) of every element in matrices A and B. This allows for w-bit matrix multiplication using smaller m-bit multipliers. The MM1 algorithm on line 15 of Algorithm 3 is a conventional matrix multiplication algorithm such as (1).

Fig. 3. MM2 algorithm illustration: [A][B] = ([A1] ≪ w/2 + [A0]) × ([B1] ≪ w/2 + [B0]) = [A1B1] ≪ w + ([A1B0] + [A0B1]) ≪ w/2 + [A0B0]. The 4 single-digit matrix multiplications of complexity O(d³) dominate the O(d²) complexity of the matrix additions.

E. Precision-Scalable Architectures

Precision-scalable architectures allow a way to efficiently execute workloads across multiple input precisions for applications where the input bitwidths are expected to vary. Machine learning (ML) acceleration is one example of a use-case for precision-scalable hardware architectures, where neural networks can perform the majority of the inference on reduced-bitwidth operations with little to no loss in accuracy, but the bitwidths required to provide sufficient accuracy vary across different deep neural network models, applications, and between individual layers within the same neural network model [12]. For example, some neural network models can be executed with high accuracy even when performing the majority of the operations on small bitwidths; however, a smaller portion of the layers still need to be computed on larger bitwidths to preserve accuracy [12]. Therefore, a fixed-bitwidth accelerator must make a trade-off between either supporting only lower bitwidths while reducing the model's accuracy, or supporting larger bitwidths for higher accuracy but under-utilizing the MAC units during the majority of computation as most stages require only lower-bit inputs.

Precision-scalable architectures address this trade-off by providing architectures which can more efficiently support execution of varying input bitwidths [12], [13], [14]. One approach is to use MAC units consisting of multiple smaller-bitwidth multipliers [13] which can either be individually used to multiply/accumulate multiple smaller-bitwidth products, or they can be reconfigured to collectively multiply/accumulate fewer larger-bitwidth products per clock cycle. Another type of approach, referred to as bit-serial architectures [14], is to have MAC arrays which repeatedly perform fixed-size smaller-bitwidth vector operations on different bit slices of the vectors, summing up the separate vector products to get the final full-bitwidth vector result.
The hardware algorithms used in prior works on precision-scalable architectures [12], [13], [14] use variations of the SM and MM algorithms shown in Algorithms 1 and 3 to combine partial products and compute variable-bitwidth w-bit matrix products using smaller m-bit multipliers, where the number of m-bit multiplications and the minimum possible execution time when fully utilizing the m-bit multipliers scale quadratically with the input bitwidths w. However, as shown later, the minimum possible execution time of a precision-scalable KMM architecture scales less than quadratically with the input bitwidths w.

III. KARATSUBA MATRIX MULTIPLICATION (KMM)

In this section, we formally define KMM, analyze its complexity compared to conventional scalar Karatsuba and matrix multiplication algorithms, identify complexity shortcomings of the KMM algorithm that restrict its benefits in hardware, and show how this is mitigated when combining KMM with an alternative accumulation algorithm.

A. KMM Definition

Fig. 4 shows the 2-digit Karatsuba scalar multiplication algorithm [7] from Fig. 2 extended to matrix multiplication, analogously to how Fig. 3 extends conventional 2-digit scalar multiplication in Fig. 1 to matrix multiplication. Algorithm 4 shows the generalization of this, where n-digit Karatsuba matrix multiplication is performed by carrying out the same steps recursively for each smaller-bit matrix multiplication. In Algorithm 4, the full matrix product is split into three separate partial-product matrix multiplications between matrices each containing bit slices of every element. The elements in matrices A0 and B0 contain the lower bits (bits ⌈w/2⌉−1 down to 0) of every element in the A and B matrices, while A1 and B1 contain the upper bits (bits w−1 down to ⌈w/2⌉) of every element in matrices A and B. The As and Bs matrices are formed by summing A1 + A0 and B1 + B0, and therefore their elements have a bitwidth of ⌈w/2⌉ + 1. The partial-product matrices are then summed analogously to how the partial scalar products are summed after multiplication in KSM from Algorithm 2.

Fig. 4. KMM2 algorithm illustration: [A][B] = [A1B1] ≪ w + ([As Bs] − [A1B1] − [A0B0]) ≪ w/2 + [A0B0], where [As] = [A1] + [A0] and [Bs] = [B1] + [B0]. Compared to the scalar algorithms KSM2 versus SM2, the increase in number of additions with complexity O(d²) in KMM2 versus MM2 is now insignificant relative to the reduction to 3 instead of 4 single-digit matrix multiplications of complexity O(d³), allowing the overall number of operations in KMM2 to be less than conventional MM2.

Algorithm 4 n-Digit Karatsuba Matrix Multiplication.
1: function KMM_n^[w](A, B)
2:   if (n > 1) then
3:     A1 = [ a_{i,k}^[w−1:⌈w/2⌉] ]  (the M × K matrix of upper bit slices of A)
4:     A0 = [ a_{i,k}^[⌈w/2⌉−1:0] ]  (the M × K matrix of lower bit slices of A)
5:     B1 = [ b_{k,j}^[w−1:⌈w/2⌉] ]  (the K × N matrix of upper bit slices of B)
6:     B0 = [ b_{k,j}^[⌈w/2⌉−1:0] ]  (the K × N matrix of lower bit slices of B)
7:     As = A1 + A0
8:     Bs = B1 + B0
9:     C1 = KMM_{n/2}^[⌊w/2⌋](A1, B1)
10:    Cs = KMM_{n/2}^[⌈w/2⌉+1](As, Bs)
11:    C0 = KMM_{n/2}^[⌈w/2⌉](A0, B0)
12:    C = C1 ≪ w
13:    C += (Cs − C1 − C0) ≪ ⌈w/2⌉
14:    C += C0
15:  else
16:    C = MM_1^[w](A, B)
17:  end if
18:  return C
19: end function
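The following NumPy sketch shows Algorithm 4 for a single recursion level (n = 2, the KMM2 form of Fig. 4). As with the earlier sketches, it is illustrative software only, assuming unsigned elements and an even bitwidth w, with the base case MM_1 realized by the @ operator.

import numpy as np

def kmm2(A, B, w):
    # Fig. 4 / Algorithm 4 with n = 2: only 3 half-width matrix products.
    h = w // 2                                  # digit width, assuming w is even
    A1, A0 = A >> h, A & ((1 << h) - 1)
    B1, B0 = B >> h, B & ((1 << h) - 1)
    C1 = A1 @ B1                                # lines 9-11: the three sub-products
    C0 = A0 @ B0
    Cs = (A1 + A0) @ (B1 + B0)                  # operands are (h+1) bits wide
    return (C1 << w) + ((Cs - C1 - C0) << h) + C0   # lines 12-14

rng = np.random.default_rng(1)
A = rng.integers(0, 1 << 8, size=(8, 8), dtype=np.int64)
B = rng.integers(0, 1 << 8, size=(8, 8), dtype=np.int64)
assert np.array_equal(kmm2(A, B, 8), A @ B)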
B. KMM Complexity Analysis

In this subsection, we derive the complexity of KMM and compare it to the complexity of the conventional MM and KSM algorithms. To do this, we decompose each algorithm's complexity into the number of w-bit multiplications, additions, and shift operations. This provides a general technology-agnostic foundation for evaluating different possible KMM hardware implementations and modelling the costs and benefits of implementing the algorithm in hardware across different possible implementation technologies, where the cost of each type of operation may vary depending on the implementation platform used. For example, implementations on FPGA may result in multipliers mapping to DSP units and additions and accumulations mapping to soft look-up-table (LUT) and register resources, whereas ASIC implementations will result in different costs and trade-offs than this for each type of operation.

Additionally, while the main focus of this work is on leveraging KMM in custom hardware designs, we also compare KMM's complexity more simply in number of arithmetic operations to allow modelling the time complexity of KMM execution on general-purpose hardware containing fixed operator word sizes. This analysis (plotted in Fig. 5) indicates that KMM requires significantly fewer operations to execute large-integer matrix multiplication on general-purpose hardware than conventional KSM or MM algorithms. This is relevant when the matrix element bitwidths are larger than the word size of the general-purpose hardware operators, for example, inputs larger than 32 bits when executing on a CPU containing arithmetic logic units (ALUs) that support 32-bit inputs.

1) MM Complexity: The complexity of conventional n-digit MM between two matrices of size d×d is derived by counting the number of operations that are performed in Algorithm 3:

  C(MM_n^[w]) = C(MM_{n/2}^[⌊w/2⌋]) + 3 C(MM_{n/2}^[⌈w/2⌉])
              + d² (ADD^[w+wa] + 2 ADD^[2w+wa])
              + d² (SHIFT^[w] + SHIFT^[⌈w/2⌉])   (2a)
  C(MM_1^[w]) = d³ (MULT^[w] + ACCUM^[2w]) .   (2b)

Typically, ACCUM^[2w] = ADD^[2w+wa], where wa is an additional bitwidth added to account for accumulation. However, in Section III-C, we discuss a method for reducing the complexity of the accumulations to be less than this.

The ADD^[w+wa] terms in (2a) come from the additions forming the (C10 + C01) term on line 12 of Algorithm 3. Here, the bitwidth of the C10 and C01 elements is w + wa because they are accumulations of w-bit products of ⌊w/2⌋ and ⌈w/2⌉-bit values. The two ADD^[2w+wa] terms in (2a) come from the additions to C on lines 12 and 13 of Algorithm 3. The bitwidth of these additions is kept on 2w+wa bits since C results in accumulations of 2w-bit products of w-bit values.

2) KSM Complexity: The complexity of KSM is derived by counting the operations performed in Algorithm 2:

  C(KSM_n^[w]) = 2 (ADD^[2w] + ADD^[⌈w/2⌉] + ADD^[2⌈w/2⌉+4])
               + SHIFT^[w] + SHIFT^[⌈w/2⌉]
               + C(KSM_{n/2}^[⌊w/2⌋]) + C(KSM_{n/2}^[⌈w/2⌉+1])
               + C(KSM_{n/2}^[⌈w/2⌉])   (3a)
  C(KSM_1^[w]) = MULT^[w] .   (3b)

The two ADD^[⌈w/2⌉] terms in (3a) come from the ⌈w/2⌉-bit additions forming the as and bs terms on lines 7 and 8 of Algorithm 2. The two ADD^[2⌈w/2⌉+4] terms in (3a) come from forming the (cs − c1 − c0) term on line 13 of Algorithm 2, where these terms can be first summed together on 2⌈w/2⌉+4 bits before being shifted and added to the other product terms. The bitwidth 2⌈w/2⌉ + 4 is required because cs is a (2⌈w/2⌉+2)-bit product of (⌈w/2⌉+1)-bit values, and the additional two bits are to account for sign extension and subtraction of the c1 and c0 terms. The two ADD^[2w] terms in (3a) come from the additions to c on lines 13 and 14 of Algorithm 2. These additions are on 2w-bit values since c will ultimately result in the 2w-bit product of two w-bit values.

3) KSMM Complexity: To compare KSM to KMM and the other matrix multiplication algorithms, we analyze the complexity of an algorithm we refer to as KSMM. KSMM is defined as a conventional matrix multiplication algorithm as in (1), but where KSM is used for the multiplications between all elements rather than conventional scalar multiplication. KSMM then has the following complexity:

  C(KSMM_n^[w]) = d³ (C(KSM_n^[w]) + ACCUM^[2w]) .   (4)

4) KMM Complexity: The complexity of KMM is derived by counting the operations performed in Algorithm 4:

  C(KMM_n^[w]) = 2d² (ADD^[2⌈w/2⌉+4+wa] + ADD^[2w+wa])
               + d² (2 ADD^[⌈w/2⌉] + SHIFT^[w] + SHIFT^[⌈w/2⌉])
               + C(KMM_{n/2}^[⌊w/2⌋]) + C(KMM_{n/2}^[⌈w/2⌉+1])
               + C(KMM_{n/2}^[⌈w/2⌉])   (5a)
  C(KMM_1^[w]) = C(MM_1^[w]) .   (5b)

The two ADD^[⌈w/2⌉] terms in (5a) come from the ⌈w/2⌉-bit additions forming the As and Bs terms on lines 7 and 8 of Algorithm 4. The two ADD^[2⌈w/2⌉+4+wa] terms in (5a) come from forming the (Cs − C1 − C0) term on line 13 of Algorithm 4, where these terms can be first summed together on 2⌈w/2⌉ + 4 + wa bits before being shifted and added to the other product terms. The bitwidth 2⌈w/2⌉ + 4 + wa is required because the bitwidth of Cs is 2⌈w/2⌉+2+wa, since it is accumulations of (2⌈w/2⌉+2)-bit products of (⌈w/2⌉+1)-bit values, and the additional two bits are to account for sign extension and subtraction of the C1 and C0 terms. The two ADD^[2w+wa] terms in (5a) come from the additions to C on lines 13 and 14 of Algorithm 4. The bitwidth of these additions is kept on 2w + wa bits since C results in accumulations of 2w-bit products of w-bit values.

(5a) shows that KMM significantly reduces the complexity of the 8 addition and shift operations in (3a) that are performed (n/2)^{log2 3} d³ times in KSMM by reducing their occurrence by a factor of d. On the other hand, KMM trades d³ accumulations of 2w-bit values in (2b) or (4) for n^{log2 3} d³ smaller-width accumulations in (5b). However, in Section III-C we show how the penalty of this in hardware is mitigated when combining KMM with an alternative accumulation algorithm.

5) Arithmetic complexity: If only counting the number of operations without considering operation bitwidths or type, we can simplify (2) to:

  C(MM_n) = 2 n² d³ + 5 (n/2)² d² ,   (6)

(4) can be simplified to:

  C(KSMM_n) = (1 + 11 (n/2)^{log2 3}) d³ ,   (7)

and (5) can be simplified to:

  C(KMM_n) = (n/2)^{log2 3} (6 d³ + 8 d²) .   (8)

Fig. 5. Plotting (6) and (7) relative to (8) for different n with d = 64. As can be seen, KSMM_n requires over 75% more operations than KMM_n. Additionally, KMM_n and KSMM_n require exponentially fewer operations than MM_n with respect to n; however, KMM_n requires fewer operations than MM_n even starting at n = 2, while KSMM_n does not fall below MM_n until n > 4.
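The simplified counts (6)-(8) are easy to tabulate directly. The short Python sketch below (for illustration only) prints the ratios behind the Fig. 5 comparison for d = 64.

import math

def op_counts(n, d=64):
    # Simplified operation counts from (6), (7), and (8).
    k = (n / 2) ** math.log2(3)
    mm   = 2 * n**2 * d**3 + 5 * (n / 2)**2 * d**2
    ksmm = (1 + 11 * k) * d**3
    kmm  = k * (6 * d**3 + 8 * d**2)
    return mm, ksmm, kmm

for n in (2, 4, 8):
    mm, ksmm, kmm = op_counts(n)
    # MM/KMM and KSMM/KMM ratios: KMM needs fewer operations than MM already
    # at n = 2, while KSMM only drops below MM for n > 4.
    print(n, round(mm / kmm, 2), round(ksmm / kmm, 2), ksmm < mm)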
C. Mitigating the Accumulator Complexity Increase in KMM

As found in Section III-B, KMM has one penalty of trading d³ accumulations of 2w-bit values in (2b) or (4) for n^{log2 3} d³ smaller-width accumulations in (5b). In this subsection, we show how this downside is mitigated when using Algorithm 5 as the MM1 algorithm in KMM on line 16 of Algorithm 4. Algorithm 5 performs MM1 using an alternative accumulation structure that reduces the accumulation hardware complexity.

Algorithm 5 MM1 algorithm with reduced accumulator complexity used in the baseline MM1 MXUs of all compared architectures. p is defined as the number of multiplication products that are pre-accumulated on a smaller bitwidth to reduce the accumulation complexity before being added to the full-bitwidth accumulation sum. We use p = 4 in our evaluation.
1: function MM1(A, B, p)
2:   for i = 0; i < M; i++ do
3:     for j = 0; j < N; j++ do
4:       C_{i,j} = 0
5:       for k = 0; k < K; k += p do
6:         x = 0
7:         for q = 0; q < p; q++ do
8:           x += A_{i,k+q} × B_{k+q,j}
9:         end for
10:        C_{i,j} += x
11:      end for
12:    end for
13:  end for
14:  return C
15: end function

Fig. 6. The internal PE structure of the MM1 MXUs shown in Fig. 7, as well as the structure for implementing Algorithm 5 in hardware to reduce the hardware cost of the accumulator logic. p is a hardware parameter equal to the number of multiplication products that are pre-accumulated on a smaller bitwidth to reduce the accumulation complexity before being added to the full-bitwidth accumulation sum. We use p = 4 in our evaluation.

In conventional matrix multiplication, each product of w-bit elements is added to a running sum kept on 2w + wa bits, where wa = ⌈log2 d⌉ and is an extra bitwidth added to account for accumulation in order to accumulate d elements, which adds extra hardware complexity. This means that normally p accumulations of 2w-bit elements will require being added to a (2w + wa)-bit running sum, and each addition will be on 2w + wa bits and therefore contain the following complexity:

  p ACCUM^[2w] = p ADD^[2w+wa] .   (9)

However, the average bitwidth of the addition operations is reduced when using Algorithm 5 for accumulation of p elements of bitwidth 2w, because p elements are first added together in isolation on a smaller running sum requiring a bitwidth of only 2w + wp bits for keeping p elements, where wp = ⌈log2 p⌉. Only after this initial pre-sum will this result then be added to the full running sum that is kept on a larger 2w + wa bits for keeping d elements. This reduces the average bitwidth for every p accumulations to the following:

  p ACCUM^[2w] = ADD^[2w+wa] + (p − 1) ADD^[2w+wp] .   (10)

Furthermore, in systolic-array architectures, each accumulation output is buffered in a dedicated register, which adds further hardware complexity to the accumulation operation. However, the number of required accumulation registers when using Algorithm 5 is also reduced by a factor of p, as shown in the hardware implementation from Fig. 6 in Section IV-A, since the accumulation result only needs to be buffered after being added to the full running sum kept on 2w + wa bits.

IV. KMM HARDWARE ARCHITECTURES

In this section, we present a general family of hardware architectures for efficiently exploiting the KMM algorithm in hardware and derive metrics for analyzing the area or execution time benefits of the KMM architectures. The first type of KMM architecture, described in Section IV-B, is a fixed-precision architecture optimized for executing inputs that are not expected to vary in bitwidth. We then present a precision-scalable KMM architecture in Section IV-C that can more efficiently execute across multiple input precisions for applications where the input bitwidths are expected to vary.

A. Baseline MM1 Architecture

Fig. 7 shows the internal structure of each baseline MM1 MXU at the core of each KMM architecture, and Fig. 6 shows the internal structure of the processing elements (PEs) inside the MM1 MXUs. Fig. 6 also shows the structure for how Algorithm 5 from Section III-C can be implemented in hardware and how the algorithm is able to reduce the hardware cost of the accumulator logic. This accumulation structure allows for the number of (2w+wa)-bit accumulation adders and their output registers to be reduced by a factor of p, where they are instead traded for additions on lower-bitwidth values in the range of 2w to 2w + ⌈log2 p⌉ bits that do not require their output to be buffered in registers.
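For reference, here is a small Python rendering of Algorithm 5 (illustrative only, with the test values being ours): the inner pre-sum x plays the role of the narrow (2w + ⌈log2 p⌉)-bit adders in Fig. 6, and only the outer update of C needs the full 2w + wa bits and an output register.

def mm1_preaccum(A, B, p=4):
    # Algorithm 5: every p products are first summed into a narrow partial
    # sum x before being added to the full-width running sum C[i][j].
    M, K, N = len(A), len(A[0]), len(B[0])
    assert K % p == 0, "K is assumed to be a multiple of p for this sketch"
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(0, K, p):
                x = 0                   # kept on only 2w + ceil(log2 p) bits in hardware
                for q in range(p):
                    x += A[i][k + q] * B[k + q][j]
                C[i][j] += x            # only this sum needs 2w + wa bits and a register
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 1], [2, 2]]
assert mm1_preaccum(A, B) == [[12, 13], [28, 29]]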
Fig. 7. Baseline MM1 MXU architecture present at the core of the KMM architectures, provided for context. X and Y refer to the MXU width and height in number of multipliers.

B. Fixed-Precision KMM Architecture

Fig. 8 shows the proposed fixed-precision KMM architecture for executing on inputs of a fixed precision of w bits that are not expected to vary in bitwidth. Rather than having one MXU with w-bit-input multiplier units, this architecture consists of three sub-MXUs that compute matrix multiplication on either ⌊w/2⌋, ⌈w/2⌉+1, or ⌈w/2⌉-bit inputs.

The additions on lines 7 and 8 of Algorithm 4 are performed on X scalar adders at the MXU inputs. Similarly, the additions on lines 13 and 14 of Algorithm 4 are performed on Y scalar adders at the MXU outputs. Due to the nature of right/left shifting by a constant offset in custom hardware, the shift operations at the output of the MXUs do not require any area overhead. If desired, each of the three sub-MXUs can also be instantiated as another KMM MXU containing three more sub-MXUs to implement additional levels of KMM recursion. The final level of MXUs will be MM1 MXUs.

Fig. 8. Fixed-precision KMM architecture for executing on inputs of a fixed precision of w bits, consisting of three KMM_{n/2} sub-MXUs (with ⌊w/2⌋, ⌈w/2⌉+1, and ⌈w/2⌉-bit inputs) feeding the KMM Post-Adder Unit.

Fig. 9. KMM Post-Adder Unit from Fig. 8 for executing C1_{i,:} ≪ w + (Cs_{i,:} − C1_{i,:} − C0_{i,:}) ≪ ⌈w/2⌉ + C0_{i,:}.

C. Precision-Scalable KMM Architecture

Fig. 10 shows the proposed precision-scalable KMM architecture for implementing one level of KMM recursion. This architecture can more efficiently use m-bit-input multipliers to execute across varying input precisions of bitwidth w for applications where the input bitwidths are expected to vary. Unlike in prior works [12], [13], [14], the minimum possible execution time when fully utilizing the compute resources scales less than quadratically with the input bitwidths. As discussed further in Section IV-D, the input matrices are divided into tiles and fed into the MXU one-by-one to perform GEMM. In this architecture, each set of input matrix tiles may be read multiple times and either the MM1, MM2, or KMM2 algorithm may be executed depending on the input bitwidths w and the multiplier bitwidth m. An iteration state signal t is reset when a new set of input tiles is read and is incremented each time the same set of input tiles is re-read.

1) MM1 and MM2 Mode: If w ≤ m, the architecture will execute the MM1 algorithm, bypassing any MXU input/output addition or shifting steps, A0 and B0 will be fed into the MXU as inputs, and each set of input tiles is read only once.

If 2m − 2 < w ≤ 2m, the architecture will execute the MM2 algorithm and each set of input matrix tiles will be read a total of four times before proceeding to the next set of input tiles. The MM2 algorithm is used instead of KMM2 for this input bitwidth range because the bitwidth of the elements in the As and Bs matrices in Algorithm 4 would be m+1, which would be too large by 1 bit to fit onto the m-bit multipliers in the MXU. In each read for this input bitwidth range, the MXU will accept either the A1 and B1 inputs or the A0 and B0 inputs depending on the tile read iteration t. A1 and B1 will contain bits 2m − 1 down to m of the A and B matrix elements. A0 and B0 will contain bits m − 1 down to 0 of the A and B matrix elements.
Fig. 10. Precision-scalable KMM architecture for more efficiently using m-bit-input multipliers to execute across varying input precisions of bitwidth w for applications where the input bitwidths are expected to vary.

The MXU output vectors Cx_{i,:} in Fig. 10 will be equal to either (C1_{i,:} ≪ 2m), (C10_{i,:} ≪ m), (C01_{i,:} ≪ m), or C0_{i,:} depending on the tile read iteration t to incrementally execute lines 11-13 of Algorithm 3 throughout the tile read iterations, where m is considered equivalent to the value of ⌈w/2⌉ in Algorithm 3. Specifically, depending on the tile read iteration t, the MXU output vectors will be equal to (C1_{i,:} ≪ 2m) to form the addition on line 11 of Algorithm 3, C0_{i,:} to form the addition on line 13, and separately (C10_{i,:} ≪ m) or (C01_{i,:} ≪ m) to collectively form the addition on line 12. Each partial matrix tile product will need to be accumulated with prior ones outside of the MXU; however, this is the same functionality already present in GEMM, where multiple matrix tile products must be summed to form a final matrix product, and this functionality will therefore already be present in GEMM accelerators outside of the MXU, such as in the GEMM and ML accelerator system from our prior work [6].

2) KMM2 Mode: If m < w ≤ 2m − 2, the architecture will execute the KMM2 algorithm and each set of input matrix tiles will be read a total of three times before proceeding to the next set of input tiles. For each read, the MXU will accept or form either the A1 and B1 inputs, the As and Bs inputs, or the A0 and B0 inputs depending on the tile read iteration t. A1 and B1 will contain bits 2(m − 1) − 1 down to m − 1 of the A and B matrix elements. A0 and B0 will contain bits m − 2 down to 0 of the A and B matrix elements. The MXU output vectors Cx_{i,:} in Fig. 10 will be equal to either ((C1_{i,:} ≪ 2(m − 1)) − (C1_{i,:} ≪ (m − 1))), (Cs_{i,:} ≪ (m − 1)), or (C0_{i,:} − (C0_{i,:} ≪ (m − 1))) depending on the tile read iteration t to incrementally execute lines 12-14 of Algorithm 4 throughout the tile read iterations, where m − 1 is considered equivalent to the value of ⌈w/2⌉ in Algorithm 4. Each partial matrix tile product will need to be accumulated with prior ones outside of the MXU; however, this functionality will already be present in GEMM accelerators as explained above in Section IV-C1.

A precision-scalable MM2 architecture can also be implemented that has a similar structure as the precision-scalable KMM architecture, except that it will only either execute the MM1 algorithm if w ≤ m or the MM2 algorithm if m < w ≤ 2m. We also note that a precision-scalable KSMM architecture exploiting KSM2 would not be as efficient to implement in hardware compared to a precision-scalable KMM architecture. This is because, in addition to the extra adders that would be required at the outputs/inputs of every multiplier as discussed in Section III-B, multiplexers would also have to be placed at the outputs/inputs of every multiplier in the MXU for output/input arbitration depending on the width of the inputs. In contrast, the KMM architecture reduces this extra adder complexity as already discussed, and it can employ an efficient, more conventional systolic array at the core not requiring multiplexers surrounding each multiplier.
D. System Integration

In order to perform GEMM on an MXU and multiply matrices of arbitrary sizes that can be larger than the MXU dimensions, the input matrices are divided into tiles and fed to the MXU one-by-one. Following each tile multiplication, the partial tile products are accumulated outside of the MXU to generate each final matrix product tile. Prior to each tile multiplication, a B tile is loaded into the MXU. It then remains in place as the A tile flows through the MXU producing the tile product, during which a new A_{i,:} vector is fed into the MXU each clock cycle. Additionally, to hide the latency of loading B tiles, the MXU PEs each contain one extra b buffer to load the next B tile into the MXU as the current tile is being multiplied, where each extra b buffer in the PEs will hold one individual element of the next B tile after it is loaded.

The presented KMM architectures are illustrated for unsigned integer inputs; however, if the inputs are signed, a 1-dimensional adder vector can be used to add a constant offset to the inputs of an MXU to convert them to unsigned. The zero-point adjuster method from our previous work [6] can then be used to efficiently eliminate the effects of this constant offset in the matrix products before exiting the MXU.

We use an ML accelerator system design based on the one from our previous work [6], which has open-source code available [15], to house and evaluate the KMM and baseline MXU architectures. We were able to swap the precision-scalable KMM MXU architecture from Fig. 10 into our system design [6] in place of the free-pipeline fast inner-product (FFIP) MXU. This change was mostly seamless but also required updates to the memory system such that each set of input matrix tiles can optionally be re-read up to three or four times before proceeding to the next set of input tiles. The number of times that the matrix tiles are re-read and the purpose for this is explained in Section IV-C.

E. Multiplier Compute Efficiency

In this subsection, we define a performance-per-area metric called the multiplier compute efficiency in (12), which we use to compare the KMM architecture against baseline designs and prior works. The metric is used to compare the amount of computational work that can be performed per compute area regardless of the clock frequency or input bitwidths. The importance of this property is expanded upon more later in this subsection, as well as in Section V-A.

The hardware complexity of fixed-point multipliers typically scales quadratically with the input bitwidth compared to linearly for adders and registers [16], [17], [18], causing the hardware footprint of multipliers to dominate that of adders and registers. Due to this, multipliers and MAC units are commonly the area-dominant computational resources in deep learning and GEMM-based accelerators [2], [3], [4]. Therefore, we derive a performance-per-area metric defined below for quantifying how much the algebraic optimizations exploited in an architecture reduce the computational complexity of the area-dominant operations (multiplications) and measure how effectively an architecture can utilize these resources relative to a conventional design using no algebraic optimizations:

  mults/multiplier per clock cycle = ((mults/s) / #multipliers) / f ,   (11)

where mults/s above is measured by taking the number of multiplications required to carry out an execution using conventional algebra and dividing it by the measured execution time, #multipliers is the number of instantiated multipliers in the design, and f is the clock frequency that the hardware design is operating at.

The throughput metric in (11) measures the number of w-bit multiplications being performed, where w is the algorithm input bitwidth. However, in order to execute KMM in hardware, the algorithm input bitwidths w must be larger than the multiplier bitwidths, and the number of larger w-bit multiplications that can be performed per multiplier will be lower than the actual effective number of multiplications being performed per multiplier. Therefore, the maximum achievable value for the metric from (11) will vary depending on the input bitwidths w and is not ideal for reflecting the true amount of computational work being performed per multiplier regardless of the input widths.

To address this, we can instead measure (11) directly in terms of effective m-bit multiplications being performed per multiplier, where m may be smaller than the algorithm input bitwidths w. This derives the following metric for measuring the true amount of effective multiplications being performed per multiplier regardless of the algorithm input bitwidths w:

  m-bit mults/multiplier per clock cycle = ((m-bit mults/s) / #multipliers) / f ,   (12)

where m-bit mults/s above is measured by taking the number of m-bit multiplications required to carry out an execution on w-bit inputs using conventional algebra and dividing it by the measured execution time, #multipliers is the number of instantiated multipliers in the design, and f is the clock frequency that the hardware design is operating at. Conventional algorithms used in prior work to perform larger w-bit multiplications on smaller m-bit multipliers are the SM or MM algorithms (Algorithms 1 and 3). The number of m-bit multiplications required to carry out a larger w-bit multiplication using conventional algebra (i.e., SM or MM) is equal to the number of w-bit multiplications in the execution times 4^r, where r is equal to:

  r = ⌈log2 n⌉ = ⌈log2 ⌈w/m⌉⌉ .   (13)

The limit (also referred to as the roof) of the metric in (12) when executing the conventional MM algorithm in hardware is then the following, since it has no algebraic optimizations for reducing the computational complexity:

  MM_n^[w] roof = 1 m-bit mult/multiplier per clock cycle .   (14)

In contrast, the KMM algorithm requires only 3^r smaller-bitwidth multiplications to form every w-bit product rather than 4^r as in MM. Therefore, the multiplier compute efficiency can reach the following limit in KMM architectures:

  KMM_n^[w] roof = (4/3)^r m-bit mults/multiplier per clock cycle .   (15)
F. Area Unit (AU) Compute Efficiency

In this subsection, we define a performance-per-area metric in (23) that accounts for the area overhead of registers, adder units, and multipliers all in a single unit of comparison based around the area of a full adder. Using this abstracted method for modelling the circuit area allows for a general complexity analysis that is less biased towards one specific implementation platform or technology.

We first derive the relative area of adders and registers by modeling that the area of a w-bit adder will be approximately equal to the area of w full adders. We then approximate the area of a w-bit flip-flop/register relative to a w-bit adder according to approximate transistor counts of full adders versus D-flip-flops based on several sources. While there are different specific implementations for these components, we use the approximate transistor count trends for the implementations in prior work [19], [20], [21], where a standard CMOS full adder uses 28 transistors [19] and a 1-bit flip-flop consumes 18-21 transistors [20], [21] (which we then approximate as 19.5), to arrive at the general area estimation shown in (16a) and (16b) of 1 flip-flop equalling the area of approximately 19.5/28 = 0.7 full adders. So long as these area ratios vary within reasonable bounds as found in prior work [19], [20], [21], the conclusions from our results do not change.

We then model the approximate area of a w-bit multiplier circuit based on the area of a w-bit adder. While there are different possible multiplier circuit implementations, the area of multiplier circuits used in practice commonly scales quadratically with the area of a full adder [16], [17], [18], [22]. Furthermore, the KMM architectures are not tied to being implemented using one specific multiplier circuit type. Therefore, in order to provide a more general analysis and insight catering to a broader range of possible KMM implementations, we approximate the area of a multiplier based on the general trend of equalling the square of the input bitwidth times the area of a full adder, as shown in (16c). We then arrive at the following general area approximations:

  Area(ADD^[w]) = w AU   (16a)
  Area(FF^[w]) = 0.7 w AU   (16b)
  Area(MULT^[w]) = w² AU .   (16c)

Based on this, we can then derive the AU of each architecture by substituting in the areas from (16) for each of the corresponding hardware components in the architectures. The area of a baseline MM1 MXU is then as follows:

  Area(MM_1^[w]) = XY Area(MULT^[w] + 3 FF^[w] + ACCUM^[2w]) .   (17)

Here, the area of an accumulator is based on Algorithm 5 and its implementation in Fig. 6, where the number of accumulator registers and (2w+wa)-bit accumulation adders in the MXU are reduced by a factor of p. Based on this, by substituting in the areas in (16) for the adders and registers forming the accumulators in Fig. 6, every p accumulators on average then contain the following area:

  p Area(ACCUM^[2w]) = (p − 1) Area(ADD^[2w+wp])
                     + Area(ADD^[2w+wa] + FF^[2w+wa]) .   (18)

In (17)-(18), X and Y are the MXU width and height in number of multipliers, wp = ⌈log2 p⌉, and wa is the following additional bitwidth added to account for accumulation:

  wa = ⌈log2 X⌉ .   (19)

As discussed in Section IV-D, the register requirements in (17) are derived from the fact that each PE in the MM1 MXU will contain registers for buffering the a and b inputs being multiplied, as well as one additional b buffer for loading the next b tile into the MXU as the current tile is being multiplied.

The area of the KSMM architecture, which is a baseline MM1 MXU using KSM multipliers rather than conventional multipliers, is then:

  Area(KSMM_n^[w]) = XY Area(KSM_n^[w] + 3 FF^[w] + ACCUM^[2w]) ,   (20)

where:

  Area(KSM_n^[w]) = Area(ADD^[2w])
                  + 2 Area(ADD^[2⌈w/2⌉+4] + ADD^[⌈w/2⌉])
                  + Area(KSM_{n/2}^[⌊w/2⌋] + KSM_{n/2}^[⌈w/2⌉+1])
                  + Area(KSM_{n/2}^[⌈w/2⌉])   (21a)
  Area(KSM_1^[w]) = Area(MULT^[w]) .   (21b)

The addition of c0 on line 14 of Algorithm 2 is not included in this area estimate because it can be performed before line 13, where c0 will be on w bits and will not overlap with c1 ≪ w. Therefore, this addition can be performed at no cost in hardware by simply concatenating the two terms together.

The area of the KMM architecture is then:

  Area(KMM_n^[w]) = 2X Area(ADD^[⌈w/2⌉])
                  + 2Y Area(ADD^[2⌈w/2⌉+4+wa] + ADD^[2w+wa])
                  + Area(KMM_{n/2}^[⌊w/2⌋] + KMM_{n/2}^[⌈w/2⌉+1])
                  + Area(KMM_{n/2}^[⌈w/2⌉])   (22a)
  Area(KMM_1^[w]) = Area(MM_1^[w]) .   (22b)

Due to the nature of right/left shifting by a constant offset in custom hardware, the shift operations in the KSMM and KMM algorithms do not add additional area in the corresponding architectures.

We can now compare the AU compute efficiency limits of the MM1, KSMM, and KMM architectures using:

  throughput/Area Unit per clock cycle roof = (throughput roof / Area(ARCH)) / f ,   (23)

where ARCH represents one of the mentioned architectures. Throughput roofs are equal for fixed-precision MM1, KSMM, and KMM architectures with equal X/Y MXU dimensions. Therefore, the value of (23) for each architecture relative to the MM1 architecture can be found through the inverse of its AU from (17), (20), or (22) relative to the inverse of the MM1 AU in (17), as plotted later in Fig. 12.
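As a rough worked example (ours, not part of the paper's reported evaluation), the sketch below evaluates (16)-(19) and the n = 2 case of (22) in Python. It assumes the sub-MXUs of a one-level KMM design are MM1 MXUs of the same X × Y dimensions and that p = 4.

import math

def area_add(w):  return w           # (16a), in full-adder area units (AU)
def area_ff(w):   return 0.7 * w     # (16b)
def area_mult(w): return w * w       # (16c)

def area_accum_avg(w, X, p=4):
    # Average accumulator area per multiplier from (18), with wa and wp from (19).
    wa, wp = math.ceil(math.log2(X)), math.ceil(math.log2(p))
    return ((p - 1) * area_add(2 * w + wp)
            + area_add(2 * w + wa) + area_ff(2 * w + wa)) / p

def area_mm1(w, X, Y, p=4):
    # Baseline MM1 MXU area from (17).
    return X * Y * (area_mult(w) + 3 * area_ff(w) + area_accum_avg(w, X, p))

def area_kmm2(w, X, Y, p=4):
    # One-level KMM area from (22a) with n = 2 and MM1 sub-MXUs.
    h, wa = (w + 1) // 2, math.ceil(math.log2(X))
    pre  = 2 * X * area_add(h)
    post = 2 * Y * (area_add(2 * h + 4 + wa) + area_add(2 * w + wa))
    subs = area_mm1(w // 2, X, Y, p) + area_mm1(h + 1, X, Y, p) + area_mm1(h, X, Y, p)
    return pre + post + subs

# Relative AU compute efficiency of KMM2 over MM1 (throughput roofs being equal):
for w in (16, 32):
    print(w, round(area_mm1(w, 64, 64) / area_kmm2(w, 64, 64), 2))

Under these assumptions the printed ratio is close to 1 at w = 16 and grows with w, which is in line with the earlier observation that Karatsuba-based multipliers offer minimal benefit at the smallest bitwidths.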
11

that is less biased towards one specific implementation plat- Table I compares the KMM architecture with state-of-the-art
form or technology. However, it is only useful for comparing accelerators evaluated on the same FPGA family for the same
architectures which compute on inputs of the same bitwidth, instantiated multiplier bitwidths and similar neural network
and it is only derivable when knowing not only the number models. The proposed KMM architecture is very efficient,
of multipliers used in an architecture, but also the number of achieving the highest throughput and compute efficiency com-
adders and registers which is information that is not readily pared to the prior works in Table I. The KMM design here
available from prior works, but we can use it to model the achieves compute efficiencies approaching the KMM2 limit of
efficiencies of the fixed-precision KMM architecture against 1.33 when executing on bitwidths in the range of 9-14 bits that
our baseline designs which we know all of these details about. is derived in (15) and surpasses the limit of 1 in prior works
that is derived in (14).
It is also noted that the proposed systolic arrays in Tables I
B. Comparison to Prior Work and II that are integrated into a full accelerator system include
Although the theoretical concepts presented in this work are a number of other components such as memory subsystems
general and applicable to both custom integrated circuits and and control as described in our prior work [6], and these other
FPGA implementations, our example KMM implementations system components form the frequency-limiting critical path
were validated on FPGA, and we therefore compare against as opposed to the proposed systolic-array architectures.
state-of-the-art prior works that are also evaluated on FPGA. Table II shows an example of how KMM can be combined
As discussed in Section IV-D, we use an ML accelerator with other algebraic techniques to further increase compute
system design based on the one from our previous work efficiency limits. FFIP [6] provides a way to reduce the number
[6], which has open-source code available [15], to house and of required multiplications by a factor of 2, by trading half the
evaluate our example KMM and baseline MXU architectures. multiplications for cheap low-bitwidth additions. Because the
Full system-level validation of the experimental accelerator as number of required multiplications is reduced by 2, the limit
integrated into the system from our previous work [6] has been for the multiplier compute efficiency metric in (15) becomes 2
done on an Arria 10 SoC Developement Kit [23] containing for FFIP, and (8/3)r for FFIP+KMM. In Table II, we combine
the Arria 10 SX 660 device by measuring throughput in real- KMM with FFIP [6] by using an FFIP MXU as the base
time. However, this device contains fewer soft logic resources MXU in the KMM architecture instead of a conventional MM1
than the Arria 10 GX 1150 used in the prior works we MXU to further increase the compute efficiency compared
compare against, and we generate compilation results for our to standalone FFIP. The FFIP+KMM architectures in Table
design on the same Arria 10 GX 1150 device used in prior II have additional memory resources instantiated compared
works for a more fair and consistent comparison. Throughput to the FFIP-only design in order to support inference on
values of our designs on the Arria 10 GX 1150 device are up to 16-bit inputs, and this also adds a penalty in the soft
then calculated using an accurate throughput estimation model logic resources and clock frequency. However, the multiplier
based on our highly deterministic and time-predictable system compute efficiency of the FFIP+KMM designs surpass the
implementation, which accurately predicts actual throughputs FFIP limit of 2, and approach the FFIP+KMM2 limit of 2.67.
measured on the Arria 10 SX 660 device available to us. Tables
I-III show throughputs for ResNet [24] neural network models. C. Comparison to Baseline Designs
It is also noted that the proposed systolic arrays in Tables I and II that are integrated into a full accelerator system include a number of other components, such as memory subsystems and control, as described in our prior work [6], and these other system components form the frequency-limiting critical path as opposed to the proposed systolic-array architectures.

Table II shows an example of how KMM can be combined with other algebraic techniques to further increase compute efficiency limits. FFIP [6] provides a way to reduce the number of required multiplications by a factor of 2 by trading half the multiplications for cheap low-bitwidth additions. Because the number of required multiplications is reduced by 2, the limit for the multiplier compute efficiency metric in (15) becomes 2 for FFIP and (8/3)^r for FFIP+KMM. In Table II, we combine KMM with FFIP [6] by using an FFIP MXU as the base MXU in the KMM architecture instead of a conventional MM1 MXU to further increase the compute efficiency compared to standalone FFIP. The FFIP+KMM architectures in Table II have additional memory resources instantiated compared to the FFIP-only design in order to support inference on up to 16-bit inputs, and this also adds a penalty in the soft logic resources and clock frequency. However, the multiplier compute efficiencies of the FFIP+KMM designs surpass the FFIP limit of 2 and approach the FFIP+KMM2 limit of 2.67.
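For clarity, the 2.67 figure follows directly from composing the two reductions stated above, namely the FFIP factor of 2 and the single-level KMM factor of 4/3:

2 × (4/3) = 8/3 ≈ 2.67.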
C. Comparison to Baseline Designs

1) Precision-Scalable Architectures: Table I includes the resource usage and performance comparison between the proposed KMM and the baseline MM architectures. The multiplier compute efficiency of KMM surpasses that of the baseline MM architecture when executing on bitwidths in the range of 9-14 bits, achieving compute efficiencies approaching the KMM2 limit of 1.33 that is derived in (15) and surpassing the limit of 1 of the baseline MM architecture and prior works that is derived in (14), validating KMM's ability to increase compute efficiency as expected from our analysis. This is also reflected in the GOPS from Table I, where the KMM architecture achieves a 1.33× speedup over MM for input bitwidths in the range of 9-14 bits.
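As a sanity check on how the Table I efficiency figures can be reproduced, the snippet below recomputes the 9-14-bit ResNet-50 entry of the KMM2 design. It assumes, for illustration, that the metric in (12) counts equivalent 8-bit multiplications per instantiated multiplier per clock cycle, with one multiplication on 9-16-bit inputs counted as four 8-bit multiplications and with GOPS counting a multiply-accumulate as two operations; the variable names are illustrative only.

```python
# Hedged reproduction of one Table I entry (KMM2, ResNet-50, 9-14-bit inputs).
gops = 716                     # throughput from Table I
freq_hz = 326e6                # KMM2 clock frequency from Table I
multipliers = 64 * 64 + 64     # MXU multipliers plus Post-GEMM Unit multipliers
macs_per_cycle = (gops * 1e9 / 2) / freq_hz      # assume 1 MAC = 2 operations
efficiency = macs_per_cycle * 4 / multipliers    # 4 equivalent 8-bit mults per MAC
print(round(efficiency, 3))    # ~1.056, matching the 1.055 entry up to rounding
```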
For illustration, Fig. 11 plots the limits of the multiplier compute efficiency metric defined in (12) from Section IV-E for the precision-scalable KMM2 architecture compared to the conventional precision-scalable MM2 architecture for X = Y = 64. As shown, the KMM architecture surpasses the MM architecture's limit of 1 for this metric, extending the limit to 1.33 for bitwidths 9-14, since the KMM2 algorithm requires only 3 m-bit multiplications for every w-bit product rather than 4 as in the MM2 algorithm.
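To make the 3-versus-4 multiplication count concrete, the following minimal software sketch shows one level of the Karatsuba matrix multiplication idea on small matrices of unsigned w-bit integers split into two m-bit halves. It illustrates the algorithm only, with illustrative function and variable names, and is not a description of the proposed systolic-array hardware.

```python
# One level of Karatsuba matrix multiplication (KMM2), software illustration only.
import numpy as np

def kmm2(A, B, w=16):
    m = w // 2
    A1, A0 = A >> m, A & ((1 << m) - 1)   # high and low halves of each entry
    B1, B0 = B >> m, B & ((1 << m) - 1)
    P0 = A0 @ B0                           # only 3 matrix multiplications of
    P2 = A1 @ B1                           # m-bit operands are needed instead
    P1 = (A1 + A0) @ (B1 + B0)             # of the 4 required by MM2
    return (P2 << (2 * m)) + ((P1 - P2 - P0) << m) + P0

rng = np.random.default_rng(0)
A = rng.integers(0, 1 << 16, (4, 4), dtype=np.int64)
B = rng.integers(0, 1 << 16, (4, 4), dtype=np.int64)
assert np.array_equal(kmm2(A, B), A @ B)
```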
TABLE I
Proposed precision-scalable KMM and baseline MM systolic-array architectures integrated into a deep learning accelerator system compared with each other and prior state-of-the-art deep learning accelerators on Arria 10 GX 1150 FPGA.

Design:               TNNLS '22 [25] | TCAD '22 [26] | Entropy '22 [27] | MM2[w,8] 64×64 | KMM2[w,8] 64×64
DSP optimization (1): Yes | Yes | No | Yes | Yes
DSPs:                 1473 | 1473 | 1503 | 1056 | 1056
ALMs:                 304K | 304K | 303K | 243K | 250K
Registers:            889K | 890K | - | 556K | 562K
Memories:             2334 | 2334 | 1953 | 2713 | 2713
Frequency (MHz):      200 | 220 | 172 | 320 | 326
Model:                ResNet-50, VGG16 | Bayes ResNet-18, Bayes VGG11 | R-CNN (ResNet-50), R-CNN (VGG16) | ResNet-50, ResNet-101, ResNet-152 | ResNet-50, ResNet-101, ResNet-152
Input bitwidth (w):   8 | 8 | 8 | 1-8 / 9-16 | 1-8 / 9-14 / 15-16
Throughput (GOPS):    1519, 1295 | 1590, 534 | 719, 865 | 2108/527, 2304/576, 2390/598 | 2147/716/537, 2347/782/587, 2435/812/609
8-bit mults per multiplier per clock cycle (2):
                      0.645, 0.550 | 0.639, 0.206 | 0.696, 0.837 | 0.792/0.792, 0.865/0.865, 0.898/0.898 | 0.792/1.055/0.792, 0.865/1.154/0.865, 0.898/1.197/0.898

Per-model values are listed in the order of the models given in the Model row, and values separated by "/" correspond to the input bitwidth ranges listed for that design.
(1) Determines if the design includes an optimization to pack two smaller-bit multiplications onto the 18-bit multipliers of the DSPs.
(2) Multiplier compute efficiency, used to compare the amount of computational work being performed per compute area regardless of the input bitwidths or clock frequency; defined in (12) from Section IV-E, relevance explained in Section V-A.
Fig. 11. Maximum achievable multiplier compute efficiencies (derived in Section IV-E) for the precision-scalable MM2 and KMM2 architectures.

2) Fixed-Precision Architectures: Table III shows synthesis results on a modern Agilex 7 FPGA device for the baseline MM1, KSMM, and proposed KMM systolic-array architectures in isolation (not integrated into a deep learning accelerator) for different input bitwidths and levels of KSM and KMM recursion. The input bitwidths are intentionally larger than the DSP units' native multiplier bitwidths and are chosen to allow for larger multiplications to be broken down into smaller multiplications of bitwidths at or just below the native widths supported by the DSPs, which house 18-bit multipliers. It is expected that the larger-bit multiplications in the MM1 designs will be mapped to smaller 16-bit multipliers, and onto fewer 16 to 18-bit multipliers in the KMM and KSMM designs. The reduction in multiplication complexity of KMM and KSMM achieved through breaking down larger multiplications into smaller-bitwidth multiplications can be seen relative to conventional approaches (evaluated through the MM1 architectures) by comparing the reduction in the number of DSP units for the KMM and KSMM designs relative to MM1. Furthermore, the reduction in addition complexity of KMM relative to KSMM can be seen in the reduction in ALMs in the KMM architectures compared to the KSMM architectures.
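One simple way to see this reduction in the Table III DSP counts is sketched below. It is only an illustrative model with a hypothetical helper name: it assumes that each DSP block contributes two base multipliers and that every level of operand splitting costs 4 sub-multiplications for MM1 but only 3 for KSMM and KMM. Under those assumptions it reproduces most of the DSP counts in Table III.

```python
# Illustrative model of the Table III DSP counts (assumptions stated above).
def dsp_estimate(levels, mults_per_level, pes=32 * 32, mults_per_dsp=2):
    base_mults_per_product = mults_per_level ** levels
    return pes * base_mults_per_product // mults_per_dsp

print(dsp_estimate(1, 4))  # MM1, 32-bit inputs        -> 2048 (Table III: 2048)
print(dsp_estimate(1, 3))  # KSMM2/KMM2, 32-bit inputs -> 1536 (Table III: 1536)
print(dsp_estimate(2, 4))  # MM1, 64-bit inputs        -> 8192 (Table III: 8704)
print(dsp_estimate(2, 3))  # KSMM4/KMM4, 64-bit inputs -> 4608 (Table III: 4608)
```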
The MM1 and KSMM architectures innately have a lower clock frequency than KMM because each multiplication performed in their PEs is expected to require n^2 or n^(log2 3) DSP units, respectively, whereas the KMM designs require only 1 DSP unit in each individual KMM systolic-array PE. This leads to a less localized design for MM1 and KSMM. In contrast, the KMM design uses multiple independent systolic arrays requiring 1 DSP unit per multiplication to perform a single 16 to 18-bit multiplication, and the DSPs in each systolic array do not require interconnections with the DSPs in other systolic arrays, leading to a more localized design. Due to this, we provide results of two design variants for each of the MM1 and KSMM architectures, where one variant contains additional pipelining registers added into the PE datapaths such that the clock frequency can reach closer to that of the KMM designs. However, it can be seen that the MM1 and KSMM designs are still unable to match the frequency of KMM even with the extra pipelining registers, especially for the 64-bit input designs.
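For reference, the n^(log2 3) count mentioned above comes from the standard scalar Karatsuba recursion, a minimal sketch of which is given below for unsigned integers under the assumption of 18-bit base multipliers (the function name is illustrative). Each recursion level replaces four half-width multiplications with three, so r levels use 3^r base multiplications instead of 4^r.

```python
# Minimal scalar Karatsuba (KSM) sketch: 3 recursive sub-multiplications per level.
def ksm(a, b, w):
    if w <= 18:                          # assume 18-bit base multipliers (DSPs)
        return a * b
    m = w // 2
    a1, a0 = a >> m, a & ((1 << m) - 1)  # split operands into high/low halves
    b1, b0 = b >> m, b & ((1 << m) - 1)
    p0 = ksm(a0, b0, m)
    p2 = ksm(a1, b1, m)
    p1 = ksm(a1 + a0, b1 + b0, m + 1)    # the sums may be one bit wider
    return (p2 << (2 * m)) + ((p1 - p2 - p0) << m) + p0

assert ksm(123456789, 987654321, 32) == 123456789 * 987654321
```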
In summary, the trend in Table III is that the KMM designs may contain more register resources than the MM1 and KSMM designs, depending on the amount of pipelining registers used; however, the KMM designs use significantly fewer ALM resources than the KSMM designs, significantly fewer DSP units than the MM1 designs, and achieve significantly higher clock frequencies than both KSMM and MM1.
TABLE II
Comparison of an FFIP [6] systolic array, which doubles performance per MAC unit, with combined FFIP+KMM precision-scalable systolic arrays when integrated into deep learning accelerator systems on Arria 10 GX 1150 FPGA.

Design:               TC '24 [6] (FFIP 64×64) | FFIP+KMM2[w,8] 64×64 | FFIP+KMM2[w,8] 64×64
DSP optimization (1): No | No | Yes
DSPs:                 1072 | 1072 | 552
ALMs:                 118K | 133K | 205K
Registers:            311K | 334K | 502K
Memories:             1782 | 2445 | 2713
Frequency (MHz):      388 | 353 | 341
Model:                ResNet-50, ResNet-101, ResNet-152 (for each design)
Input bitwidth (w):   8 | 1-8 / 9-14 / 15-16 | 1-8 / 9-14 / 15-16
Throughput (GOPS):    2529, 2752, 2838 | 2325/775/581, 2542/847/635, 2637/879/659 | 2246/749/562, 2455/818/614, 2547/849/637
8-bit mults per multiplier per clock cycle (2):
                      1.521, 1.655, 1.707 | 1.536/2.048/1.536, 1.679/2.239/1.679, 1.742/2.322/1.742 | 1.536/2.048/1.536, 1.679/2.239/1.679, 1.742/2.322/1.742

Per-model values are listed in the order ResNet-50, ResNet-101, ResNet-152, and values separated by "/" correspond to the input bitwidth ranges listed for that design.
(1) Determines if the design includes an optimization to pack two smaller-bit multiplications onto the 18-bit multipliers of the DSPs.
(2) Multiplier compute efficiency, used to compare the amount of computational work being performed per compute area regardless of the input bitwidths or clock frequency; defined in (12) from Section IV-E, relevance explained in Section V-A.
TABLE III
Comparison of proposed fixed-precision KMM and baseline MM1 and KSMM systolic-array architectures in isolation (without integration into a deep learning accelerator system) on Agilex 7 FPGA.

Design:                 MM1[32] 32×32 | MM1[32] 32×32 | KSMM2[32] 32×32 | KSMM2[32] 32×32 | KMM2[32] 32×32 | MM1[64] 32×32 | MM1[64] 32×32 | KSMM4[64] 32×32 | KSMM4[64] 32×32 | KMM4[64] 32×32
Input bitwidth:         32 | 32 | 32 | 32 | 32 | 64 | 64 | 64 | 64 | 64
DSPs:                   2048 | 2048 | 1536 | 1536 | 1536 | 8704 | 8704 | 4608 | 4608 | 4608
ALMs:                   64K | 69K | 138K | 147K | 68K | 240K | 266K | 554K | 557K | 212K
Registers:              165K | 225K | 306K | 481K | 257K | 237K | 712K | 447K | 1126K | 806K
Frequency (MHz):        450 | 569 | 386 | 537 | 622 | 203 | 341 | 147 | 345 | 552
Throughput roof (GOPS): 922 | 1165 | 791 | 1100 | 1274 | 416 | 698 | 302 | 707 | 1131

All designs in this table consume 0 memory resources and are synthesized for an Agilex 7 AGIA040R39A1E1V device.
Fig. 12 also provides a more general modelling of the performance-per-area of the KMM architectures that is less biased towards one specific implementation platform or technology, by plotting the AU compute efficiency limits derived in Section IV-F that can be achieved for the fixed-precision MM1, KSMM, and KMM architectures for different supported fixed-precision input widths and instantiated multiplier bitwidths for X = Y = 64. The KMM and KSMM architectures for each bitwidth implement as many levels of Karatsuba recursion as possible while still reducing the area, with a minimum of at least one level of Karatsuba recursion being implemented (even if that one level has a larger area than using conventional MM1). This results in one recursion level being implemented in the KSMM architectures for every bitwidth. For the KMM architectures, this results in one recursion level for bitwidths 8-32, two recursion levels for bitwidths 40-56, and three recursion levels for bitwidth 64.
Fig. 12. Maximum achievable AU compute efficiencies (derived in Section IV-F) for the fixed-precision MM1, KSMMn, and KMMn architectures.

As can be seen, the KMM architecture begins to achieve a higher throughput per Area Unit than the conventional MM1 architecture at a lower bitwidth than the KSMM architecture does, and it is consistently higher than the KSMM architecture across all input/multiplier bitwidths.
VI. CONCLUSION

In this work, we propose the extension of the scalar Karatsuba multiplication algorithm to matrix multiplication, showing how this maintains the reduction in multiplication complexity of the original Karatsuba algorithm while reducing the complexity of the extra additions. Furthermore, we propose new matrix multiplication hardware architectures for efficiently exploiting the proposed algorithm in custom hardware, showing that they can provide real area or execution time improvements for integer matrix multiplication compared to designs implementing scalar Karatsuba or conventional matrix multiplication algorithms. The proposed architectures are well suited for increasing the efficiency in acceleration of modern workloads that can decompose to large matrix multiplications on integer arithmetic, such as the computationally dominant portion of convolutional neural networks or the attention mechanism of transformer models [30]. We provide a complexity analysis of the algorithm and architectures and evaluate the proposed designs both in isolation and in an end-to-end accelerator system relative to baseline designs and prior state-of-the-art works, showing how they increase the performance-per-area of matrix multiplication hardware.

REFERENCES

[1] A. Fuchs and D. Wentzlaff, "The accelerator wall: Limits of chip specialization," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), 2019, pp. 1–14.
[2] X. Liu et al., "WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs," in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Arch. Processors (ASAP), 2021, pp. 258–265.
[3] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit. (ISCA), 2017, pp. 1–12.
[4] T. Norrie et al., "The design process for Google's training chips: TPUv2 and TPUv3," IEEE Micro, vol. 41, no. 2, pp. 56–63, 2021.
[5] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2016, pp. 4013–4021.
[6] T. E. Pogue and N. Nicolici, "Fast inner-product algorithms and architectures for deep neural network accelerators," IEEE Trans. Comput., vol. 73, no. 2, pp. 495–509, 2024.
[7] A. A. Karatsuba and Y. P. Ofman, "Multiplication of many-digital numbers by automatic computers," in Proc. Doklady Akademii Nauk, vol. 145, no. 2. Russian Academy of Sciences, 1962, pp. 293–294.
[8] R. Jain and N. Pandey, "Approximate Karatsuba multiplier for error-resilient applications," AEU - International Journal of Electronics and Communications, vol. 130, p. 153579, 2021.
[9] R. Jain et al., "Booth-encoded Karatsuba: A novel hardware-efficient multiplier," Advances in Electrical and Electronic Engineering, vol. 19, no. 3, pp. 272–281, 2021.
[10] N. Jouppi et al., "TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings," in Proc. 50th Annu. Int. Symp. Comput. Archit. (ISCA), 2023, pp. 1–14.
[11] C. Zhang et al., "Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 11, pp. 2072–2085, 2019.
[12] K. Li et al., "A precision-scalable energy-efficient bit-split-and-combination vector systolic accelerator for NAS-optimized DNNs on edge," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), 2022, pp. 730–735.
[13] W. Li et al., "Low-complexity precision-scalable multiply-accumulate unit architectures for deep neural network accelerators," IEEE Trans. Circuits Syst. II, Exp. Briefs, 2022.
[14] Y. Umuroglu et al., "Bismo: A scalable bit-serial matrix multiplication overlay for reconfigurable computing," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), 2018, pp. 307–3077.
[15] T. E. Pogue and N. Nicolici, "FFIP accelerator implementation," 2023. [Online]. Available: https://github.com/trevorpogue/algebraic-nnhw
[16] V. Lakshmi et al., "A novel in-memory Wallace tree multiplier architecture using majority logic," IEEE Trans. Circuits Syst. I, vol. 69, no. 3, pp. 1148–1158, 2022.
[17] K. Guo et al., "[DL] A survey of FPGA-based neural network inference accelerators," ACM Trans. Reconfigurable Technol. Syst., vol. 12, no. 1, pp. 1–26, 2019.
[18] K. Pekmestzi, "Multiplexer-based array multipliers," IEEE Trans. Comput., vol. 48, no. 1, pp. 15–23, 1999.
[19] F. Moradi et al., "Ultra low power full adder topologies," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2009, pp. 3158–3161.
[20] N. Kawai et al., "A fully static topologically-compressed 21-transistor flip-flop with 75% power saving," IEEE J. Solid-State Circuits, vol. 49, no. 11, pp. 2526–2533, 2014.
[21] Y. Cai et al., "Ultra-low power 18-transistor fully static contention-free single-phase clocked flip-flop in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 54, no. 2, pp. 550–559, 2019.
[22] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, 2010.
[23] "Intel Arria 10 SoC development kit." [Online]. Available: https://www.intel.ca/content/www/ca/en/products/details/fpga/development-kits/arria/10-sx.html
[24] K. He et al., "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2016, pp. 770–778.
[25] S. Liu et al., "Toward full-stack acceleration of deep convolutional neural networks on FPGAs," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 3974–3987, 2022.
[26] H. Fan et al., "FPGA-based acceleration for Bayesian convolutional neural networks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 12, pp. 5343–5356, 2022.
[27] J. An et al., "An OpenCL-based FPGA accelerator for Faster R-CNN," Entropy, vol. 24, no. 10, p. 1346, 2022.
[28] "Intel Arria 10 native fixed point DSP IP core user guide," 2017. [Online]. Available: https://www.intel.com/content/www/us/en/docs/programmable/683583/current/intel-arria-native-fixed-point-dsp-ip.html
[29] M. Langhammer et al., "Extracting INT8 multipliers from INT18 multipliers," in Proc. IEEE Int. Symp. Field-Programmable Gate Arrays, 2019, pp. 114–120.
[30] A. Vaswani et al., "Attention is all you need," in Proc. Annu. Conf. Neural Inf. Process. Syst., 2017, pp. 5998–6008.

Trevor E. Pogue received the B.Eng. degree in Electrical Engineering and the M.A.Sc. degree in Electrical and Computer Engineering from McMaster University, Hamilton, Canada, in 2016 and 2019, respectively. He worked as an intern at Synopsys and AMD in 2018 and 2022-2023, respectively. He is currently a Ph.D. Candidate in the Department of Electrical and Computer Engineering at McMaster University, Hamilton, Canada. His research interests are in the area of hardware acceleration.

Nicola Nicolici (S'99-M'00-SM'11) received the Dipl.Ing. degree in Computer Engineering from the "Politehnica" University of Timisoara, Romania, in 1997 and the Ph.D. degree in Electronics and Computer Science from the University of Southampton, U.K., in 2000. He is currently a Professor with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, Canada. His research interests are in the area of computer-aided design and test. He has authored a number of papers in this area. Dr. Nicolici was the recipient of the IEEE TTTC Beausang Award for the Best Student Paper at the International Test Conference in 2000 and the Best Paper Award at the IEEE/ACM Design Automation and Test in Europe Conference in 2004.
