Accelerating 128-Bit Floating-Point Matrix Multiplication on FPGAs
Fumiya Kono∗§, Naohito Nakasato†, Maho Nakata‡
∗ Shizuoka Institute of Science and Technology, Fukuroi, Shizuoka, JAPAN
† The University of Aizu, Aizuwakamatsu, Fukushima, JAPAN
‡ Cluster for Pioneering Research, RIKEN, Wako, Saitama, JAPAN
§ [email protected]
Abstract—General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its [...] On the other hand, operations with higher precision, such as [...]
[...] with sizes m × k and k × n, respectively. Then, an element of the resulting matrix C′ = AB is computed by the summation as follows:

C′_ij = Σ_{p=0}^{k−1} A_ip × B_pj ,    (2)

where i, j, and p are indices ranging 0 ≤ i < m, 0 ≤ j < n, and 0 ≤ p < k, respectively. The calculation of the whole matrix C′ involves a 3-level nested loop.
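For reference, the 3-level nested loop of Eq. (2) can be written on the host in C using GCC's __float128 (binary128) type as follows. This is only a plain software sketch for checking results, not the FPGA kernel; the function name and the row-major layout are our own assumptions.

```c
#include <quadmath.h>   /* GCC/libquadmath: fmaq() for binary128 */
#include <stddef.h>

/* Reference evaluation of Eq. (2): C'[i][j] = sum_p A[i][p] * B[p][j].
 * Matrices are stored in row-major order (an assumption for this sketch). */
static void gemm_binary128_ref(size_t m, size_t n, size_t k,
                               const __float128 *A,   /* m x k */
                               const __float128 *B,   /* k x n */
                               __float128 *C)         /* m x n, overwritten */
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            __float128 acc = 0.0Q;
            for (size_t p = 0; p < k; ++p) {
                /* fused multiply-add in binary128 */
                acc = fmaq(A[i * k + p], B[p * n + j], acc);
            }
            C[i * n + j] = acc;
        }
    }
}
```

With GCC, such code is compiled and linked against libquadmath (e.g., gcc -O2 file.c -lquadmath).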
Fig. 1. Systolic Array Design for the GEMM operation

Fig. 1 illustrates the design of a systolic array for our binary128 GEMM design derived from FBLAS [9]. This design is characterized by a 2-D array of processing elements (PE) aligned PC × PR. Each PE calculates Eq. (2) for assigned sub-matrices of A and B. The size of the sub-matrices A and B and the value of PC × PR determine how the input matrices are partitioned.

In the computation flow, the input matrices A and B are read from main memory via the Read module and sent to the PEs through the Feed module. A is sent by column, and B is sent by row, assuming that both matrices are not transposed. They are first received by PEs with IDs (PR − 1, 0) or (0, PC − 1) and forwarded to the adjacent PEs in the systolic array on each clock cycle. Each PE accumulates the result of a multiply-add operation for the same element in C′ and sends it to the Drain module, which is eventually collected by the Store module to be written back to the main memory.

More specifically, FBLAS is a generator of OpenCL kernels for the systolic array. The generated systolic array consists of four OpenCL kernels: two kernels that combine the Read and Feed modules for A and B, one Store kernel for C, and a main kernel for the array of PEs and the Drain module. The main kernel explicitly calls a function for one PE in a loop. By fully unrolling the loop, the main kernel defines the systolic array. Because the computation task of a PE is just a multiply-add operation, we can replace the multiply-add operation in the original design with any multiply-add unit for a desired FP format. This enables us to create a systolic array design corresponding to the designated precision.
In addition to replacing the multiply-add operation, we modify and extend the other three kernels for the Read, Feed, and Store modules to support a wider memory bus for binary128 arithmetic. We also extend the original kernels to optimize load and store operations from DRAM. The Read and Feed kernels are equipped with a memory buffer in front of the Feed module. In the original design, the memory buffer is called a memory tile and explicitly instantiated as a 1-D array. The memory tile acts as a cache memory to store a sub-matrix of A and reuse the sub-matrix many times. The exploitation of the memory tile reduces the pressure on the memory bandwidth of DRAM and improves the performance of our binary128 GEMM designs, as shown in a later section.

The number of PEs in the present systolic array is PR × PC. We instantiate PR × PC binary128 multiply-add units. The additional computations in the definition of the GEMM, as shown in Eq. (1), require two scalar-matrix multiplications and one matrix addition, which are very costly in a GEMM design on an FPGA. In the present systolic array, we would need additional PC multiply units for αA, a load unit for C, PC multiply units for βC, and PC add units for the summation of αA and βC. Except for the multiply units for αA, which can be merged with the Feed module, the other units are only activated in the final stage of the GEMM operation at the Store module. Therefore, in this research, we only calculate Eq. (2) on an FPGA, while the host CPU handles the transpose operations and other additional operations involving α and β. To support those additional operations, we develop an API that is compatible with the standard Rgemm provided by MPLAPACK. It enables us to use our binary128 GEMM designs immediately in numerical applications with minimal changes.
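The paper does not reproduce this API, so the following C sketch only illustrates the division of labor described above: a wrapper with the standard 13 GEMM arguments lets the device evaluate Eq. (2) while the host applies α and β. The names rgemm_fpga and fpga_gemm_ab, and the packed column-major layout, are hypothetical placeholders, not the actual MPLAPACK or OpenCL host interfaces.

```c
#include <stdlib.h>

/* Stand-in for the OpenCL host code that runs Eq. (2), P = A*B, on the FPGA
 * (column-major, packed). Here it is a plain CPU loop so the sketch compiles. */
static void fpga_gemm_ab(long m, long n, long k,
                         const __float128 *A, const __float128 *B, __float128 *P)
{
    for (long j = 0; j < n; ++j)
        for (long i = 0; i < m; ++i) {
            __float128 acc = 0.0Q;
            for (long p = 0; p < k; ++p)
                acc += A[i + p * m] * B[p + j * k];
            P[i + j * m] = acc;
        }
}

/* Hypothetical Rgemm-compatible wrapper (13 standard GEMM arguments).
 * Transposition is not handled in this sketch; matrices are assumed packed. */
void rgemm_fpga(const char *transa, const char *transb,
                long m, long n, long k,
                __float128 alpha, const __float128 *A, long lda,
                const __float128 *B, long ldb,
                __float128 beta, __float128 *C, long ldc)
{
    (void)transa; (void)transb; (void)lda; (void)ldb;
    __float128 *P = malloc((size_t)m * (size_t)n * sizeof *P);
    if (!P) return;
    fpga_gemm_ab(m, n, k, A, B, P);                 /* Eq. (2) on the device   */
    for (long j = 0; j < n; ++j)                    /* host: C = alpha*P + beta*C */
        for (long i = 0; i < m; ++i)
            C[i + j * ldc] = alpha * P[i + j * m] + beta * C[i + j * ldc];
    free(P);
}
```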
B. Performance Models

Here, we summarize the performance models for our binary128 GEMM design. In this section, f represents the clock frequency of the logic circuit design in MHz.

1) Performance of GEMM: The peak performance of the designs depends on the layout of the systolic arrays, as shown in Fig. 1. When we use PR × PC PEs, the peak performance Fpeak (GFlops) is given by Eq. (3).
Fpeak = (2 × PR × PC × f × 10^6) / 10^9    (3)

The measured performance Fperf of the designs in GFlops is calculated by Eq. (4), where Texec is the execution time in seconds.

Fperf = (2mnk) / (Texec × 10^9)    (4)

In Eq. (2), m, n, and k denote the matrix size parameters. For the multiplication of n × n square matrices, the number of FP operations is 2n^3.

2) Memory Bandwidth Requirement: The performance of the designs is also affected by the memory bandwidth of an FPGA board. A PR × PC systolic array takes PR + PC inputs conveyed by the two vertical and horizontal Feed pipelines at every cycle. Thus, the required memory bandwidth Breq (GB/s) is given by Eq. (5).

Breq = ((PR + PC) × f × 10^6 × NByte) / 10^9    (5)

NByte represents the word size, established as 16 bytes in the present work. If the systolic array consists of 8 × 8 PEs, Breq equals 256f × 10^−3 GB/s. For example, the requirement Breq becomes 51.2 GB/s for a design where the clock frequency f is 200 MHz. To fully utilize all PEs in the designs, Breq must be smaller than the memory bandwidth of a target FPGA board.
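As a quick cross-check of Eqs. (3) and (5), the short C program below (our own helper, not part of the paper's code base) evaluates both models; for an 8 × 8 array at f = 200 MHz with NByte = 16 it reproduces the 51.2 GB/s requirement quoted above together with a 25.6 GFlops peak.

```c
#include <stdio.h>

/* Eq. (3): peak performance in GFlops for a PR x PC array at f MHz. */
static double fpeak_gflops(int pr, int pc, double f_mhz) {
    return 2.0 * pr * pc * f_mhz * 1e6 / 1e9;
}

/* Eq. (5): required memory bandwidth in GB/s; nbyte = 16 for binary128. */
static double breq_gbs(int pr, int pc, double f_mhz, int nbyte) {
    return (pr + pc) * f_mhz * 1e6 * nbyte / 1e9;
}

int main(void) {
    int pr = 8, pc = 8;
    double f = 200.0;                                            /* MHz */
    printf("Fpeak = %.1f GFlops\n", fpeak_gflops(pr, pc, f));    /* 25.6 */
    printf("Breq  = %.1f GB/s\n",  breq_gbs(pr, pc, f, 16));     /* 51.2 */
    return 0;
}
```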
A. Benchmarking Conditions
1) Target FPGA Systems: Table I shows the specification of
FPGAs used in this benchmarking: Terasic DE5a-Net Arria10,
Nallatech (BittWare) 520N Stratix10, and Terasic DE10a-Net
Agilex. The Stratix10 FPGA is a computation node of Cygnus,
a supercomputer system operating at the University of Tsukuba
in Japan since 2019. We use the Intel FPGA SDK for OpenCL to design and implement our binary128 GEMM designs. A different host system hosts each FPGA, as specified in the bottom rows of Table I.
2) Evaluation Method: We first evaluate our binary128 GEMM designs for square matrices by scaling n. Also, we evaluate the performance of multiplying non-square matrices with sizes m × k and k × n as more realistic and practical evaluations. To calculate the performance in GFlops, Eqs. (3) and (4) are used. The computation time Texec in Eq. (4) is the average of three trials in each benchmarking. As a target of comparison, we use a baseline of the Rgemm executed on the host system of Agilex (i9-10900 CPU) with 20 threads by OpenMP parallelization.

Besides, we compare numerical accuracy with the Rgemm routine provided by MPLAPACK on a CPU. As shown in Eq. (6), we calculate the average L1 norm of the difference between two n × n matrices as EL1 throughout the evaluation.

EL1 = ( Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} | C^F_ij − C^R_ij | ) / n^2    (6)

In Eq. (6), C^F and C^R denote the result matrices by our implementation for FPGAs and Rgemm, respectively. EL1 allows us to determine how accurately our binary128 GEMM designs match the results of the reference implementation.
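As a concrete illustration of Eq. (6), the following C sketch (our own helper, not code from the paper) accumulates the element-wise L1 difference in binary128 and prints it with libquadmath.

```c
#include <quadmath.h>
#include <stdio.h>
#include <stddef.h>

/* Eq. (6): average L1 norm of the element-wise difference between the FPGA
 * result cf and the reference Rgemm result cr, both n x n. */
static __float128 el1(size_t n, const __float128 *cf, const __float128 *cr)
{
    __float128 sum = 0.0Q;
    for (size_t i = 0; i < n * n; ++i)
        sum += fabsq(cf[i] - cr[i]);            /* |C^F_ij - C^R_ij| */
    return sum / ((__float128)n * (__float128)n);
}

static void print_el1(size_t n, const __float128 *cf, const __float128 *cr)
{
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.3Qe", el1(n, cf, cr));
    printf("E_L1 = %s\n", buf);
}
```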
TABLE I
SPECIFICATION OF FPGA SYSTEMS IN OUR PERFORMANCE EVALUATION
FPGA Arria10 Stratix10 Agilex
Logic cells 427,200 933,120 487,200
DSP blocks 1,518 5,760 4,510
M20K RAM blocks 2,713 11,721 7,110
Memory bits (total) 55,562,240 240,046,080 145,612,800
Board Memory 2× DDR3-1066 8GB 4× DDR4-2400 8GB 4× DDR4-2666 8GB
Board Memory Bandwidth 34.2GB/s 76.8GB/s 85.2GB/s
PCIe Gen2 x8 Gen3 x8 Gen3 x16
Quartus ver. 19.1 20.4 21.1
CPU i7-2600K Xeon Gold 6226 i9-10900
Ncore /Nthread 4/8 16/32 10/20
Host memory size 16GB 192 GB 64 GB
Host OS Ubuntu 18.04.6LTS CentOS 7.9.2009 Ubuntu 20.04.5LTS
gcc ver. 8.4.0 7.4.0 9.4.0
TABLE II
SYNTHESIS RESULTS OF EACH PR × PC DESIGN ON ARRIA10 FPGA

TABLE III
SYNTHESIS RESULTS OF EACH PR × PC DESIGN ON STRATIX10 AND AGILEX FPGAS
To highlight the main characteristics of computational performance, we begin by evaluating the designs on the Arria10 FPGA in this section. The following section covers the performance evaluation of the designs on newer FPGAs, including Stratix10 and Agilex.

B. Benchmarking Results on Arria10

1) Evaluation for Square Matrices: We present benchmarking results for our binary128 GEMM designs. The systolic array consists of PEs arranged in a square with PR = PC = 2, 4, and 8. Table II shows the logic synthesis results on the Arria10 FPGA system.

Our binary128 GEMM design requires more DSP blocks for larger PE arrays. Therefore, the number of available DSP blocks is the primary constraint for the design. The row labeled Fmax shows the clock frequency of each design, and the corresponding peak performance Fpeak, based on Eq. (3), is shown in the last row.

Fig. 2. Performance of our binary128 GEMM designs for square matrices on Arria10 FPGA

Fig. 2 shows the performance of each design on Arria10. The matrix size n ranges from 64 to 4096. The performance Fperf of the designs with 2×2, 4×4, and 8×8 PEs reaches a maximum of 1.88, 7.1, and 15.0 GFlops, respectively. Since each PE can work independently for data streaming and operations on the systolic array, the performance is proportional to the number of PEs in the design.

However, with a small n, the computation load for each PE is not sufficiently high to reach the maximum performance of the designs. The performance reaches its peak at a specific n, such as n = 2048 for 8×8 PEs, and the performance scaling becomes flat at larger n.

We then evaluate the numerical error EL1 of the computation results between our binary128 GEMM designs and the Rgemm routine based on Eq. (6). EL1 for n < 512 is distributed between 10^−31 and 10^−30. As we set n to 4096, EL1 increases to 2.0 × 10^−28. The layout of PEs does not make a significant difference in EL1.

Regarding the comparison between Fperf and Fpeak, the ratio for the designs of 2 × 2, 4 × 4, and 8 × 8 PEs is 99.5%, 97.3%, and 58.2%, respectively. Recall that the memory bandwidth requirement Breq is given by Eq. (5). As we substitute the Fmax of each design in Fig. 2 for f in Eq. (5), we find Breq to be 15.1 GB/s, 29.2 GB/s, and 51.5 GB/s for 2×2, 4×4, and 8×8, respectively. Our Arria10 system has two DDR3 memories that provide 34.2 GB/s of total bandwidth. This is sufficient for the designs of 2 × 2 and 4 × 4 PEs. As a result, their Fperf is close to the peak. However, the design of 8 × 8 PEs requires 51.5 GB/s, which is 1.5x larger than the available bandwidth. Therefore, the design of 8 × 8 PEs is limited by memory transfer from DRAM. As a result, we see that the ratio between Fperf and Fpeak is much lower than that of the other designs with fewer PEs.

2) Effects of Memory Buffer for the Systolic Array: To enhance performance, we instantiate more PEs in our binary128 GEMM design. However, the memory bandwidth of the FPGA board poses a limitation. Therefore, the systolic array generated by FBLAS has a module called memory tile in front of the Feed module. It is a local memory buffer working as a cache memory for each PE to mitigate the memory bandwidth requirement given in Eq. (5). As the systolic array incorporates a larger number of PEs, increasing the size of MTile is necessary to provide a larger buffer in our binary128 GEMM designs.

The results presented in Sec. IV-B1 were all obtained by the designs with MTile = 32. We then conduct additional benchmarking to further investigate the potential performance improvement of adopting a larger value of MTile. Fig. 3 illustrates the performance of the GEMM using the designs of 4 × 4 and 8 × 8 PEs, where MTile ranges from 24 to 256.
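The role of the memory tile can be illustrated with a host-side C model of the same reuse pattern. This is only a conceptual sketch under the assumption of row-major square matrices, not FBLAS's on-chip implementation: each block of mtile rows of A is staged into a small buffer once and then reused for every column of B, so the traffic for A shrinks roughly by the reuse factor.

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* Conceptual model of memory-tile reuse (not FBLAS's on-chip code). */
static void gemm_row_tiled(size_t n, size_t mtile,
                           const __float128 *A,   /* n x n, row-major */
                           const __float128 *B,   /* n x n, row-major */
                           __float128 *C)         /* n x n, row-major */
{
    __float128 *tile = malloc(mtile * n * sizeof *tile);
    if (!tile) return;
    for (size_t i0 = 0; i0 < n; i0 += mtile) {
        size_t rows = (i0 + mtile <= n) ? mtile : n - i0;
        /* Stage a tile of A once ... */
        memcpy(tile, &A[i0 * n], rows * n * sizeof *tile);
        /* ... and reuse it for every column of B. */
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < rows; ++i) {
                __float128 acc = 0.0Q;
                for (size_t p = 0; p < n; ++p)
                    acc += tile[i * n + p] * B[p * n + j];
                C[(i0 + i) * n + j] = acc;
            }
    }
    free(tile);
}
```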
However, compared to the design of 8×8 PEs, its performance improvement is sluggish because the Fmax of the 8 × 16 PEs significantly dropped and led to a low Fpeak of the design.

As we examine the performance of the designs on Agilex, the optimization of the PE layout and MTile successfully contributed to performance improvement. While the design of 8 × 8 PEs with MTile = 128 certainly performs effectively, that of 8 × 16 PEs with MTile = 512 is much better. The computation by the 8 × 16 PEs achieved 90.9 GFlops, 91% of the peak, for the largest matrix size of n = 24576, in contrast to the 8 × 8 PEs yielding 50.4 GFlops at n = 18000, about 96% of its peak.

The importance of the size of MTile can be easily understood by comparing it with a reference plot for the design of 8 × 16 PEs with MTile = 128 on Agilex. If we set MTile = 128, the performance of the design is at most 77 GFlops, which is only 77% of the peak. In particular, a trench in the plot at n = 16384 results in a significant performance drop to 54.1 GFlops around that point. One reason may be that those specific large matrices accidentally cause accesses that stride over different memory banks on the four independent DIMMs of the Agilex FPGA board. However, the memory buffer exploited by the larger MTile (e.g., 512) helps to alleviate problems related to unexpected memory access patterns and facilitates steady performance improvement.

Finally, our binary128 GEMM design achieves very high performance compared to the Rgemm routine executed on the CPU with 20 threads, whose performance settles at 650 MFlops for n > 1024. Therefore, we have a significant advantage in processing large matrices. The design of 8 × 16 PEs with MTile = 512 on Agilex is 145x faster than the computation on a recent CPU with the maximum number of threads.

In addition, we show the performance of our binary128 GEMM designs for non-square matrices on the Stratix10 and Agilex FPGAs. Fig. 6 shows the benchmarking result when m and n are fixed to m = n = 16384, and k is scaled between 32 and 16384. As presented in the benchmarking on Arria10, the performance drop for ratios of n : k < 2 : 1 is not significant. However, for tall-skinny matrices where k is particularly small, like k ≤ 128, even the performance on Agilex is just a few GFlops. As a result, the advantage of our binary128 GEMM designs compared to computation on CPUs is lost.

V. APPLICATION OF BINARY128 MATRIX MULTIPLICATION

Once we have our binary128 GEMM designs based on the systolic array architecture, we can accelerate practical applications which require binary128 GEMM operations. We here describe two applications of our implementation with performance evaluation. In this section, R^{n×n} denotes n × n real matrices.

A. Blocked LU Decomposition

1) Problem Specification of LU Decomposition: The LU decomposition is a fundamental operation in numerical analysis that factorizes a given square matrix A as a product of lower and upper triangular matrices, A = LU, where L and U are lower and upper triangular matrices, respectively. Based on BLAS routines, the LU decomposition in binary64 precision is implemented as a routine called dgetrf in LAPACK. The dgetrf routine adopts a blocked LU decomposition algorithm that has been thoroughly investigated and implemented for every supercomputer over the last four decades; a variation of it is the most famous parallel benchmarking program, LINPACK. The blocked LU decomposition algorithm effectively solves dense linear equations on accelerator architectures like GPUs since its computation is mainly processed as GEMM operations.

Let us consider the LU decomposition for a matrix A ∈ R^{n×n} with the block size b, as shown in Fig. 7. Then, we obtain L and U on A by repeating the following procedure recursively (see the code sketch below):

1) Divide A into 4 sub-matrices: A11 ∈ R^{b×b}, A12 ∈ R^{b×(n−b)}, A21 ∈ R^{(n−b)×b}, and A22 ∈ R^{(n−b)×(n−b)}.
2) Perform the decomposition A11 = L11 U11.
3) Solve U12 that satisfies L11 U12 = A12.
4) Solve L21 that satisfies L21 U11 = A21.
5) Update A22 by A22 = A22 − L21 U12.
6) If n − b > 0 still holds, go back to step 1 after substituting A with A22.
Fig. 7. Blocked LU decomposition of a matrix A where the block size is b

Fig. 8. Performance of LU decomposition on Stratix10 and Agilex FPGAs
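A minimal host-side sketch of steps 1)–6) in C is given below (row-major storage, no pivoting). In the actual implementation, steps 2–4 correspond to MPLAPACK routines and the step-5 update is the single GEMM call that is offloaded to the FPGA; here all steps are written as plain __float128 loops so that the sketch is self-contained.

```c
#include <stddef.h>

/* Blocked LU, in place on A (leading dimension lda), following steps 1)-6). */
static void lu_blocked(size_t n, size_t b, __float128 *A, size_t lda)
{
    for (size_t s = 0; s < n; s += b) {
        size_t nb = (s + b <= n) ? b : n - s;   /* size of A11              */
        size_t r  = n - s - nb;                 /* trailing dimension (A22) */

        /* Step 2: unblocked LU of A11 (L11 unit-lower, U11 upper, in place). */
        for (size_t kk = 0; kk < nb; ++kk)
            for (size_t i = kk + 1; i < nb; ++i) {
                __float128 l = A[(s+i)*lda + s+kk] /= A[(s+kk)*lda + s+kk];
                for (size_t j = kk + 1; j < nb; ++j)
                    A[(s+i)*lda + s+j] -= l * A[(s+kk)*lda + s+j];
            }
        if (r == 0) break;

        /* Step 3: solve L11 * U12 = A12 (forward substitution per column). */
        for (size_t j = 0; j < r; ++j)
            for (size_t i = 1; i < nb; ++i)
                for (size_t kk = 0; kk < i; ++kk)
                    A[(s+i)*lda + s+nb+j] -= A[(s+i)*lda + s+kk] * A[(s+kk)*lda + s+nb+j];

        /* Step 4: solve L21 * U11 = A21 (per row of A21). */
        for (size_t i = 0; i < r; ++i)
            for (size_t kk = 0; kk < nb; ++kk) {
                for (size_t p = 0; p < kk; ++p)
                    A[(s+nb+i)*lda + s+kk] -= A[(s+nb+i)*lda + s+p] * A[(s+p)*lda + s+kk];
                A[(s+nb+i)*lda + s+kk] /= A[(s+kk)*lda + s+kk];
            }

        /* Step 5: A22 -= L21 * U12 -- the GEMM offloaded to the FPGA. */
        for (size_t i = 0; i < r; ++i)
            for (size_t j = 0; j < r; ++j) {
                __float128 acc = A[(s+nb+i)*lda + s+nb+j];
                for (size_t p = 0; p < nb; ++p)
                    acc -= A[(s+nb+i)*lda + s+p] * A[(s+p)*lda + s+nb+j];
                A[(s+nb+i)*lda + s+nb+j] = acc;
            }
        /* Step 6: the outer loop continues with A22 as the new A. */
    }
}
```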
In step 5, we have the matrix multiplication L21 U12. When b = 1, the blocked LU decomposition is reduced to a non-blocked routine called dgetrf2 in LAPACK. When b is large enough, the computation of dgetrf is dominated by GEMM operations in step 5. Accordingly, it can be accelerated by GEMM routines on GPUs or FPGAs.

In MPLAPACK [11], all BLAS and LAPACK routines are extended to support multi-precision FP operations, including binary128. We modify an extended version of dgetrf in MPLAPACK called Rgetrf, which calls the Rgemm routine. In this paper, we replace calls to Rgemm with our binary128 GEMM operations executed on FPGAs.

The number of FP operations in the LU decomposition algorithm is 2n^3/3 − n^2/2 + 5n/6 [27]. Here, we regard it as 2n^3/3. Therefore, F′perf as given in Eq. (7) is used as the computation performance for the following evaluation.

F′perf = (2n^3) / (3 × Texec × 10^9)    (7)

2) Evaluation of GEMM for LU Decomposition: We assume input n × n matrices whose elements are given by random numbers in the range [0.0, 1.0). Then, the input matrices can be factorized by the LU decomposition. We decompose the square matrices by applying our binary128 GEMM designs in the algorithm.

Based on the evaluation in the previous section, we measure the performance of blocked LU decomposition with the design of 8×16 PEs on the Agilex FPGA. We scale the size of matrices n and apply different block sizes b to find the optimal size of b. As a comparison, we present a result on the design of 8 × 16 PEs on Stratix10 where b = 128. We also give another comparison with a result obtained through computation using only the host CPU (Intel Core i9-10900). In that computation, the Rgetrf routine in MPLAPACK takes charge of the LU decomposition with 20 threads by OpenMP parallelization.

Fig. 8 summarizes our results of the LU decomposition. For the Agilex FPGA, we present the performance in each case of b = 108, 128, and 144. The black line shows the performance scaling obtained by the computation on the CPU.

We observe that b = 108 yields the best performance on the Agilex FPGA, as represented by 2.5 GFlops at n = 20000. However, with a large matrix of n = 24576, a higher b yields the peak. We can see in the figure that the highest performance is 2.6 GFlops, obtained with b = 144 for the matrix of n = 24576. On the other hand, the performance deteriorates when we apply even larger values of b such as b = 192 and 256, yielding 2.3 GFlops and 2.1 GFlops, respectively. Similarly, the design on the Stratix10 FPGA is superior to the CPU computation for n > 3000. Although it is slower than the computation on the Agilex FPGA, it finally reaches 2.2 GFlops at n = 20000, which is 4.7x faster than that of the CPU.

Since the performance on FPGAs improves slowly with scaling n until the computation data saturate every PE, the performance on the CPU for small n is superior to that of the FPGAs. When the matrix size n = 512, the smallest size in this evaluation, the performance on the CPU is 278 MFlops, which is 2 to 3x faster than that of the FPGAs. We see that the intersection of the performance scaling between the CPU and the FPGAs is around n = 1536. The performance of the CPU execution does not improve for n > 2000, and is 458 MFlops at n = 24576. In contrast, the performance of the LU decomposition using our binary128 GEMM designs on the Agilex FPGA is at a maximum 5.3x faster than that of the CPU.

We compare the decomposed matrices L and U calculated by the designs on FPGAs with the reference result calculated by the CPU using Eq. (6). In the case of n ≤ 1536, where the CPU computation is still faster than the FPGAs, we find EL1 ∼ 10^−31. On the other hand, as we test the matrix of n = 24576, we find EL1 ∼ 10^−28. This consequence is the same as we expected, considering the previous evaluation of our binary128 GEMM design.

Finally, we compare our results with those of previous work by Kouya [16], who presented optimizations of LU decomposition using DD arithmetic. Specifically, they applied memory blocking and vectorization using AVX2 instructions and evaluated the performance on an Intel Core i9-10900X CPU.
According to their benchmarking for n = 1024, the performance of a conventional blocked LU decomposition code with b = 64 was 132 MFlops. Similarly, the performance of a vectorized LU decomposition code with b = 32 was 363 MFlops. In contrast, our result with the design of 8 × 16 PEs achieved 324.5 MFlops for n = 1024 and b = 108 on an Agilex FPGA. Even the fastest design on the high-end FPGA is not significantly beneficial for small matrices. As a result, from a performance perspective for small matrices, our binary128 GEMM designs are inferior to the vectorized LU decomposition code on a CPU.

However, we emphasize that our designs on recent FPGAs are much more effective for large n. With the current best performance of our LU decomposition being 2.5 GFlops, our FPGA designs are superior for large matrices. It is also worth noting that our work and the work by Kouya [16] use different FP formats. DD arithmetic is well suited for recent high-end CPUs equipped with vector arithmetic units, such as the AVX2 and AVX512 instructions on the x86-64 ISA and the Neon and SVE instructions on the ARM ISA.

B. Semidefinite Programming (SDP)

SDP is an optimization problem to minimize or maximize a given linear function under the constraint of symmetric semidefinite matrices. It has vast possible applications in engineering [28], finance [29], quantum chemistry [30], and physics [31], which have been investigated for a long time.

SDPA [32] is a numerical implementation and software package for SDP written in C++ [33]. The algorithm used in the SDPA is called the PDIPM, one of the iteration methods for SDP. Previous research [5] has extended the SDPA to support various-precision FP operations such as SDPA-GMP, -DD, and -QD [5]. The GMP version uses arbitrary precision arithmetic; thus, a user must specify the precision beforehand. These extended versions of the SDPA use a part of MPLAPACK [11] as a back-end, mainly through calling the Rgemm routine.

To determine which parameters are utilized in GEMM routines called from the SDPA, we run the 92 problems provided by SDPLIB [34] using SDPA-binary128 with MPLAPACK. As we are currently focusing on accelerating GEMM routines in our work, we have modified the code to record the 13 arguments specified in Listing 1 for the Rgemm routine during the execution of all problems.

Analysis of the collected data reveals that the SDPA frequently calls the Rgemm routine with non-square matrices, and none of the leading dimensions of the matrices in the Rgemm routine equal m, n, or k. Of the over 800 combinations of arguments recorded in the collected data, we find only 50 combinations where the condition n = m = k = lda = ldb = ldc holds. As shown in Sec. IV-B1, the performance of our binary128 GEMM designs on FPGAs for non-square matrices is inferior to that for square matrices.

Based on our analysis, we evaluate the performance of the SDPA calling the Rgemm operation accelerated by an FPGA only when either of two conditions is satisfied: (1) m equals n, or (2) m × n × k is larger than a predefined parameter Nmin = 10^6.

We test different Nmin and find that Nmin = 10^6 to 10^7 is optimal for the SDPA. We only present the performance benchmarking of the SDPA on the Agilex FPGA for selected problems from SDPLIB, shown in Table IV. We present the elapsed time per iteration of the SDPA-binary128 on three systems: CPU-A (Intel Xeon Gold 5122, 4 cores @ 3.60 GHz), CPU-B (Intel i9-10900, 10 cores @ 2.80 GHz), and CPU-B using our binary128 GEMM design of 8×16 PEs on Agilex. The performance with the FPGA is 2 to 4x and roughly 1.5x faster than that of CPU-A and CPU-B, respectively. Note that the performance of SDPA-binary128 on CPUs is proportional to the number of cores on a given CPU.

We verify that each solution computed by our binary128 GEMM design improves upon the solution obtained via double-precision calculations. As illustrated in Table V, we present the relative gaps, primal/dual feasible errors, and the numbers of iterations for problems theta2, theta3, theta4, theta6, and control11 from SDPLIB, as computed on CPU-B using binary128, FPGA (Agilex) using our design, the DD precision version [5], and the double precision version [32]. As smaller errors indicate better results, the solutions obtained via our binary128 GEMM design exhibit an improvement over those obtained via double precision calculations and are of comparable or slightly superior quality to those obtained via DD arithmetic. Our binary128 Rgemm accelerated by FPGAs effectively accelerates the PDIPM for SDP problems.

TABLE IV
ELAPSED TIME PER ITERATION IN SEC. OF THE SDPA ON CPUS AND AGILEX FPGA
Problem CPU-A CPU-B FPGA(Agilex)
theta2 0.8 0.45 0.42
theta3 4.99 2.68 2.11
theta4 21.17 10.24 7.28
theta5 69.35 30.82 20.17
theta6 191.4 79.54 48.3
control11 66.92 38.09 28.51
equalG51 141.04 66.87 33.32
gpp500-1 18.45 8.53 4.47
gpp500-2 18.58 8.53 4.56
maxG11 53.35 25.66 16.39
maxG32 803.42 380.92 232.81
maxG51 108.69 53.34 34.19
mpc500-1 11.49 5.39 3.36
mpc500-4 14.37 7.31 4.9
qpG11 264.96 111.95 64.48
qpG51 480.79 207.78 120.98
thetaG11 82.11 41.7 28.55
thetaG51 853.03 387.67 248.86

TABLE V
THE RELATIVE GAPS, PRIMAL/DUAL FEASIBLE ERRORS, AND THE NUMBER OF ITERATIONS FOR CERTAIN PROBLEMS FROM SDPLIB, CALCULATED ON CPU-B USING BINARY128, FPGA (AGILEX, BINARY128), THE DOUBLE-DOUBLE PRECISION VERSION (DD), AND THE DOUBLE PRECISION VERSION
Problem CPU-B FPGA DD double
theta2
relative gap 1.05e-24 1.16e-24 2.68e-25 1.45e-08
p.feas.error 3.70e-32 7.70e-33 4.93e-31 3.55e-15
d.feas.error 2.14e-25 2.89e-25 4.51e-27 5.77e-15
# of iterations 58 62 51 17
theta3
relative gap 5.71e-25 5.35e-24 1.86e-23 1.57e-08
p.feas.error 1.08e-32 1.23e-32 9.86e-31 8.88e-15
d.feas.error 2.43e-26 6.40e-25 1.42e-24 4.00e-15
# of iterations 55 50 61 17
theta4
relative gap 5.15e-25 5.89e-25 6.18e-27 2.25e-08
p.feas.error 2.62e-32 1.85e-32 7.89e-31 7.11e-15
d.feas.error 3.51e-26 9.35e-26 5.06e-28 1.47e-14
# of iterations 71 94 52 18
theta6
relative gap 6.90e-31 1.23e-30 6.28e-25 2.45e-08
p.feas.error 1.39e-32 1.85e-32 8.87e-31 1.42e-14
d.feas.error 5.74e-32 5.80e-32 5.19e-26 5.04e-14
# of iterations 45 46 54 18
control11
relative gap 9.01e-25 1.87e-23 2.24e-22 8.26e-06
p.feas.error 1.62e-27 2.02e-27 2.41e-25 1.86e-09
d.feas.error 1.60e-24 4.51e-24 1.50e-22 2.03e-07
# of iterations 64 62 60 47

C. Discussions on Application Performance

The blocked LU decomposition algorithm Rgetrf outlined in Sec. V-A employs the Rgemm operation to compute the step-5 update A22 = A22 − L21 U12, where both L21 and U12 are non-square and skinny. L21 and U12 are matrices of dimensions b × k and k × b, respectively. During the loop from step 2 to step 6, k is reduced as k = n − pb, where p represents the iteration number starting from p = 1. At an initial phase of the algorithm, k is large enough that our binary128 GEMM designs on the Agilex FPGA effectively accelerate the performance of Rgetrf. However, as k becomes much smaller than n at a later phase of the algorithm, the acceleration by the Agilex FPGA becomes ineffective. The blocking size b also impacts the performance of the GEMM on FPGAs. For instance, if b is too small, the performance of Rgemm on FPGAs is significantly reduced, as depicted in Figs. 2 and 5.

On the other hand, the PDIPM frequently calls the Rgemm operation for small non-square matrices with a wide range of combinations of matrix sizes n, k, and m. The largest matrix size in all problems presented in Table IV is only n = k = m = 2000. With a matrix size of n = k = m = 2000, the performance of Rgemm on FPGAs is half the peak performance. In most cases, the algorithm calls the Rgemm operation for much smaller matrices, in which case it is not executed on the FPGA. In a previous evaluation of a fast GEMM in DD arithmetic on GPUs by Nakata et al. [15], it was shown that the performance of the PDIPM in DD arithmetic accelerated by a GPU is more than 10x faster than that on a CPU with four cores. According to their results, the size of the matrices does not significantly affect the performance of Rgemm on the GPU. Therefore, they always utilized the GPU, except for very small matrices.

Despite the superior performance of our accelerated Rgemm implementation on the Agilex FPGA, which is more than 100x faster than the reference Rgemm on a 10-core CPU, the two applications evaluated in this section are not substantially accelerated by the FPGA. Therefore, to make our binary128 GEMM designs on FPGAs more practical for real-world applications, we will need to extensively modify the systolic array design generated by FBLAS to address the performance degradation for small matrices and non-square matrices. A potential solution is to develop an extended version of Rgemm that incorporates another level of blocking in the host code. Specifically, we could develop a new Rgemm API based on a batched GEMM algorithm [35]. It would allow us to instantiate multiple systolic arrays on an FPGA to handle the batched GEMM algorithm. A hardware implementation of a batched GEMM algorithm focusing on 64-bit and smaller FP formats was reported by Ledoux et al. [36]. Their systolic array design leverages a stalling-free output scheme for the output matrix C to maximize the overlap of host data transfers with GEMM computations.

VI. CONCLUSION

In this paper, we presented our binary128 GEMM implementation, its evaluation on different Intel FPGAs, and its integration into numerical applications such as blocked LU decomposition and SDP. Our GEMM designs on FPGAs are based on the 2-D systolic array generated by the FBLAS library. Furthermore, by optimizing the memory buffer size, which stores reused data in fast on-chip memory, we successfully implemented 8×16 PEs to accelerate the GEMM in binary128 arithmetic on FPGAs.

The benchmarking in this paper showed that our implementation is particularly advantageous when computing large matrices of size n > 10^4. For example, in our evaluation of our binary128 GEMM implementation on the Agilex FPGA, the performance was 90.9 GFlops, 91% of the estimated peak performance of the design. This resulted in a 147x speed-up compared to the Rgemm routine provided by MPLAPACK on an i9-10900 CPU with 20 threads.

Further benchmarking of various matrix multiplications showed that our designs are quite effective at accelerating GEMM operations for square and almost-square matrices. In other words, LU decomposition can be solved faster using our implementation than with existing CPU routines. However, our design was not effective at handling tall-skinny matrices, commonly found when solving semidefinite programming problems.

Our current systolic array designs for GEMM operations are based on the OpenCL kernels generated by the latest version of FBLAS [37]. FBLAS is designed to be flexible and accommodate various kernel configurations for different BLAS routines, such as General Matrix-Vector Multiplication (GEMV) and Triangular Solve with Multiple Right-Hand Sides (TRSM). However, in this study, we extracted only the systolic array kernels of GEMM for our work. Extending our work to other BLAS routines would be an interesting area for future research.

There is still room for optimization to improve the performance of our GEMM design when we use it to calculate tall-skinny matrix multiplications. Further optimizations are necessary to achieve the desired performance, especially for SDP problems. In future work, we will compare such optimized GEMM designs with other high-precision GEMM implementations on accelerators. Another area of future work will be to explore other FP formats in our GEMM designs by replacing the current binary128 multiply-add units with multiply-add units in different arithmetic.

ACKNOWLEDGMENT

A part of this paper is based on results obtained from a project, JPNP16007, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work was partly supported by MEXT as "Feasibility studies for the next-generation computing infrastructure" and KAKENHI Grant Number JP23K11133.

This research in part used computational resources of Cygnus provided by the Multidisciplinary Cooperative Research Program in the Center for Computational Sciences, University of Tsukuba.

We thank Prof. Ishikawa, High Energy Accelerator Research Organization, and Prof. Daisaka, Hitotsubashi University, Japan, for their help evaluating our designs on Stratix10.
REFERENCES

[1] "IEEE standard for floating-point arithmetic," IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1–84, 2019.
[2] T. Norrie, N. Patil, D. H. Yoon, G. Kurian, S. Li, J. Laudon, C. Young, N. Jouppi, and D. Patterson, "The design process for Google's training chips: TPUv2 and TPUv3," IEEE Micro, vol. 41, no. 2, pp. 56–63, 2021.
[3] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Review, vol. 38, no. 1, pp. 49–95, 1996. [Online]. Available: https://doi.org/10.1137/1038003
[4] F. Alizadeh, J.-P. A. Haeberly, and M. L. Overton, "Complementarity and nondegeneracy in semidefinite programming," Mathematical Programming, vol. 77, pp. 111–128, 1997. [Online]. Available: https://doi.org/10.1007/BF02614432
[5] M. Nakata, "A numerical evaluation of highly accurate multiple-precision arithmetic version of semidefinite programming solver: SDPA-GMP, -QD and -DD," in 2010 IEEE International Symposium on Computer-Aided Control System Design, 2010, pp. 29–34.
[6] C. Lichtenau, S. Carlough, and S. M. Mueller, "Quad precision floating point on the IBM z13," in 2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH), 2016, pp. 87–94.
[7] K. Nagasu, K. Sano, F. Kono, and N. Nakasato, "FPGA-based tsunami simulation: Performance comparison with GPUs, and roofline model for scalability analysis," Journal of Parallel and Distributed Computing, vol. 106, pp. 153–169, Aug. 2017.
[8] H. Kung and C. Leiserson, Systolic Arrays (for VLSI), ser. CMU-CS. Carnegie-Mellon University, Department of Computer Science, 1978. [Online]. Available: https://books.google.co.jp/books?id=pAKfHAAACAAJ
[9] T. De Matteis, J. de Fine Licht, and T. Hoefler, "FBLAS: Streaming linear algebra on FPGA," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. IEEE Press, 2020.
[10] N. Nakasato, H. Daisaka, and T. Ishikawa, "High performance high-precision floating-point operations on FPGAs using OpenCL," in 2018 International Conference on Field-Programmable Technology (FPT), 2018, pp. 262–265.
[11] M. Nakata, "MPLAPACK version 2.0.1 user manual," 2022.
[12] N. Nakasato, "A fast GEMM implementation on the Cypress GPU," SIGMETRICS Performance Evaluation Review, vol. 38, pp. 50–55, Mar. 2011.
[13] T. Dekker, "A floating-point technique for extending the available precision," Numerische Mathematik, vol. 18, pp. 224–242, 1971.
[14] D. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms. Reading, Massachusetts: Addison Wesley, 1998.
[15] M. Nakata, Y. Takao, S. Noda, and R. Himeno, "A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming," in 2012 Third International Conference on Networking and Computing, 2012, pp. 68–75.
[16] T. Kouya, "Acceleration of LU decomposition supporting double-double, triple-double, and quadruple-double precision floating-point arithmetic with AVX2," in 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), 2021, pp. 54–61.
[17] M. Joldes, J.-M. Muller, V. Popescu, and W. Tucker, "CAMPARY: CUDA multiple precision arithmetic library and applications," in Mathematical Software – ICMS 2016, G.-M. Greuel, T. Koch, P. Paule, and A. Sommese, Eds. Cham: Springer International Publishing, 2016, pp. 232–240.
[18] K. Isupov and V. Knyazkov, "Multiple-precision BLAS library for graphics processing units," in Supercomputing, V. Voevodin and S. Sobolev, Eds. Cham: Springer International Publishing, 2020, pp. 37–49.
[19] I. Flores, "Residue arithmetic and its application to computer technology (Nicholas S. Szabo and Richard I. Tanaka)," SIAM Review, vol. 11, no. 1, pp. 103–104, 1969. [Online]. Available: https://doi.org/10.1137/1011027
[20] T. Nakayama and D. Takahashi, "Implementation of multiple-precision floating-point arithmetic library for GPU computing," in Proceedings of the 23rd IASTED International Conference on Parallel and Distributed Computing and Systems (Dallas, USA, December 2011). ACTA Press, 2011, pp. 343–349.
[21] D. Mukunoki, K. Ozaki, T. Ogita, and T. Imamura, "Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme," in 50th International Conference on Parallel Processing, ser. ICPP 2021. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3472456.3472493
[22] K. Ozaki, T. Ogita, S. Oishi, and S. Rump, "Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications," Numerical Algorithms, vol. 59, no. 1, pp. 95–118, Jan. 2012.
[23] J. de Fine Licht, G. Kwasniewski, and T. Hoefler, "Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis," in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), 2020.
[24] J. de Fine Licht, C. A. Pattison, A. N. Ziogas, D. Simmons-Duffin, and T. Hoefler, "Fast arbitrary precision floating point on FPGA," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2022, pp. 1–9.
[25] L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmermann, "MPFR: A multiple-precision binary floating-point library with correct rounding," ACM Trans. Math. Softw., vol. 33, no. 2, pp. 13–es, Jun. 2007. [Online]. Available: https://doi.org/10.1145/1236463.1236468
[26] "Information technology — programming languages, their environments, and system software interfaces — floating-point extensions for C — part 3: Interchange and extended types," International Organization for Standardization, Geneva, CH, Tech. Rep. ISO/IEC TS 18661-3:2015, 2015.
[27] S. Blackford and J. Dongarra, "LAPACK Working Note 41: Installation guide for LAPACK," 1999.
[28] M. Ohsaki, K. Fujisawa, N. Katoh, and Y. Kanno, "Semi-definite programming for topology optimization of trusses under multiple eigenvalue constraints," Computer Methods in Applied Mechanics and Engineering, vol. 180, pp. 203–217, 1999.
[29] A. Gepp, G. Harris, and B. Vanstone, "Financial applications of semidefinite programming: a review and call for interdisciplinary research," Accounting and Finance, vol. 60, Sep. 2019.
[30] M. Fukuda, B. Braams, M. Nakata, M. Overton, J. Percus, M. Yamashita, and Z. Zhao, "Large-scale semidefinite programs in electronic structure calculation," Math. Program., vol. 109, pp. 553–580, Mar. 2007.
[31] D. Poland, S. Rychkov, and A. Vichi, "The conformal bootstrap: Theory, numerical techniques, and applications," Rev. Mod. Phys., vol. 91, p. 015002, Jan. 2019. [Online]. Available: https://link.aps.org/doi/10.1103/RevModPhys.91.015002
[32] M. Yamashita, K. Fujisawa, M. Fukuda, K. Kobayashi, K. Nakata, and M. Nakata, Latest Developments in the SDPA Family for Solving Large-Scale SDPs. Boston, MA: Springer US, 2012, pp. 687–713. [Online]. Available: https://doi.org/10.1007/978-1-4614-0769-0_24
[33] "SDPA (SemiDefinite Programming Algorithms) official page." [Online]. Available: http://sdpa.sourceforge.net/
[34] "SDPLIB 1.2." [Online]. Available: https://github.com/vsdp/SDPLIB/
[35] A. Haidar, T. Dong, P. Luszczek, S. Tomov, and J. Dongarra, "Optimization for performance and energy for batched matrix computations on GPUs," in Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, ser. GPGPU-8. New York, NY, USA: Association for Computing Machinery, 2015, pp. 59–69. [Online]. Available: https://doi.org/10.1145/2716282.2716288
[36] L. Ledoux and M. Casas, "A generator of numerically-tailored and high-throughput accelerators for batched GEMMs," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2022, pp. 1–10.
[37] "FBLAS." [Online]. Available: https://github.com/spcl/FBLAS/