Accelerating 128-Bit Floating-Point Matrix Multiplication on FPGAs
Fumiya Kono∗§, Naohito Nakasato†, Maho Nakata‡
∗ Shizuoka Institute of Science and Technology, Fukuroi, Shizuoka, JAPAN
† The University of Aizu, Aizuwakamatsu, Fukushima, JAPAN
‡ Cluster for Pioneering Research, RIKEN, Wako, Saitama, JAPAN
§ [email protected]
Abstract—General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its [...] On the other hand, operations with higher precision, such as [...]
[...] with sizes m × k and k × n, respectively. Then, an element of the resulting matrix C′ = AB is computed by the summation as follows:

C′_ij = Σ_{p=0}^{k−1} A_ip × B_pj ,    (2)

where i, j, and p are indices ranging 0 ≤ i < m, 0 ≤ j < n, and 0 ≤ p < k, respectively. The calculation of the whole matrix C′ involves a 3-level nested loop.
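For reference, the 3-level nested loop of Eq. (2) can be written on the host in C using GCC's __float128 (binary128) type as follows. This is only a plain software sketch for checking results, not the FPGA kernel; the function name and the row-major layout are our own assumptions.

```c
#include <quadmath.h>   /* GCC/libquadmath: fmaq() for binary128 */
#include <stddef.h>

/* Reference evaluation of Eq. (2): C'[i][j] = sum_p A[i][p] * B[p][j].
 * Matrices are stored in row-major order (an assumption for this sketch). */
static void gemm_binary128_ref(size_t m, size_t n, size_t k,
                               const __float128 *A,   /* m x k */
                               const __float128 *B,   /* k x n */
                               __float128 *C)         /* m x n, overwritten */
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            __float128 acc = 0.0Q;
            for (size_t p = 0; p < k; ++p) {
                /* fused multiply-add in binary128 */
                acc = fmaq(A[i * k + p], B[p * n + j], acc);
            }
            C[i * n + j] = acc;
        }
    }
}
```

With GCC, such code is compiled and linked against libquadmath (e.g., gcc -O2 file.c -lquadmath).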
Fig. 1. Systolic Array Design for the GEMM operation

Fig. 1 illustrates the design of a systolic array for our binary128 GEMM design derived from FBLAS [9]. This design is characterized by a 2-D array of processing elements (PE) aligned PC × PR. Each PE calculates Eq. (2) for assigned sub-matrices of A and B. The size of the sub-matrices A and B and the value of PC × PR determine how the input matrices are partitioned.

In the computation flow, the input matrices A and B are read from main memory via the Read module and sent to the PEs through the Feed module. A is sent by column, and B is sent by row, assuming that both matrices are not transposed. They are first received by PEs with IDs (PR − 1, 0) or (0, PC − 1) and forwarded to the adjacent PEs in the systolic array on each clock cycle. Each PE accumulates the result of a multiply-add operation for the same element in C′ and sends it to the Drain module, which is eventually collected by the Store module to be written back to the main memory.

More specifically, FBLAS is a generator of OpenCL kernels for the systolic array. The generated systolic array consists of four OpenCL kernels: two kernels that combine the Read and Feed modules for A and B, one Store kernel for C, and a main kernel for the array of PEs and the Drain module. The main kernel explicitly calls a function for one PE in a loop. By fully unrolling the loop, the main kernel defines the systolic array. Because the computation task of a PE is just a multiply-add operation, we can replace the multiply-add operation in the original design with any multiply-add unit for a desired FP format. This enables us to create a systolic array design corresponding to the designated precision.
In addition to replacing the multiply-add operation, we modify and extend the other three kernels for the Read, Feed, and Store modules to support a wider memory bus for binary128 arithmetic. We also extend the original kernels to optimize load and store operations from DRAM. The Read and Feed kernels are equipped with a memory buffer in front of the Feed module. In the original design, the memory buffer is called a memory tile and explicitly instantiated as a 1-D array. The memory tile acts as a cache memory to store a sub-matrix of A and reuse the sub-matrix many times. The exploitation of the memory tile reduces the pressure on the memory bandwidth of DRAM and improves the performance of our binary128 GEMM designs, as shown in a later section.

The number of PEs in the present systolic array is PR × PC. We instantiate PR × PC binary128 multiply-add units. The additional computations in the definition of the GEMM, as shown in Eq. (1), require two scalar-matrix multiplications and one matrix addition, which are very costly in a GEMM design on an FPGA. In the present systolic array, we would need additional PC multiply units for αA, a load unit for C, PC multiply units for βC, and PC add units for the summation of αA and βC. Except for the multiply units for αA, which can be merged with the Feed module, the other units are only activated in the final stage of the GEMM operation at the Store module. Therefore, in this research, we only calculate Eq. (2) on an FPGA, while the host CPU handles the transpose operations and other additional operations involving α and β. To support those additional operations, we develop an API that is compatible with the standard Rgemm provided by MPLAPACK. It enables us to use our binary128 GEMM designs immediately in numerical applications with minimal changes.
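The paper does not reproduce this API, so the following C sketch only illustrates the division of labor described above: a wrapper with the standard 13 GEMM arguments lets the device evaluate Eq. (2) while the host applies α and β. The names rgemm_fpga and fpga_gemm_ab, and the packed column-major layout, are hypothetical placeholders, not the actual MPLAPACK or OpenCL host interfaces.

```c
#include <stdlib.h>

/* Stand-in for the OpenCL host code that runs Eq. (2), P = A*B, on the FPGA
 * (column-major, packed). Here it is a plain CPU loop so the sketch compiles. */
static void fpga_gemm_ab(long m, long n, long k,
                         const __float128 *A, const __float128 *B, __float128 *P)
{
    for (long j = 0; j < n; ++j)
        for (long i = 0; i < m; ++i) {
            __float128 acc = 0.0Q;
            for (long p = 0; p < k; ++p)
                acc += A[i + p * m] * B[p + j * k];
            P[i + j * m] = acc;
        }
}

/* Hypothetical Rgemm-compatible wrapper (13 standard GEMM arguments).
 * Transposition is not handled in this sketch; matrices are assumed packed. */
void rgemm_fpga(const char *transa, const char *transb,
                long m, long n, long k,
                __float128 alpha, const __float128 *A, long lda,
                const __float128 *B, long ldb,
                __float128 beta, __float128 *C, long ldc)
{
    (void)transa; (void)transb; (void)lda; (void)ldb;
    __float128 *P = malloc((size_t)m * (size_t)n * sizeof *P);
    if (!P) return;
    fpga_gemm_ab(m, n, k, A, B, P);                 /* Eq. (2) on the device   */
    for (long j = 0; j < n; ++j)                    /* host: C = alpha*P + beta*C */
        for (long i = 0; i < m; ++i)
            C[i + j * ldc] = alpha * P[i + j * m] + beta * C[i + j * ldc];
    free(P);
}
```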
B. Performance Models

Here, we summarize the performance models for our binary128 GEMM design. In this section, f represents the clock frequency of the logic circuit design in MHz.

1) Performance of GEMM: The peak performance of the designs depends on the layout of the systolic arrays, as shown in Fig. 1. When we use PR × PC PEs, the peak performance Fpeak (GFlops) is given by Eq. (3).
Fpeak = (2 × PR × PC × f × 10^6) / 10^9    (3)

The measured performance Fperf of the designs in GFlops is calculated by Eq. (4), where Texec is the execution time in seconds.

Fperf = (2mnk) / (Texec × 10^9)    (4)

In Eq. (2), m, n, and k denote the matrix size parameters. For the multiplication of n × n square matrices, the number of FP operations is 2n^3.

2) Memory Bandwidth Requirement: The performance of the designs is also affected by the memory bandwidth of an FPGA board. A PR × PC systolic array takes PR + PC inputs conveyed by the two vertical and horizontal Feed pipelines at every cycle. Thus, the required memory bandwidth Breq (GB/s) is given by Eq. (5).

Breq = ((PR + PC) × f × 10^6 × NByte) / 10^9    (5)

NByte represents the word size, established as 16 bytes in the present work. If the systolic array consists of 8 × 8 PEs, Breq equals 256f × 10^−3 GB/s. For example, the requirement Breq becomes 51.2 GB/s for a design where the clock frequency f is 200 MHz. To fully utilize all PEs in the designs, Breq must be smaller than the memory bandwidth of a target FPGA board.
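As a quick cross-check of Eqs. (3) and (5), the short C program below (our own helper, not part of the paper's code base) evaluates both models; for an 8 × 8 array at f = 200 MHz with NByte = 16 it reproduces the 51.2 GB/s requirement quoted above together with a 25.6 GFlops peak.

```c
#include <stdio.h>

/* Eq. (3): peak performance in GFlops for a PR x PC array at f MHz. */
static double fpeak_gflops(int pr, int pc, double f_mhz) {
    return 2.0 * pr * pc * f_mhz * 1e6 / 1e9;
}

/* Eq. (5): required memory bandwidth in GB/s; nbyte = 16 for binary128. */
static double breq_gbs(int pr, int pc, double f_mhz, int nbyte) {
    return (pr + pc) * f_mhz * 1e6 * nbyte / 1e9;
}

int main(void) {
    int pr = 8, pc = 8;
    double f = 200.0;                                            /* MHz */
    printf("Fpeak = %.1f GFlops\n", fpeak_gflops(pr, pc, f));    /* 25.6 */
    printf("Breq  = %.1f GB/s\n",  breq_gbs(pr, pc, f, 16));     /* 51.2 */
    return 0;
}
```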
A. Benchmarking Conditions
1) Target FPGA Systems: Table I shows the specification of
FPGAs used in this benchmarking: Terasic DE5a-Net Arria10,
Nallatech (BittWare) 520N Stratix10, and Terasic DE10a-Net
Agilex. The Stratix10 FPGA is a computation node of Cygnus,
a supercomputer system operating at the University of Tsukuba
in Japan since 2019. We use the Intel FPGA SDK for OpenCL to design and implement our binary128 GEMM designs. A different host system hosts each FPGA, as specified in the bottom rows of Table I.
2) Evaluation Method: We first evaluate our binary128 GEMM designs for square matrices by scaling n. Also, we evaluate the performance of multiplying non-square matrices with sizes m × k and k × n as more realistic and practical evaluations. To calculate the performance in GFlops, Eqs. (3) and (4) are used. The computation time Texec in Eq. (4) is the average of three trials in each benchmarking. As a target of comparison, we use a baseline of the Rgemm executed on the host system of Agilex (i9-10900 CPU) with 20 threads by OpenMP parallelization.

Besides, we compare numerical accuracy with the Rgemm routine provided by MPLAPACK on a CPU. As shown in Eq. (6), we calculate the average L1 norm of the difference between two n × n matrices as EL1 throughout the evaluation.

EL1 = ( Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} | C^F_ij − C^R_ij | ) / n^2    (6)

In Eq. (6), C^F and C^R denote the result matrices by our implementation for FPGAs and Rgemm, respectively. EL1 allows us to determine how accurately our binary128 GEMM designs match the results of the reference implementation.
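As a concrete illustration of Eq. (6), the following C sketch (our own helper, not code from the paper) accumulates the element-wise L1 difference in binary128 and prints it with libquadmath.

```c
#include <quadmath.h>
#include <stdio.h>
#include <stddef.h>

/* Eq. (6): average L1 norm of the element-wise difference between the FPGA
 * result cf and the reference Rgemm result cr, both n x n. */
static __float128 el1(size_t n, const __float128 *cf, const __float128 *cr)
{
    __float128 sum = 0.0Q;
    for (size_t i = 0; i < n * n; ++i)
        sum += fabsq(cf[i] - cr[i]);            /* |C^F_ij - C^R_ij| */
    return sum / ((__float128)n * (__float128)n);
}

static void print_el1(size_t n, const __float128 *cf, const __float128 *cr)
{
    char buf[64];
    quadmath_snprintf(buf, sizeof buf, "%.3Qe", el1(n, cf, cr));
    printf("E_L1 = %s\n", buf);
}
```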
TABLE I
SPECIFICATION OF FPGA SYSTEMS IN OUR PERFORMANCE EVALUATION
FPGA Arria10 Stratix10 Agilex
Logic cells 427,200 933,120 487,200
DSP blocks 1,518 5,760 4,510
M20K RAM blocks 2,713 11,721 7,110
Memory bits (total) 55,562,240 240,046,080 145,612,800
Board Memory 2× DDR3-1066 8GB 4× DDR4-2400 8GB 4× DDR4-2666 8GB
Board Memory Bandwidth 34.2GB/s 76.8GB/s 85.2GB/s
PCIe Gen2 x8 Gen3 x8 Gen3 x16
Quartus ver. 19.1 20.4 21.1
CPU i7-2600K Xeon Gold 6226 i9-10900
Ncore /Nthread 4/8 16/32 10/20
Host memory size 16GB 192 GB 64 GB
Host OS Ubuntu 18.04.6LTS CentOS 7.9.2009 Ubuntu 20.04.5LTS
gcc ver. 8.4.0 7.4.0 9.4.0
TABLE II
SYNTHESIS RESULTS OF EACH PR × PC DESIGN ON ARRIA10 FPGA

TABLE III
SYNTHESIS RESULTS OF EACH PR × PC DESIGN ON STRATIX10 AND AGILEX FPGAS
To highlight the main characteristics of computational performance, we begin by evaluating the designs on the Arria10 FPGA in this section. The following section covers the performance evaluation of the designs on newer FPGAs, including Stratix10 and Agilex.

B. Benchmarking Results on Arria10

1) Evaluation for Square Matrices: We present benchmarking results for our binary128 GEMM designs. The systolic array consists of PEs arranged in a square with PR = PC = 2, 4, and 8. Table II shows the logic synthesis results on the Arria10 FPGA system.

Our binary128 GEMM design requires more DSP blocks for larger PE arrays. Therefore, the number of available DSP blocks is the primary constraint for the design. The row labeled Fmax shows the clock frequency of each design, and the corresponding peak performance Fpeak, based on Eq. (3), is shown in the last row.

Fig. 2. Performance of our binary128 GEMM designs for square matrices on Arria10 FPGA

Fig. 2 shows the performance of each design on Arria10. The matrix size n ranges from 64 to 4096. The performance Fperf of the designs with 2×2, 4×4, and 8×8 PEs reaches a maximum of 1.88, 7.1, and 15.0 GFlops, respectively. Since each PE can work independently for data streaming and operations on the systolic array, the performance is proportional to the number of PEs in the design.

However, with a small n, the computation load for each PE is not sufficiently high to reach the maximum performance of the designs. The performance reaches its peak at a specific n, such as n = 2048 for 8×8 PEs, and the performance scaling becomes flat at larger n.

We then evaluate the numerical error EL1 of the computation results between our binary128 GEMM designs and the Rgemm routine based on Eq. (6). EL1 for n < 512 is distributed between 10^−31 and 10^−30. As we set n to 4096, EL1 increases to 2.0 × 10^−28. The layout of PEs does not make a significant difference in EL1.

Regarding the comparison between Fperf and Fpeak, the ratio for the designs of 2 × 2, 4 × 4, and 8 × 8 PEs is 99.5%, 97.3%, and 58.2%, respectively. Recall that the memory bandwidth requirement Breq is given by Eq. (5). As we substitute the Fmax of each design in Fig. 2 for f in Eq. (5), we find Breq to be 15.1 GB/s, 29.2 GB/s, and 51.5 GB/s for 2×2, 4×4, and 8×8, respectively. Our Arria10 system has two DDR3 memories that provide 34.2 GB/s of total bandwidth. This is sufficient for the designs of 2 × 2 and 4 × 4 PEs. As a result, their Fperf is close to the peak. However, the design of 8 × 8 PEs requires 51.5 GB/s, which is 1.5x larger than the available bandwidth. Therefore, the design of 8 × 8 PEs is limited by memory transfer from DRAM. As a result, we see that the ratio between Fperf and Fpeak is much lower than that of the other designs with fewer PEs.

2) Effects of Memory Buffer for the Systolic Array: To enhance performance, we instantiate more PEs in our binary128 GEMM design. However, the memory bandwidth of the FPGA board poses a limitation. Therefore, the systolic array generated by FBLAS has a module called memory tile in front of the Feed module. It is a local memory buffer working as a cache memory for each PE to mitigate the memory bandwidth requirement given in Eq. (5). As the systolic array incorporates a larger number of PEs, increasing the size of MTile is necessary to provide a larger buffer in our binary128 GEMM designs.

The results presented in Sec. IV-B1 were all obtained by the designs with MTile = 32. We then conduct additional benchmarking to further investigate the potential performance improvement of adopting a larger value of MTile. Fig. 3 illustrates the performance of the GEMM using the designs of 4 × 4 and 8 × 8 PEs, where MTile ranges from 24 to 256.
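The role of the memory tile can be illustrated with a host-side C model of the same reuse pattern. This is only a conceptual sketch under the assumption of row-major square matrices, not FBLAS's on-chip implementation: each block of mtile rows of A is staged into a small buffer once and then reused for every column of B, so the traffic for A shrinks roughly by the reuse factor.

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* Conceptual model of memory-tile reuse (not FBLAS's on-chip code). */
static void gemm_row_tiled(size_t n, size_t mtile,
                           const __float128 *A,   /* n x n, row-major */
                           const __float128 *B,   /* n x n, row-major */
                           __float128 *C)         /* n x n, row-major */
{
    __float128 *tile = malloc(mtile * n * sizeof *tile);
    if (!tile) return;
    for (size_t i0 = 0; i0 < n; i0 += mtile) {
        size_t rows = (i0 + mtile <= n) ? mtile : n - i0;
        /* Stage a tile of A once ... */
        memcpy(tile, &A[i0 * n], rows * n * sizeof *tile);
        /* ... and reuse it for every column of B. */
        for (size_t j = 0; j < n; ++j)
            for (size_t i = 0; i < rows; ++i) {
                __float128 acc = 0.0Q;
                for (size_t p = 0; p < n; ++p)
                    acc += tile[i * n + p] * B[p * n + j];
                C[(i0 + i) * n + j] = acc;
            }
    }
    free(tile);
}
```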
However, compared to the design of 8×8 PEs, its performance improvement is sluggish because the Fmax of the 8 × 16 PEs significantly dropped and led to a low Fpeak of the design.

As we examine the performance of the designs on Agilex, the optimization of the PE layout and MTile successfully contributed to performance improvement. While the design of 8 × 8 PEs with MTile = 128 certainly performs effectively, that of 8 × 16 PEs with MTile = 512 is much better. The computation by the 8 × 16 PEs achieved 90.9 GFlops, 91% of the peak, for the largest matrix size of n = 24576, in contrast to the 8 × 8 PEs yielding 50.4 GFlops at n = 18000, about 96% of its peak.

The importance of the size of MTile can be easily understood by comparing it with a reference plot for the design of 8 × 16 PEs with MTile = 128 on Agilex. If we set MTile = 128, the performance of the design is at most 77 GFlops, which is only 77% of the peak. In particular, a trench in the plot at n = 16384 results in a significant performance drop to 54.1 GFlops around that point. One reason may be that those specific large matrices accidentally cause accesses that stride over different memory banks on the four independent DIMMs of the Agilex FPGA board. However, the memory buffer exploited by the larger MTile (e.g., 512) helps to alleviate problems related to unexpected memory access patterns and facilitates steady performance improvement.

Finally, our binary128 GEMM design achieves very high performance compared to the Rgemm routine executed on the CPU with 20 threads, whose performance settles at 650 MFlops for n > 1024. Therefore, we have a significant advantage in processing large matrices. The design of 8 × 16 PEs with MTile = 512 on Agilex is 145x faster than the computation on a recent CPU with the maximum number of threads.

In addition, we show the performance of our binary128 GEMM designs for non-square matrices on the Stratix10 and Agilex FPGAs. Fig. 6 shows the benchmarking result when m and n are fixed to m = n = 16384, and k is scaled between 32 and 16384. As presented in the benchmarking on Arria10, the performance drop for ratios of n : k < 2 : 1 is not significant. However, for tall-skinny matrices where k is particularly small, like k ≤ 128, even the performance on Agilex is just a few GFlops. As a result, the advantage of our binary128 GEMM designs compared to computation on CPUs is lost.

V. APPLICATION OF BINARY128 MATRIX MULTIPLICATION

Once we have our binary128 GEMM designs based on the systolic array architecture, we can accelerate practical applications which require binary128 GEMM operations. We here describe two applications of our implementation with performance evaluation. In this section, R^{n×n} denotes n × n real matrices.

A. Blocked LU Decomposition

1) Problem Specification of LU Decomposition: The LU decomposition is a fundamental operation in numerical analysis that factorizes a given square matrix A as a product of lower and upper triangular matrices, A = LU, where L and U are lower and upper triangular matrices, respectively. Based on BLAS routines, the LU decomposition in binary64 precision is implemented as a routine called dgetrf in LAPACK. The dgetrf routine adopts a blocked LU decomposition algorithm that has been thoroughly investigated and implemented for every supercomputer over the last four decades; a variation of it is the most famous parallel benchmarking program, LINPACK. The blocked LU decomposition algorithm effectively solves dense linear equations on accelerator architectures like GPUs since its computation is mainly processed as GEMM operations.

Let us consider the LU decomposition for a matrix A ∈ R^{n×n} with the block size b, as shown in Fig. 7. Then, we obtain L and U on A by repeating the following procedure recursively (see the code sketch below):

1) Divide A into 4 sub-matrices: A11 ∈ R^{b×b}, A12 ∈ R^{b×(n−b)}, A21 ∈ R^{(n−b)×b}, and A22 ∈ R^{(n−b)×(n−b)}.
2) Perform the decomposition A11 = L11 U11.
3) Solve U12 that satisfies L11 U12 = A12.
4) Solve L21 that satisfies L21 U11 = A21.
5) Update A22 by A22 = A22 − L21 U12.
6) If n − b > 0 still holds, go back to step 1 after substituting A with A22.
Fig. 7. Blocked LU decomposition of a matrix A where the block size is b

Fig. 8. Performance of LU decomposition on Stratix10 and Agilex FPGAs
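A minimal host-side sketch of steps 1)–6) in C is given below (row-major storage, no pivoting). In the actual implementation, steps 2–4 correspond to MPLAPACK routines and the step-5 update is the single GEMM call that is offloaded to the FPGA; here all steps are written as plain __float128 loops so that the sketch is self-contained.

```c
#include <stddef.h>

/* Blocked LU, in place on A (leading dimension lda), following steps 1)-6). */
static void lu_blocked(size_t n, size_t b, __float128 *A, size_t lda)
{
    for (size_t s = 0; s < n; s += b) {
        size_t nb = (s + b <= n) ? b : n - s;   /* size of A11              */
        size_t r  = n - s - nb;                 /* trailing dimension (A22) */

        /* Step 2: unblocked LU of A11 (L11 unit-lower, U11 upper, in place). */
        for (size_t kk = 0; kk < nb; ++kk)
            for (size_t i = kk + 1; i < nb; ++i) {
                __float128 l = A[(s+i)*lda + s+kk] /= A[(s+kk)*lda + s+kk];
                for (size_t j = kk + 1; j < nb; ++j)
                    A[(s+i)*lda + s+j] -= l * A[(s+kk)*lda + s+j];
            }
        if (r == 0) break;

        /* Step 3: solve L11 * U12 = A12 (forward substitution per column). */
        for (size_t j = 0; j < r; ++j)
            for (size_t i = 1; i < nb; ++i)
                for (size_t kk = 0; kk < i; ++kk)
                    A[(s+i)*lda + s+nb+j] -= A[(s+i)*lda + s+kk] * A[(s+kk)*lda + s+nb+j];

        /* Step 4: solve L21 * U11 = A21 (per row of A21). */
        for (size_t i = 0; i < r; ++i)
            for (size_t kk = 0; kk < nb; ++kk) {
                for (size_t p = 0; p < kk; ++p)
                    A[(s+nb+i)*lda + s+kk] -= A[(s+nb+i)*lda + s+p] * A[(s+p)*lda + s+kk];
                A[(s+nb+i)*lda + s+kk] /= A[(s+kk)*lda + s+kk];
            }

        /* Step 5: A22 -= L21 * U12 -- the GEMM offloaded to the FPGA. */
        for (size_t i = 0; i < r; ++i)
            for (size_t j = 0; j < r; ++j) {
                __float128 acc = A[(s+nb+i)*lda + s+nb+j];
                for (size_t p = 0; p < nb; ++p)
                    acc -= A[(s+nb+i)*lda + s+p] * A[(s+p)*lda + s+nb+j];
                A[(s+nb+i)*lda + s+nb+j] = acc;
            }
        /* Step 6: the outer loop continues with A22 as the new A. */
    }
}
```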
In step 5, we have the matrix multiplication L21 U12. When b = 1, the blocked LU decomposition is reduced to a non-blocked routine called dgetrf2 in LAPACK. When b is large enough, the computation of dgetrf is dominated by GEMM operations in step 5. Accordingly, it can be accelerated by GEMM routines on GPUs or FPGAs.

In MPLAPACK [11], all BLAS and LAPACK routines are extended to support multi-precision FP operations, including binary128. We modify an extended version of dgetrf in MPLAPACK called Rgetrf, which calls the Rgemm routine. In this paper, we replace calls to Rgemm with our binary128 GEMM operations executed on FPGAs.

The number of FP operations in the LU decomposition algorithm is 2n^3/3 − n^2/2 + 5n/6 [27]. Here, we regard it as 2n^3/3. Therefore, F′perf as given in Eq. (7) is used as the computation performance for the following evaluation.

F′perf = (2n^3) / (3 × Texec × 10^9)    (7)

2) Evaluation of GEMM for LU Decomposition: We assume input n × n matrices whose elements are given by random numbers in the range [0.0, 1.0). Then, the input matrices can be factorized by the LU decomposition. We decompose the square matrices by applying our binary128 GEMM designs in the algorithm.

Based on the evaluation in the previous section, we measure the performance of blocked LU decomposition with the design of 8×16 PEs on the Agilex FPGA. We scale the size of matrices n and apply different block sizes b to find the optimal size of b. As a comparison, we present a result on the design of 8 × 16 PEs on Stratix10 where b = 128. We also give another comparison with a result obtained through computation using only the host CPU (Intel Core i9-10900). In that computation, the Rgetrf routine in MPLAPACK takes charge of the LU decomposition with 20 threads by OpenMP parallelization.

Fig. 8 summarizes our results of the LU decomposition. For the Agilex FPGA, we present the performance in each case of b = 108, 128, and 144. The black line shows the performance scaling obtained by the computation on the CPU.

We observe that b = 108 yields the best performance on the Agilex FPGA, as represented by 2.5 GFlops at n = 20000. However, with a large matrix of n = 24576, a higher b yields the peak. We can see in the figure that the highest performance is 2.6 GFlops, obtained with b = 144 for the matrix of n = 24576. On the other hand, the performance deteriorates when we apply even larger values of b such as b = 192 and 256, yielding 2.3 GFlops and 2.1 GFlops, respectively. Similarly, the design on the Stratix10 FPGA is superior to the CPU computation for n > 3000. Although it is slower than the computation on the Agilex FPGA, it finally reaches 2.2 GFlops at n = 20000, which is 4.7x faster than that of the CPU.

Since the performance on FPGAs improves slowly with scaling n until the computation data saturate every PE, the performance on the CPU for small n is superior to that of the FPGAs. When the matrix size n = 512, the smallest size in this evaluation, the performance on the CPU is 278 MFlops, which is 2 to 3x faster than that of the FPGAs. We see that the intersection of the performance scaling between the CPU and the FPGAs is around n = 1536. The performance of the CPU execution does not improve for n > 2000, and is 458 MFlops at n = 24576. In contrast, the performance of the LU decomposition using our binary128 GEMM designs on the Agilex FPGA is at a maximum 5.3x faster than that of the CPU.

We compare the decomposed matrices L and U calculated by the designs on FPGAs with the reference result calculated by the CPU using Eq. (6). In the case of n ≤ 1536, where the CPU computation is still faster than the FPGAs, we find EL1 ∼ 10^−31. On the other hand, as we test the matrix of n = 24576, we find EL1 ∼ 10^−28. This consequence is the same as we expected, considering the previous evaluation of our binary128 GEMM design.

Finally, we compare our results with those of previous work by Kouya [16], who presented optimizations of LU decomposition using DD arithmetic. Specifically, they applied memory blocking and vectorization using AVX2 instructions and evaluated the performance on an Intel Core i9-10900X CPU.
According to their benchmarking for n = 1024, the performance of a conventional blocked LU decomposition code with b = 64 was 132 MFlops. Similarly, the performance of a vectorized LU decomposition code with b = 32 was 363 MFlops. In contrast, our result with the design of 8 × 16 PEs achieved 324.5 MFlops for n = 1024 and b = 108 on an Agilex FPGA. Even the fastest design on the high-end FPGA is not significantly beneficial for small matrices. As a result, from a performance perspective for small matrices, our binary128 GEMM designs are inferior to the vectorized LU decomposition code on a CPU.

However, we emphasize that our designs on recent FPGAs are much more effective for large n. With the current best performance of our LU decomposition being 2.5 GFlops, our FPGA designs are superior for large matrices. It is also worth noting that our work and the work by Kouya [16] use different FP formats. DD arithmetic is well suited for recent high-end CPUs equipped with vector arithmetic units, such as the AVX2 and AVX512 instructions on the x86-64 ISA and the Neon and SVE instructions on the ARM ISA.

B. Semidefinite Programming (SDP)

SDP is an optimization problem to minimize or maximize a given linear function under the constraint of symmetric semidefinite matrices. It has vast possible applications in engineering [28], finance [29], quantum chemistry [30], and physics [31], which have been investigated for a long time.

SDPA [32] is a numerical implementation and software package for SDP written in C++ [33]. The algorithm used in the SDPA is called the PDIPM, one of the iteration methods for SDP. Previous research [5] has extended the SDPA to support various-precision FP operations such as SDPA-GMP, -DD, and -QD [5]. The GMP version uses arbitrary precision arithmetic; thus, a user must specify the precision beforehand. These extended versions of the SDPA use a part of MPLAPACK [11] as a back-end, mainly through calling the Rgemm routine.

To determine which parameters are utilized in GEMM routines called from the SDPA, we run the 92 problems provided by SDPLIB [34] using SDPA-binary128 with MPLAPACK. As we are currently focusing on accelerating GEMM routines in our work, we have modified the code to record the 13 arguments specified in Listing 1 for the Rgemm routine during the execution of all problems.

Analysis of the collected data reveals that the SDPA frequently calls the Rgemm routine with non-square matrices, and none of the leading dimensions of the matrices in the Rgemm routine equal m, n, or k. Of the over 800 combinations of arguments recorded in the collected data, we find only 50 combinations where the condition n = m = k = lda = ldb = ldc holds. As shown in Sec. IV-B1, the performance of our binary128 GEMM designs on FPGAs for non-square matrices is inferior to that for square matrices.

Based on our analysis, we evaluate the performance of the SDPA calling the Rgemm operation accelerated by an FPGA only when either of two conditions is satisfied: (1) m equals n, or (2) m × n × k is larger than a predefined parameter Nmin = 10^6.

We test different Nmin and find that Nmin = 10^6 to 10^7 is optimal for the SDPA. We only present the performance benchmarking of the SDPA on the Agilex FPGA for selected problems from SDPLIB, shown in Table IV. We present the elapsed time per iteration of the SDPA-binary128 on three systems: CPU-A (Intel Xeon Gold 5122, 4 cores @ 3.60 GHz), CPU-B (Intel i9-10900, 10 cores @ 2.80 GHz), and CPU-B using our binary128 GEMM design of 8×16 PEs on Agilex. The performance with the FPGA is 2 to 4x and roughly 1.5x faster than that of CPU-A and CPU-B, respectively. Note that the performance of SDPA-binary128 on CPUs is proportional to the number of cores on a given CPU.

We verify that each solution computed by our binary128 GEMM design improves upon the solution obtained via double-precision calculations. As illustrated in Table V, we present the relative gaps, primal/dual feasible errors, and the numbers of iterations for problems theta2, theta3, theta4, theta6, and control11 from SDPLIB, as computed on CPU-B using binary128, FPGA (Agilex) using our design, the DD precision version [5], and the double precision version [32]. As smaller errors indicate better results, the solutions obtained via our binary128 GEMM design exhibit an improvement over those obtained via double precision calculations and are of comparable or slightly superior quality to those obtained via DD arithmetic. Our binary128 Rgemm accelerated by FPGAs effectively accelerates the PDIPM for SDP problems.

TABLE IV
ELAPSED TIME PER ITERATION IN SEC. OF THE SDPA ON CPUS AND AGILEX FPGA
Problem CPU-A CPU-B FPGA(Agilex)
theta2 0.8 0.45 0.42
theta3 4.99 2.68 2.11
theta4 21.17 10.24 7.28
theta5 69.35 30.82 20.17
theta6 191.4 79.54 48.3
control11 66.92 38.09 28.51
equalG51 141.04 66.87 33.32
gpp500-1 18.45 8.53 4.47
gpp500-2 18.58 8.53 4.56
maxG11 53.35 25.66 16.39
maxG32 803.42 380.92 232.81
maxG51 108.69 53.34 34.19
mpc500-1 11.49 5.39 3.36
mpc500-4 14.37 7.31 4.9
qpG11 264.96 111.95 64.48
qpG51 480.79 207.78 120.98
thetaG11 82.11 41.7 28.55
thetaG51 853.03 387.67 248.86

TABLE V
THE RELATIVE GAPS, PRIMAL/DUAL FEASIBLE ERRORS, AND THE NUMBER OF ITERATIONS FOR CERTAIN PROBLEMS FROM SDPLIB, CALCULATED ON CPU-B USING BINARY128, FPGA (AGILEX, BINARY128), THE DOUBLE-DOUBLE PRECISION VERSION (DD), AND THE DOUBLE PRECISION VERSION
Problem CPU-B FPGA DD double
theta2
relative gap 1.05e-24 1.16e-24 2.68e-25 1.45e-08
p.feas.error 3.70e-32 7.70e-33 4.93e-31 3.55e-15
d.feas.error 2.14e-25 2.89e-25 4.51e-27 5.77e-15
# of iterations 58 62 51 17
theta3
relative gap 5.71e-25 5.35e-24 1.86e-23 1.57e-08
p.feas.error 1.08e-32 1.23e-32 9.86e-31 8.88e-15
d.feas.error 2.43e-26 6.40e-25 1.42e-24 4.00e-15
# of iterations 55 50 61 17
theta4
relative gap 5.15e-25 5.89e-25 6.18e-27 2.25e-08
p.feas.error 2.62e-32 1.85e-32 7.89e-31 7.11e-15
d.feas.error 3.51e-26 9.35e-26 5.06e-28 1.47e-14
# of iterations 71 94 52 18
theta6
relative gap 6.90e-31 1.23e-30 6.28e-25 2.45e-08
p.feas.error 1.39e-32 1.85e-32 8.87e-31 1.42e-14
d.feas.error 5.74e-32 5.80e-32 5.19e-26 5.04e-14
# of iterations 45 46 54 18
control11
relative gap 9.01e-25 1.87e-23 2.24e-22 8.26e-06
p.feas.error 1.62e-27 2.02e-27 2.41e-25 1.86e-09
d.feas.error 1.60e-24 4.51e-24 1.50e-22 2.03e-07
# of iterations 64 62 60 47

C. Discussions on Application Performance

The blocked LU decomposition algorithm Rgetrf outlined in Sec. V-A employs the Rgemm operation to compute the step-5 update A22 = A22 − L21 U12, where both L21 and U12 are non-square and skinny. L21 and U12 are matrices of dimensions b × k and k × b, respectively. During the loop from step 2 to step 6, k is reduced as k = n − pb, where p represents the iteration number starting from p = 1. At an initial phase of the algorithm, k is large enough that our binary128 GEMM designs on the Agilex FPGA effectively accelerate the performance of Rgetrf. However, as k becomes much smaller than n at a later phase of the algorithm, the acceleration by the Agilex FPGA becomes ineffective. The blocking size b also impacts the performance of the GEMM on FPGAs. For instance, if b is too small, the performance of Rgemm on FPGAs is significantly reduced, as depicted in Figs. 2 and 5.

On the other hand, the PDIPM frequently calls the Rgemm operation for small non-square matrices with a wide range of combinations of matrix sizes n, k, and m. The largest matrix size in all problems presented in Table IV is only n = k = m = 2000. With a matrix size of n = k = m = 2000, the performance of Rgemm on FPGAs is half the peak performance. In most cases, the algorithm calls the Rgemm operation for much smaller matrices, in which case it is not executed on the FPGA. In a previous evaluation of a fast GEMM in DD arithmetic on GPUs by Nakata et al. [15], it was shown that the performance of the PDIPM in DD arithmetic accelerated by a GPU is more than 10x faster than that on a CPU with four cores. According to their results, the size of the matrices does not significantly affect the performance of Rgemm on the GPU. Therefore, they always utilized the GPU, except for very small matrices.

Despite the superior performance of our accelerated Rgemm implementation on the Agilex FPGA, which is more than 100x faster than the reference Rgemm on a 10-core CPU, the two applications evaluated in this section are not substantially accelerated by the FPGA. Therefore, to make our binary128 GEMM designs on FPGAs more practical for real-world applications, we will need to extensively modify the systolic array design generated by FBLAS to address the performance degradation for small matrices and non-square matrices. A potential solution is to develop an extended version of Rgemm that incorporates another level of blocking in the host code. Specifically, we could develop a new Rgemm API based on a batched GEMM algorithm [35]. It would allow us to instantiate multiple systolic arrays on an FPGA to handle the batched GEMM algorithm. A hardware implementation of a batched GEMM algorithm focusing on 64-bit and smaller FP formats was reported by Ledoux et al. [36]. Their systolic array design leverages a stalling-free output scheme for the output matrix C to maximize the overlap of host data transfers with GEMM computations.

VI. CONCLUSION

In this paper, we presented our binary128 GEMM implementation, its evaluation on different Intel FPGAs, and its integration into numerical applications such as blocked LU decomposition and SDP. Our GEMM designs on FPGAs are based on the 2-D systolic array generated by the FBLAS library. Furthermore, by optimizing the memory buffer size, which stores reused data in fast on-chip memory, we successfully implemented 8×16 PEs to accelerate the GEMM in binary128 arithmetic on FPGAs.

The benchmarking in this paper showed that our implementation is particularly advantageous when computing large matrices of size n > 10^4. For example, in our evaluation of our binary128 GEMM implementation on the Agilex FPGA, the performance was 90.9 GFlops, 91% of the estimated peak performance of the design. This resulted in a 147x speed-up compared to the Rgemm routine provided by MPLAPACK on an i9-10900 CPU with 20 threads.

Further benchmarking of various matrix multiplications showed that our designs are quite effective at accelerating GEMM operations for square and almost-square matrices. In other words, LU decomposition can be solved faster using our implementation than with existing CPU routines. However, our design was not effective at handling tall-skinny matrices, commonly found when solving semidefinite programming problems.

Our current systolic array designs for GEMM operations are based on the OpenCL kernels generated by the latest version of FBLAS [37]. FBLAS is designed to be flexible and accommodate various kernel configurations for different BLAS routines, such as General Matrix-Vector Multiplication (GEMV) and Triangular Solve with Multiple Right-Hand Sides (TRSM). However, in this study, we extracted only the systolic array kernels of GEMM for our work. Extending our work to other BLAS routines would be an interesting area for future research.

There is still room for optimization to improve the performance of our GEMM design when we use it to calculate tall-skinny matrix multiplications. Further optimizations are necessary to achieve the desired performance, especially for SDP problems. In future work, we will compare such optimized GEMM designs with other high-precision GEMM implementations on accelerators. Another area of future work will be to explore other FP formats in our GEMM designs by replacing the current binary128 multiply-add units with multiply-add units in different arithmetic.

ACKNOWLEDGMENT

A part of this paper is based on results obtained from a project, JPNP16007, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work was partly supported by MEXT as "Feasibility studies for the next-generation computing infrastructure" and KAKENHI Grant Number JP23K11133.

This research in part used computational resources of Cygnus provided by the Multidisciplinary Cooperative Research Program in the Center for Computational Sciences, University of Tsukuba.

We thank Prof. Ishikawa, High Energy Accelerator Research Organization, and Prof. Daisaka, Hitotsubashi University, Japan, for their help evaluating our designs on Stratix10.
REFERENCES

[1] "IEEE standard for floating-point arithmetic," IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1–84, 2019.
[2] T. Norrie, N. Patil, D. H. Yoon, G. Kurian, S. Li, J. Laudon, C. Young, N. Jouppi, and D. Patterson, "The design process for Google's training chips: TPUv2 and TPUv3," IEEE Micro, vol. 41, no. 2, pp. 56–63, 2021.
[3] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Review, vol. 38, no. 1, pp. 49–95, 1996. [Online]. Available: https://doi.org/10.1137/1038003
[4] F. Alizadeh, J.-P. A. Haeberly, and M. L. Overton, "Complementarity and nondegeneracy in semidefinite programming," Mathematical Programming, vol. 77, pp. 111–128, 1997. [Online]. Available: https://doi.org/10.1007/BF02614432
[5] M. Nakata, "A numerical evaluation of highly accurate multiple-precision arithmetic version of semidefinite programming solver: SDPA-GMP, -QD and -DD," in 2010 IEEE International Symposium on Computer-Aided Control System Design, 2010, pp. 29–34.
[6] C. Lichtenau, S. Carlough, and S. M. Mueller, "Quad precision floating point on the IBM z13," in 2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH), 2016, pp. 87–94.
[7] K. Nagasu, K. Sano, F. Kono, and N. Nakasato, "FPGA-based tsunami simulation: Performance comparison with GPUs, and roofline model for scalability analysis," Journal of Parallel and Distributed Computing, vol. 106, pp. 153–169, Aug. 2017.
[8] H. Kung and C. Leiserson, Systolic Arrays (for VLSI), ser. CMU-CS. Carnegie-Mellon University, Department of Computer Science, 1978. [Online]. Available: https://books.google.co.jp/books?id=pAKfHAAACAAJ
[9] T. De Matteis, J. de Fine Licht, and T. Hoefler, "FBLAS: Streaming linear algebra on FPGA," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. IEEE Press, 2020.
[10] N. Nakasato, H. Daisaka, and T. Ishikawa, "High performance high-precision floating-point operations on FPGAs using OpenCL," in 2018 International Conference on Field-Programmable Technology (FPT), 2018, pp. 262–265.
[11] M. Nakata, "MPLAPACK version 2.0.1 user manual," 2022.
[12] N. Nakasato, "A fast GEMM implementation on the Cypress GPU," SIGMETRICS Performance Evaluation Review, vol. 38, pp. 50–55, Mar. 2011.
[13] T. Dekker, "A floating-point technique for extending the available precision," Numerische Mathematik, vol. 18, pp. 224–242, 1971.
[14] D. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms. Reading, Massachusetts: Addison Wesley, 1998.
[15] M. Nakata, Y. Takao, S. Noda, and R. Himeno, "A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming," in 2012 Third International Conference on Networking and Computing, 2012, pp. 68–75.
[16] T. Kouya, "Acceleration of LU decomposition supporting double-double, triple-double, and quadruple-double precision floating-point arithmetic with AVX2," in 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), 2021, pp. 54–61.
[17] M. Joldes, J.-M. Muller, V. Popescu, and W. Tucker, "CAMPARY: CUDA multiple precision arithmetic library and applications," in Mathematical Software – ICMS 2016, G.-M. Greuel, T. Koch, P. Paule, and A. Sommese, Eds. Cham: Springer International Publishing, 2016, pp. 232–240.
[18] K. Isupov and V. Knyazkov, "Multiple-precision BLAS library for graphics processing units," in Supercomputing, V. Voevodin and S. Sobolev, Eds. Cham: Springer International Publishing, 2020, pp. 37–49.
[19] I. Flores, "Residue arithmetic and its application to computer technology (Nicholas S. Szabo and Richard I. Tanaka)," SIAM Review, vol. 11, no. 1, pp. 103–104, 1969. [Online]. Available: https://doi.org/10.1137/1011027
[20] T. Nakayama and D. Takahashi, "Implementation of multiple-precision floating-point arithmetic library for GPU computing," in Proceedings of the 23rd IASTED International Conference on Parallel and Distributed Computing and Systems (Dallas, USA, December 2011). ACTA Press, 2011, pp. 343–349.
[21] D. Mukunoki, K. Ozaki, T. Ogita, and T. Imamura, "Accurate matrix multiplication on binary128 format accelerated by Ozaki scheme," in 50th International Conference on Parallel Processing, ser. ICPP 2021. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3472456.3472493
[22] K. Ozaki, T. Ogita, S. Oishi, and S. Rump, "Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications," Numerical Algorithms, vol. 59, no. 1, pp. 95–118, Jan. 2012.
[23] J. de Fine Licht, G. Kwasniewski, and T. Hoefler, "Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis," in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), 2020.
[24] J. de Fine Licht, C. A. Pattison, A. N. Ziogas, D. Simmons-Duffin, and T. Hoefler, "Fast arbitrary precision floating point on FPGA," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2022, pp. 1–9.
[25] L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmermann, "MPFR: A multiple-precision binary floating-point library with correct rounding," ACM Trans. Math. Softw., vol. 33, no. 2, pp. 13–es, Jun. 2007. [Online]. Available: https://doi.org/10.1145/1236463.1236468
[26] "Information technology — programming languages, their environments, and system software interfaces — floating-point extensions for C — part 3: Interchange and extended types," International Organization for Standardization, Geneva, CH, Tech. Rep. ISO/IEC TS 18661-3:2015, 2015.
[27] S. Blackford and J. Dongarra, "LAPACK Working Note 41: Installation guide for LAPACK," 1999.
[28] M. Ohsaki, K. Fujisawa, N. Katoh, and Y. Kanno, "Semi-definite programming for topology optimization of trusses under multiple eigenvalue constraints," Computer Methods in Applied Mechanics and Engineering, vol. 180, pp. 203–217, 1999.
[29] A. Gepp, G. Harris, and B. Vanstone, "Financial applications of semidefinite programming: a review and call for interdisciplinary research," Accounting and Finance, vol. 60, Sep. 2019.
[30] M. Fukuda, B. Braams, M. Nakata, M. Overton, J. Percus, M. Yamashita, and Z. Zhao, "Large-scale semidefinite programs in electronic structure calculation," Math. Program., vol. 109, pp. 553–580, Mar. 2007.
[31] D. Poland, S. Rychkov, and A. Vichi, "The conformal bootstrap: Theory, numerical techniques, and applications," Rev. Mod. Phys., vol. 91, p. 015002, Jan. 2019. [Online]. Available: https://link.aps.org/doi/10.1103/RevModPhys.91.015002
[32] M. Yamashita, K. Fujisawa, M. Fukuda, K. Kobayashi, K. Nakata, and M. Nakata, Latest Developments in the SDPA Family for Solving Large-Scale SDPs. Boston, MA: Springer US, 2012, pp. 687–713. [Online]. Available: https://doi.org/10.1007/978-1-4614-0769-0_24
[33] "SDPA (SemiDefinite Programming Algorithms) official page." [Online]. Available: http://sdpa.sourceforge.net/
[34] "SDPLIB 1.2." [Online]. Available: https://github.com/vsdp/SDPLIB/
[35] A. Haidar, T. Dong, P. Luszczek, S. Tomov, and J. Dongarra, "Optimization for performance and energy for batched matrix computations on GPUs," in Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, ser. GPGPU-8. New York, NY, USA: Association for Computing Machinery, 2015, pp. 59–69. [Online]. Available: https://doi.org/10.1145/2716282.2716288
[36] L. Ledoux and M. Casas, "A generator of numerically-tailored and high-throughput accelerators for batched GEMMs," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2022, pp. 1–10.
[37] "FBLAS." [Online]. Available: https://github.com/spcl/FBLAS/