
AN FPGA-BASED IMPLEMENTATION OF THE MINRES ALGORITHM

David Boland, George A. Constantinides ∗

Electrical and Electronic Engineering Department,


Imperial College London
[email protected], [email protected]

ABSTRACT

Due to continuous improvements in the resources available on FPGAs, it is becoming increasingly possible to accelerate floating point algorithms. The solution of a system of linear equations forms the basis of many problems in engineering and science, but its calculation is highly time consuming. The minimum residual algorithm (MINRES) is one method to solve this problem, and is highly effective provided the matrix exhibits certain characteristics. This paper examines an IEEE 754 single precision floating point implementation of the MINRES algorithm on an FPGA. It demonstrates that through parallelisation and heavy pipelining of all floating point components it is possible to achieve a sustained performance of up to 53 GFLOPS on the Virtex5-330T. This compares favourably to other hardware implementations of floating point matrix inversion algorithms, and corresponds to an improvement of nearly an order of magnitude compared to a software implementation.

∗ The authors would like to acknowledge the support of the EPSRC (Grants EP/C549481/1 and EP/E00024X/1).

1. INTRODUCTION

The solution to a system of linear equations of the form Ax = b (where A is an N × N matrix, while x and b are N × 1 vectors) forms the basis of a large number of problems, most notably in the realm of scientific computing. In order to obtain accurate results, it is often desirable to solve such problems using floating point representation. As research has identified trends indicating the floating point performance of FPGAs will significantly exceed that of traditional processors [1], using an FPGA to generate an efficient solution to these problems is an important field of study.

There are two main families of methods to solve this type of linear algebra problem: direct methods and iterative methods. Direct methods find the solution in one shot, typically via some form of computationally intensive matrix factorization, whereas iterative methods refine a solution with each iteration. FPGA implementations for a variety of direct methods have been created, both in fixed point (Cholesky [2]) and floating point (Gauss-Jordan [3], LU decomposition [4] and QR decomposition [5]). The salient features of these implementations are highlighted in Table 1.

However, it is the use of iterative methods to solve linear equations that is of growing interest within the field of FPGAs, for these methods largely consist of multiply-accumulate operations as opposed to division operations, which are often required in direct methods (for example any pivoting operation). It is well known that multiply-accumulate operations are highly suited to FPGAs due to their use in DSP type applications, whilst the suitability of performing this type of operation in floating point is justified by studies demonstrating efficient matrix multiplications using floating point [6, 7]. Furthermore, iterative methods initialized with a good initial guess will converge much faster, and this is the case in many scientific computing problems, especially in optimisation [8]. For these reasons there has been growing interest in using one specific iterative method, the Conjugate Gradient algorithm, to solve these problems [9, 10]. However, it can be shown that the Conjugate Gradient algorithm breaks down if the A matrix is not symmetric positive definite, and hence a more general method is desirable.

This paper examines the MINRES algorithm [11], which is another iterative algorithm that is an efficient solver for cases where the A matrix is symmetric. The motivation for choosing this specific algorithm is that it strikes a good balance between complexity and generality, in that many scientific computing problems could be mapped to problems with symmetric matrices that are not necessarily positive definite [8].

To the best knowledge of the authors, there has as yet been no implementation of the MINRES algorithm on an FPGA and therefore any direct comparison is not possible. However, compared to alternative hardware implementations of matrix inversion, this implementation achieves a significantly higher performance, as shown in Table 1. It should be noted that some other implementations can process larger matrix orders. However, this depends upon the exploitation of sparse structure in the matrices or through using off-chip RAM to store intermediate results. Sparse matrix solvers can handle much larger matrices as it is not necessary to hold or operate on any zeros in the matrix, but are less general by
Table 1. Comparison of Floating Point Matrix Inversion Methods

Type      | Method                  | Year | GFLOPS | Vs. Software | Max Order | Sparsity | Off-chip RAM | Device      | Requirements of A
Direct    | Gauss-Jordan [3]        | 2006 | N/A    | 4×           | 1700      | Dense    | Yes          | Virtex II   | Non-Singular
Direct    | LU [4]                  | 2006 | 2.6    | 6×           | 1000      | Dense    | Yes          | Stratix II  | Non-Singular
Direct    | QR [5]                  | 2008 | 35     | N/A          | 12        | Dense    | No           | Virtex5     | Non-Singular
Iterative | Conjugate Gradient [9]  | 2006 | N/A    | 1.3×         | 2000      | Sparse   | Yes          | 2×Virtex II | Symmetric Positive Definite
Iterative | Conjugate Gradient [10] | 2008 | 35     | 5×           | 58        | Dense    | No           | Virtex5     | Symmetric Positive Definite
Iterative | MINRES (this work)      | 2008 | 53     | 8.8×         | 145       | Dense    | No           | Virtex5     | Symmetric

definition. Using off-chip RAM to hold the intermediate results enables larger matrices to be held, but the I/O requirements to load onto the FPGA create a bottleneck (typically determined by the number of off-chip RAMs) upon performance and efficiency, typically leading to low speedups, as shown in Table 1 for [3] and [4]. The embedded memories on modern FPGA devices, however, now have the capacity to buffer large matrices on chip, a technique we exploit to break this bottleneck, at the cost of limiting the order of matrix to up to 145. This order of matrix is considered relatively large for dense problems and is sufficient for many applications which depend upon matrix inversion [12], and could be used as a building block for solving larger systems. Moreover, it is 2 to 12 times larger than previous dense on-chip solvers.

The main contributions of this paper are:

• A demonstration of the suitability of the MINRES algorithm for use on an FPGA,

• An analysis of the design decisions and trade-offs involved to create an optimum floating point implementation of the MINRES algorithm in hardware, including efficiency and pipeline depth,

• A design for solving multiple dense systems of linear equations in a pipeline for orders up to 145 using the MINRES algorithm, with results demonstrating a sustained performance, taking into account I/O overhead, of up to 53 GFLOPS.

This paper describes the MINRES algorithm and briefly highlights both the advantages and the additional complexities over its closest relative, the Conjugate Gradient algorithm, in Section 2. The hardware implementation is described in Section 3, along with its associated results in Section 4. Finally, Section 5 concludes this paper.

2. MINRES ALGORITHM

The MINRES algorithm finds a (potentially approximate) solution x_k to the system of equations Ax = b (where A is an N × N matrix, x and b are N × 1 vectors) by performing a minimisation of the residual ||b − Ax_k||_2 in the two-norm over a Krylov Subspace [13]. It will generally converge to a very accurate solution without the need to calculate the entire subspace and hence the subspace is built iteratively, using the Lanczos process [11]. Overall, the pseudo code is described in Fig. 1, with the major sections of the algorithm highlighted.

% Initialisation
v_0 = 0 ; v_1 = b − A x_0
β_1 = ||v_1||_2
η = β_1
γ_0 = 1 ; γ_1 = 1
σ_0 = 0 ; σ_1 = 0
w_0 = 0 ; w_{−1} = 0
i = 1
while |η| > ε
  % Calculate Lanczos Vectors
  v_i = v_i / β_i
  α = v_i^T A v_i
  v_{i+1} = A v_i − α v_i − β_i v_{i−1}
  β_{i+1} = ||v_{i+1}||_2
  % Calculate QR Factors
  δ = γ_i α − γ_{i−1} σ_i β_i
  ρ_1 = sqrt(δ^2 + β_{i+1}^2)
  ρ_2 = σ_i α + γ_{i−1} γ_i β_i
  ρ_3 = σ_{i−1} β_i
  % Calculate New Givens Rotations
  γ_{i+1} = δ / ρ_1
  σ_{i+1} = β_{i+1} / ρ_1
  % Update Solution
  w_i = (v_i − ρ_3 w_{i−2} − ρ_2 w_{i−1}) / ρ_1
  x_i = x_{i−1} + γ_{i+1} η w_i
  η = −σ_{i+1} η
  i = i + 1
end

Fig. 1. MINRES Algorithm [14].

The Conjugate Gradient Method can also be interpreted as an algorithm that makes use of the Lanczos process and therefore there are some similarities between the two methods [13]. For cases where it is desirable to compare hardware implementations of these two methods, it is important to highlight the two major differences in terms of hardware costs, both of which are a result of working with the two-norm. Firstly, normalisation is required, resulting in square root operations. Secondly, it results in a three-term recurrence as opposed to the two-term recurrence in the Conjugate Gradient algorithm; this increases storage requirements. Thus the MINRES algorithm trades an increase in circuit complexity for the ability to solve a wider class of problems.

3. IMPLEMENTATION

The optimal hardware implementation is dependent upon a number of factors: number of resources, latency (in terms of cycles per iteration), throughput and efficiency (in terms of the amount of time resources are in use). The design described aims to achieve a good balance in terms of optimising these factors. The following sections detail the main considerations and potential trade-offs between these factors and justify any major decisions. To aid description in this section, a diagram of the circuit is shown in Fig. 2.

3.1. Floating Point Units

Xilinx Core Generator was used to generate the floating point components for the circuit. This environment enables the user to trade latency for maximum clock frequency. Provided the pipeline remains full, due to the increased clock frequency, a component with a higher latency potentially has a higher throughput.

It is possible to maximise the amount of time the pipeline is full by multiplexing several independent problems into the device. As a result, it was chosen to set all floating point components to work at their maximum latency, for an implementation with a high throughput that could potentially operate on multiple problems simultaneously would generally be more useful than a small reduction in latency for a single problem. To this end, it was assumed that in all situations there would be sufficient problems available to the circuit to fill the pipeline. This assumption is valid in many problems requiring matrix inversion, for example in the control community [12]. The number of independent problems required to keep the pipeline busy is discussed in Section 3.4, where it is shown that it approaches 4 for large problems.

3.2. Datapath

In order to ensure the clock frequency is as close as possible to the maximum available for the floating point components, it was decided whenever possible to use a dedicated component to perform each operation, as this minimises wiring and multiplexers between the floating point components. The exception to this is for expensive operations. The square root and division operators consume a large number of resources, whilst, as seen in the pseudo code (Fig. 1), they are only used during four operations. Furthermore, if the vector division is calculated by a single inversion (to compute 1/β and 1/ρ_1) followed by multiplication by the result, these two operators are only used twice per iteration per problem. Therefore it is both possible and desirable to re-use these components.

3.3. Parallelisation

It is clear from the pseudo code (Fig. 1) that the calculation of the Lanczos vectors is independent of the operations to perform the QR decomposition, Givens Rotations and updating the solution. Therefore it is possible for all these parts of the circuit to work in parallel, and this reduces the overall latency to be that of the Lanczos iteration.

Theoretically, it is also possible to parallelise matrix and vector operations; however, the limited number of resources on an FPGA means that for large N it is not possible to parallelise every operation.
The operation of highest computational complexity is the Matrix × Vector multiplication within the Lanczos iteration. Though a dedicated component to perform this calculation would significantly reduce the latency of the circuit, the resource usage would scale heavily with N (Θ(N^2) in terms of multipliers and adders) and also the I/O requirements for such an implementation would quickly exceed the capabilities of the FPGA, making it highly unscalable. Instead it was chosen to overlap the Matrix × Vector operation in a pipelined fashion within a dedicated Vector^T × Vector circuit. This involves a dedicated vector multiplier and an adder sum tree (as shown in Fig. 2), which has a cost of N multipliers and N − 1 adders, but reduces the latency to be Θ(N) instead of Θ(N^2). In comparison, if one were to parallelise the other vector operations, it would only remove a constant latency at a cost of an increased use of N operators. Furthermore, this V^T V circuit is re-used when calculating the norm, saving resources.

Fig. 2. Circuit Data Flow.
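To make the dataflow described above concrete, the following is a minimal software sketch of the iteration in Fig. 1, written in plain Python/NumPy. It is included only as a functional reference for the pseudo code: the function name, tolerance and loop structure are illustrative choices rather than details of the hardware design, and the matrix-vector product and the dot products are simply the operations that the V^T V circuit above evaluates in a pipelined fashion.

import numpy as np

def minres_fig1(A, b, x0, eps=1e-8):
    """Software rendering of the MINRES pseudo code in Fig. 1 (illustrative only)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    N = A.shape[0]
    # % Initialisation
    x = np.array(x0, dtype=float)
    v_prev = np.zeros(N)                         # v_{i-1}
    v = b - A @ x                                # v_i (unnormalised)
    beta = np.linalg.norm(v)                     # beta_i
    eta = beta
    gamma_prev, gamma = 1.0, 1.0                 # gamma_{i-1}, gamma_i
    sigma_prev, sigma = 0.0, 0.0                 # sigma_{i-1}, sigma_i
    w_prev2, w_prev = np.zeros(N), np.zeros(N)   # w_{i-2}, w_{i-1}
    for _ in range(N):                           # at most N iterations (Section 3.5)
        if abs(eta) <= eps:
            break
        # % Calculate Lanczos Vectors
        v = v / beta
        Av = A @ v                 # Matrix x Vector product, N dot products on the V^T V circuit
        alpha = v @ Av             # v_i^T (A v_i): a further dot product on the same circuit
        v_next = Av - alpha * v - beta * v_prev
        beta_next = np.linalg.norm(v_next)       # the V^T V circuit is re-used for this norm
        # % Calculate QR Factors
        delta = gamma * alpha - gamma_prev * sigma * beta
        rho1 = np.hypot(delta, beta_next)
        rho2 = sigma * alpha + gamma_prev * gamma * beta
        rho3 = sigma_prev * beta
        # % Calculate New Givens Rotations
        gamma_next = delta / rho1
        sigma_next = beta_next / rho1
        # % Update Solution
        w = (v - rho3 * w_prev2 - rho2 * w_prev) / rho1
        x = x + gamma_next * eta * w
        eta = -sigma_next * eta
        # shift the three-term recurrences for the next iteration
        v_prev, v = v, v_next
        beta = beta_next
        gamma_prev, gamma = gamma, gamma_next
        sigma_prev, sigma = sigma, sigma_next
        w_prev2, w_prev = w_prev, w
    return x

For a symmetric A, minres_fig1(A, b, np.zeros(len(b))) agrees with a direct solve to within the tolerance. The structure also makes the point of Section 3.3 visible: the dominant operations are the matrix-vector product and the dot products, which is why a single V^T V datapath, re-used for the norm, dominates the resource budget while the remaining scalar and vector updates need only a fixed number of operators.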
Using this parallelism, the total number of floating point components is given in equation (1) and the overall latency of the circuit is given by equation (2), where P is the number of independent problems to be stored in the pipeline. Referring to (2), the factor 3N is a result of N cycles needed to perform the Matrix-Vector product in the pipeline as described above, as well as 2N cycles for the two series-to-parallel conversions (for v_i and v_{i+1}) which are inputted to this circuit (Fig. 2); the factor c_1⌈log_2 N⌉ is a result of the summation tree in the V^T V circuit (Fig. 2); and the factor P is a result of the re-use of the V^T V circuit for the norm, meaning that for one cycle per problem it will not be performing a Matrix × Vector computation. The values c_1 and c_2 are constants representing the latency of the other operations.

Number of Floating Point Operators = 2N + 26.   (1)

Total Latency (cycles) = 3N + c_1⌈log_2 N⌉ + P + c_2.   (2)
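Equations (1) and (2) are simple enough to evaluate directly; the small helpers below are useful when reproducing the numbers quoted later. The constants c_1 and c_2 are not given numerically in the text, so any values passed in are placeholders only.

import math

def num_fp_operators(N):
    # Equation (1): the circuit uses 2N + 26 floating point operators.
    return 2 * N + 26

def iteration_latency(N, P, c1, c2):
    # Equation (2): 3N cycles for the pipelined Matrix-Vector product and the two
    # series-to-parallel conversions, c1*ceil(log2 N) for the summation tree,
    # P for the re-use of the V^T V circuit for the norm, plus a constant c2.
    return 3 * N + c1 * math.ceil(math.log2(N)) + P + c2

print(num_fp_operators(145))   # 316 operators at the largest supported order

Section 4.2 combines this operator count with the post-place-and-route clock frequency and the 96% efficiency to arrive at the sustained GFLOPS figure quoted for the design.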


3.4. Pipelining

As mentioned in Section 3.1, in order to maximise the efficiency of the circuit, the pipelines in the floating point components must continually be as full as possible, and this can be achieved by multiplexing P problems into the system. Due to the high resource usage of the V^T V circuit (N multipliers and N − 1 adders), in order to maintain a high efficiency, the minimum pipeline depth is chosen such that this component is always in operation.

Using the V^T V circuit described in Section 3.3, for P problems this circuit will be in operation for PN + P cycles. Thus an effective way to determine the minimum pipeline depth is to match PN + P with the latency for one iteration (equation (2)), for this ensures the V^T V circuit will be operating on other problems until it is again needed for the subsequent iteration of the first problem. Using this method, the pipeline depth is given by equation (3). It should be clear from equation (3) that the depth of the pipeline tends to the value 4 as N tends to infinity, implying for large matrices only a small number of independent problems are required to keep the pipeline busy. The depth of the minimum pipeline found by this method is shown in Fig. 3.

Pipeline Depth (P) = ⌈(3N + c_1⌈log_2 N⌉ + c_2) / N⌉.   (3)

Fig. 3. Plot of Pipeline Depth for increasing Matrix Order. This is the minimum number of problems required for the V^T V circuit to always be in full operation.

A graph illustrating the efficiency of the circuit using this minimum pipeline is shown in Fig. 4. This demonstrates that in all situations a high efficiency (above 70%) is achieved and the efficiency increases with matrix order, tending to 100%. This growth in efficiency is due to the number of operators for the V^T V circuit increasing with N, whilst the number of operators working on vectors and scalars remains constant (equation (1)). This implies for large N, the number of operators is dominated by the V^T V circuit, and this will always be in operation by design. Such a performance is highly unlikely to occur in any software implementation due to various delays such as cache misses. Indeed, efficiency of the order of 40 to 60% is common in software, even for a highly optimized implementation, as shown in [15].

It should be noted that the efficiency reaches a minimum for N = 6; below this order, the total latency (equation (2)) is small and hence any operators that work serially on vectors are used relatively efficiently.

Fig. 4. Plot of percentage Efficiency for increasing Matrix Order using the Pipeline Depth (3).
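The limiting behaviour of equation (3) is easy to check numerically. As before, c_1 and c_2 are unspecified latency constants of the design, so the values used below are purely illustrative; only the trend towards a depth of 4 matters.

import math

def pipeline_depth(N, c1=3, c2=20):
    # Equation (3): P = ceil((3N + c1*ceil(log2 N) + c2) / N).
    # c1 and c2 are placeholder values; the paper does not state them numerically.
    return math.ceil((3 * N + c1 * math.ceil(math.log2(N)) + c2) / N)

# As N grows the ceiling settles at 4, the asymptote shown in Fig. 3.
for N in (8, 16, 32, 64, 145, 1024):
    print(N, pipeline_depth(N))

With these placeholder constants the printed depths fall from 7 at N = 8 to 4 by N = 64 and stay there, matching the observation that only a handful of independent problems are needed to keep the pipeline busy for large matrices.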
3.5. I/O Considerations

The major consideration with regard to I/O is to ensure the V^T V circuit will continually have input data. As a result of using single precision floating point representation (requiring 32 bits) and the limited off-chip I/O bandwidth in typical FPGA computing platforms, all elements of the A matrix cannot be loaded in parallel. Instead the A matrix is held in on-chip RAM (as shown in Fig. 2), organised as a parallel bank of RAMs, each storing a column of the matrix for P problems.

The A matrix for a given problem is re-used during each iteration and hence the I/O requirement is determined by the need to be able to load the set of A matrices into the FPGA for the next set of P problems within the time period to solve the first P problems. It is important to note that this method requires the RAMs to be twice as large as necessary for any single iteration (half of the RAM loads the next set of data whilst the other half is in use).

It can be shown that the maximum number of iterations for a given problem to reach a solution is N, but the method will generally converge before this worst case. Denoting the number of iterations executed as I (where I ≤ N) and considering the latency for one iteration, after matching for pipeline depth (Section 3.4), to be PN + P; then the total time available to load the data is I(PN + P). The total amount of data transferred is the A matrices, the two vectors b and x_0, and the final output vector x_out for P problems; a total of P(N^2 + 3N) elements. Thus the I/O requirement is given by equation (4a).

I/O Req = P(N^2 + 3N) / (I(PN + P)) words/cycle   (4a)
        ≈ N/I words/cycle   (4b)
        = (N/I) × 1.1 GBytes/s.   (4c)

In order to consider this in terms of available I/O technology, this is also shown as Bytes/second (4c), using the clock frequency (Section 4.2). While I is data dependent in general, this I/O bandwidth is, in our experiments, well below that provided by typical FPGA computing platforms, such as PCI-express (8 GBytes/s).

4. RESULTS

4.1. Resource usage

The circuit was placed and routed, targeted to the Virtex5 LX 330T. Fig. 5 shows the resource use in terms of DSP48Es, slices and BRAMs. The growth of slices and DSP48Es with matrix size is highly linear. This is to be expected, for the growth in floating point units is linear (equation (1)), and this design is dominated by floating point components.

Fig. 5. Plot of percentage Resource Usage (Slice Registers, DSP48Es and BRAMs) on a Virtex 5 LX 330T for increasing Matrix Order.

The BRAM usage, as seen in Fig. 5, grows in a piecewise linear fashion, with occasional jumps. This is a result of storing the N columns of the A matrix for P problems (which translates to N RAMs each storing PN elements) dominating the BRAM use. Thus linear growth is caused by the number of columns increasing with N, whilst the large jumps occur when PN exceeds the physical sizes of the BRAM. Together, this corresponds to a quadratic growth asymptotically, but for orders up to 145, this is not significant and is only seen as three jumps. The reason the BRAM usage is not monotonically increasing with matrix order is due to the decreasing pipeline depth (3) reducing the number of A matrices that must be stored.

4.2. Performance

The maximum clock frequency reported after place and route is approximately 250 MHz for small matrix orders (N ≤ 16), after which the speed slowly degrades with N, approximately in a linear fashion, to about 170 MHz for the largest matrix orders. This high performance is likely to be a result of the simple datapath as described in Section 3.2, combined with the deep pipelining allowing a large degree of retiming freedom, whilst the degradation is simply likely to be a result of the increased size of the circuit requiring increased wiring. Given this frequency, the maximum matrix order of 145, the number of floating point operators given in equation (1) and the efficiency of 96% (Fig. 4), it is possible for this circuit to achieve a sustained performance of 53 GFLOPS.

As has been demonstrated in Sections 3.3 and 3.4, this hardware implementation involves significant parallelism to reduce the latency of the iteration, and also works upon multiple problems in a pipelined fashion.
In order to quantify this improvement, the performance of the hardware is compared to the peak theoretical performance of a software implementation [15]. The performance metric is MINRES iterations per second.

The software model is based upon the peak theoretical floating point performance of a Pentium IV running at 3.0 GHz (6 GFLOPS) [15], applied to the number of floating point operations given in equation (5), which is found by a simple operation count of the algorithm described in Fig. 1. The hardware model assumes a pipeline depth given by equation (3) and an operational frequency given by the place and route results.

#Floating Point Operations = 2N^2 + 15N + 14.   (5)

A comparison of these two models is shown in Fig. 6. This shows that as a result of the parallelism reducing the latency, the performance is greater than a software implementation even for orders as low as N = 7. Due to the increased efficiency and parallelism, this performance improvement grows to almost an order of magnitude over the peak theoretical performance of a software implementation.

Fig. 6. Comparison of Hardware and Software Performance (MINRES iterations per second against matrix order, for the CPU theoretical peak and the FPGA place and route results).

5. CONCLUSION

This paper has demonstrated that the MINRES algorithm can be used effectively as a means to solve a system of linear equations in hardware. It has discussed in detail several design decisions and described how significant parallelism has been used to reduce the latency of the circuit by a factor of N, and how, through pipelining, it is possible to achieve an efficiency which will tend to 100%, with values of 96% achieved in practice. Finally, it has described a circuit using these considerations that exhibits a sustained performance of 53 GFLOPS, which is superior to previous work and predicts a performance improvement of nearly an order of magnitude compared to the peak theoretical software implementation.

6. REFERENCES

[1] K. Underwood, "FPGAs vs. CPUs: trends in peak floating-point performance," in Proc. Int. Symp. Field Programmable Gate Arrays, 2004, pp. 171-180.

[2] P. Salmela, A. Happonen, A. Burian, and J. Takala, "Several approaches to fixed-point implementation of matrix inversion," Proc. Int. Symp. Signals, Circuits and Systems, vol. 2, pp. 497-500, July 2005.

[3] G. de Matos and H. Neto, "On reconfigurable architectures for efficient matrix inversion," Proc. Int. Conf. Field Programmable Logic and Applications, pp. 369-374, Aug. 2006.

[4] K. Turkington, K. Masselos, G. Constantinides, and P. Leong, "FPGA based acceleration of the LINPACK benchmark: A high level code transformation approach," Proc. Int. Conf. Field Programmable Logic and Applications, pp. 375-380, Aug. 2006.

[5] X. Wang and M. Leeser, "Efficient FPGA implementation of QR decomposition using a systolic array architecture," in Proc. 16th Int. Symp. Field Programmable Gate Arrays, 2008, p. 260.

[6] L. Zhuo and V. K. Prasanna, "High performance linear algebra operations on reconfigurable systems," in Proc. ACM/IEEE Conf. Supercomputing. Washington, DC, USA: IEEE Computer Society, 2005, p. 2.

[7] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, "64-bit floating-point FPGA matrix multiplication," in Proc. 13th Int. Symp. Field-Programmable Gate Arrays, 2005, pp. 86-95.

[8] J. Nocedal and S. J. Wright, Numerical Optimization. New York, USA: Springer, 1999.

[9] G. R. Morris, V. K. Prasanna, and R. D. Anderson, "A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer," in Proc. 14th IEEE Symp. Field-Programmable Custom Computing Machines, 2006, pp. 3-12.

[10] A. Lopes and G. Constantinides, "A high throughput FPGA-based floating point conjugate gradient implementation," to appear in Proc. Applied Reconfigurable Computing, 2008.

[11] C. C. Paige and M. A. Saunders, "Solution of sparse indefinite systems of linear equations," SIAM J. Numerical Analysis, vol. 12, no. 4, pp. 617-629, Sept. 1975.

[12] K. V. Ling, J. M. Maciejowski, and B. F. Wu, "Multiplexed model predictive control," Proc. 16th IFAC World Congress, July 2005.

[13] G. H. Golub and C. F. Van Loan, Matrix Computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.

[14] B. Fischer, Polynomial Based Iteration Methods for Symmetric Linear Systems. Wiley-Teubner, 1996.

[15] J. J. Dongarra, "Performance of various computers using standard linear equations software." [Online]. Available: www.netlib.org/benchmark/performance.pdf, accessed 10/03/2008.
