…definition. Using off-chip RAM to hold the intermediate results enables larger matrices to be held, but the I/O requirements to load data onto the FPGA create a bottleneck (typically determined by the number of off-chip RAMs) upon performance and efficiency, typically leading to low speedups, as shown in Table 1 for [3] and [4]. The embedded memories on modern FPGA devices, however, now have the capacity to buffer large matrices on chip, a technique we exploit to break this bottleneck, at the cost of limiting the order of matrix to at most 145. This order of matrix is considered relatively large for dense problems and is sufficient for many applications which depend upon matrix inversion [12], and could be used as a building block for solving larger systems. Moreover, it is 2 to 12 times larger than previous dense on-chip solvers.
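As a rough sizing sketch (assuming 32-bit single precision words, an assumption rather than a figure quoted in this excerpt, and counting only the matrix and two vectors of one problem), a dense system of the largest supported order fits comfortably in on-chip memory:

# Back-of-the-envelope storage estimate for buffering one dense problem on chip.
# Assumption: 32-bit single precision words (not a figure quoted above).
N = 145                       # largest matrix order supported by the design
WORD_BYTES = 4                # 32-bit word
matrix_words = N * N          # dense N x N matrix A
vector_words = 2 * N          # right-hand side b and solution x
total_bytes = (matrix_words + vector_words) * WORD_BYTES
print(f"{matrix_words + vector_words} words, about {total_bytes / 1024:.0f} KiB per problem")
# Roughly 83 KiB, well within the block RAM capacity of the larger FPGAs of
# this generation, which is what makes on-chip buffering viable.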
The main contributions of this paper are: circuit complexity for the ability to solve a wider class of
• A demonstration of the suitability of the MINRES algorithm for use on an FPGA,

• An analysis of the design decisions and trade-offs involved to create an optimum floating point implementation of the MINRES algorithm in hardware, including efficiency and pipeline depth,

• A design for solving multiple dense systems of linear equations in a pipeline for orders up to 145 using the MINRES algorithm, with results demonstrating a sustained performance, taking into account I/O overhead, of up to 53 GFLOPS.
This paper describes the MINRES algorithm and briefly highlights both the advantages and the additional complexities over its closest relative, the Conjugate Gradient algorithm, in Section 2. The hardware implementation is described in Section 3, along with its associated results in Section 4. Finally, Section 5 concludes this paper.

2. MINRES ALGORITHM
The MINRES algorithm finds a (potentially approximate) solution x_k to the system of equations Ax = b (where A is an N × N matrix, and x and b are N × 1 vectors) by performing a minimisation of the residual ||b − Ax_k||_2 in the two-norm over a Krylov subspace [13]. It will generally converge to a very accurate solution without the need to calculate the entire subspace and hence the subspace is built iteratively, using the Lanczos process [11]. Overall, the pseudo code is described in Fig. 1, with the major sections of the algorithm highlighted.
% Initialisation
v_0 = 0 ;  v_1 = b − A x_0
β_1 = ||v_1||_2
η = β_1
γ_0 = 1 ;  γ_1 = 1
σ_0 = 0 ;  σ_1 = 0
w_0 = 0 ;  w_{−1} = 0
i = 1

while η > ε
    % Calculate Lanczos Vectors
    v_i = v_i / β_i
    α_i = v_i^T A v_i
    v_{i+1} = A v_i − α_i v_i − β_i v_{i−1}
    β_{i+1} = ||v_{i+1}||_2

    % Calculate QR Factors
    δ = γ_i α_i − γ_{i−1} σ_i β_i
    ρ_1 = sqrt(δ^2 + β_{i+1}^2)
    ρ_2 = σ_i α_i + γ_{i−1} γ_i β_i
    ρ_3 = σ_{i−1} β_i

    % Calculate New Givens Rotations
    γ_{i+1} = δ / ρ_1
    σ_{i+1} = β_{i+1} / ρ_1
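To make the data flow of Fig. 1 concrete, the following NumPy sketch expresses the same iteration in software. It is purely illustrative and is not the hardware datapath of Section 3; the solution-update lines at the end of the loop follow the standard MINRES recurrence of Paige and Saunders [11] rather than the portion of Fig. 1 reproduced above, and the function signature and stopping test are choices made for the sketch.

import numpy as np

def minres_sketch(A, b, x0=None, tol=1e-8, max_iter=None):
    # Software sketch of the iteration in Fig. 1 (MINRES, Paige & Saunders [11]).
    # A must be symmetric (possibly indefinite). The w/x/eta update below is the
    # standard MINRES solution update, which is not part of the excerpt above.
    n = A.shape[0]
    max_iter = max_iter if max_iter is not None else 5 * n
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)

    # Initialisation
    v_old = np.zeros(n)                     # v_{i-1}
    v = b - A @ x                           # v_1 = b - A x_0
    beta = np.linalg.norm(v)                # beta_1 = ||v_1||_2
    eta = beta
    gamma, gamma_old = 1.0, 1.0             # gamma_i, gamma_{i-1}
    sigma, sigma_old = 0.0, 0.0             # sigma_i, sigma_{i-1}
    w, w_old = np.zeros(n), np.zeros(n)     # w_{i-1}, w_{i-2}

    for _ in range(max_iter):
        if abs(eta) <= tol:
            break
        # Calculate Lanczos vectors
        v = v / beta                        # v_i = v_i / beta_i
        Av = A @ v
        alpha = v @ Av                      # alpha_i = v_i^T A v_i
        v_new = Av - alpha * v - beta * v_old
        beta_new = np.linalg.norm(v_new)    # beta_{i+1}

        # Calculate QR factors
        delta = gamma * alpha - gamma_old * sigma * beta
        rho1 = np.hypot(delta, beta_new)    # sqrt(delta^2 + beta_{i+1}^2)
        rho2 = sigma * alpha + gamma_old * gamma * beta
        rho3 = sigma_old * beta

        # Calculate new Givens rotations
        gamma_old, gamma = gamma, delta / rho1
        sigma_old, sigma = sigma, beta_new / rho1

        # Update solution (standard MINRES step, not shown in the excerpt)
        w, w_old = (v - rho3 * w_old - rho2 * w) / rho1, w
        x = x + gamma * eta * w
        eta = -sigma * eta

        v_old, v, beta = v, v_new, beta_new
    return x

The vector scaling v_i / β_i in this sketch is the operation that Section 3.2 later implements as a single inversion followed by multiplications.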
The Conjugate Gradient Method can also be interpreted as an algorithm that makes use of the Lanczos process and therefore there are some similarities between the two methods [13]. For cases where it is desirable to compare hardware implementations of these two methods, it is important to highlight the two major differences in terms of hardware costs, both of which are a result of working with the two-norm. Firstly, normalisation is required, resulting in square root operations. Secondly, it results in a three-term recurrence as opposed to the two-term recurrence in the Conjugate Gradient algorithm; this increases storage requirements. Thus the MINRES algorithm trades an increase in circuit complexity for the ability to solve a wider class of problems.

3. IMPLEMENTATION

The optimal hardware implementation is dependent upon a number of factors: the number of resources, latency (in terms of cycles per iteration), throughput and efficiency (in terms of the amount of time resources are in use). The design described aims to achieve a good balance in terms of optimising these factors. The following sections detail the main considerations and potential trade-offs between these factors and justify any major decisions. To aid the description in this section, a diagram of the circuit is shown in Fig. 2.

3.1. Floating Point Units

Xilinx Core Generator was used to generate the floating point components for the circuit. This environment enables the user to trade latency for maximum clock frequency. Provided the pipeline remains full, a component with a higher latency potentially has a higher throughput, due to the increased clock frequency.

It is possible to maximise the amount of time the pipeline is full by multiplexing several independent problems into the device. As a result, it was chosen to set all floating point components to work at their maximum latency, since an implementation with a high throughput that could potentially operate on multiple problems simultaneously would generally be more useful than a small reduction in latency for a single problem. To this end, it was assumed that in all situations there would be sufficient problems available to the circuit to fill the pipeline. This assumption is valid in many problems requiring matrix inversion, for example in the control community [12]. The number of independent problems required to keep the pipeline busy is discussed in Section 3.4, where it is shown that it approaches 4 for large problems.
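As a toy illustration of this trade-off (the latencies and clock frequencies below are hypothetical placeholders, not Core Generator or place-and-route figures), a deeper pipeline raises the sustained rate as long as independent operations are available every cycle, while a single chain of dependent operations pays the full latency:

# Hypothetical numbers only: they illustrate the latency/frequency trade-off,
# not the actual floating point cores used in the design.

def sustained_results_per_s(f_clk_hz):
    # A fully pipelined unit accepts one new operation per cycle,
    # so its sustained throughput is set by the clock alone.
    return f_clk_hz

def dependent_chain_seconds(f_clk_hz, latency_cycles, chain_length):
    # When each operation needs the previous result, the pipeline cannot be
    # kept full by that problem alone and the full latency is paid each time.
    return chain_length * latency_cycles / f_clk_hz

for label, f_clk, latency in [("shallow pipeline, slower clock", 150e6, 4),
                              ("deep pipeline, faster clock   ", 250e6, 12)]:
    print(f"{label}: {sustained_results_per_s(f_clk) / 1e6:6.1f} M results/s, "
          f"100-op dependent chain takes "
          f"{dependent_chain_seconds(f_clk, latency, 100) * 1e6:5.2f} us")

Interleaving independent problems, as described above, converts the dependent-chain situation into the fully pipelined one, which is why the maximum-latency configuration was chosen.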
3.2. Datapath

In order to ensure the clock frequency is as close as possible to the maximum available for the floating point components, it was decided, whenever possible, to use a dedicated component to perform each operation, as this minimises wiring and multiplexers between the floating point components. The exception to this is for expensive operations. The square root and division operators consume a large number of resources, whilst, as seen in the pseudo code (Fig. 1), they are only used during four operations. Furthermore, if the vector division is calculated by a single inversion (to compute 1/β and 1/ρ_1) followed by multiplication by the result, these two operators are only used twice per iteration per problem. Therefore it is both possible and desirable to re-use these components.
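A small software analogue of this reuse (the function name and operation counter are illustrative, not part of the circuit) shows why one inversion plus N multiplications replaces N divisions:

# Illustration of the division-reuse idea: compute a reciprocal once and scale
# by multiplication, so the expensive divider is used once per vector rather
# than once per element.

def scale_vector(v, denom, ops):
    inv = 1.0 / denom              # single use of the divider (e.g. 1/beta or 1/rho_1)
    ops["div"] += 1
    out = [x * inv for x in v]     # N multiplications, which pipeline cheaply
    ops["mul"] += len(v)
    return out

ops = {"div": 0, "mul": 0}
v = [3.0, 4.0, 0.0]
beta = sum(x * x for x in v) ** 0.5     # ||v||_2 = 5.0
v_hat = scale_vector(v, beta, ops)
print(v_hat, ops)                       # [0.6, 0.8, 0.0] {'div': 1, 'mul': 3}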
3.3. Parallelisation

3.4. Pipelining

Fig. 6. Comparison of Hardware and Software Performance. (Horizontal axis: Order of Matrix (N).)

…compared to the peak theoretical performance of a software implementation [15]. The performance metric is MINRES iterations per second.

The software model is based upon the peak theoretical floating point performance of a Pentium IV running at 3.0 GHz (6 GFLOPS) [15], applied to the number of floating point operations given in equation (5), which is found by a simple operation count of the algorithm described in Fig. 1. The hardware model assumes a pipeline depth given by equation (3) and an operational frequency given by the place and route results.

#Floating Point Operations = 2N^2 + 15N + 14.        (5)
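For reference, the software model can be evaluated directly from the quoted numbers; this sketch uses only the 6 GFLOPS peak figure and equation (5), and omits the hardware model because equation (3) is not reproduced in this excerpt.

# Software-model estimate: peak 6 GFLOPS applied to the operation count of
# equation (5). Purely illustrative; the hardware model requires equation (3).
PEAK_FLOPS = 6e9  # Pentium IV at 3.0 GHz, peak theoretical rate [15]

def flops_per_iteration(n):
    return 2 * n * n + 15 * n + 14          # equation (5)

def sw_iterations_per_second(n):
    return PEAK_FLOPS / flops_per_iteration(n)

for n in (7, 16, 64, 145):
    print(f"N = {n:3d}: {sw_iterations_per_second(n):10.3e} iterations/s (software model)")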
A comparison of these two models is shown in Fig. 6. This shows that, as a result of the parallelism reducing the latency, the performance is greater than a software implementation even for orders as low as N = 7. Due to the increased efficiency and parallelism, this performance improvement grows to almost an order of magnitude over the peak theoretical performance of a software implementation.
5. CONCLUSION

This paper has demonstrated that the MINRES algorithm can be used effectively as a means to solve a system of linear equations in hardware. It has discussed in detail several design decisions and described how significant parallelism has been used to reduce the latency of the circuit by a factor of N, and how, through pipelining, it is possible to achieve an efficiency which tends to 100%, with values of 96% achieved in practice. Finally, it has described a circuit using these considerations that exhibits a sustained performance of 53 GFLOPS, which is superior to previous work and represents a predicted performance improvement of nearly an order of magnitude compared to the peak theoretical performance of a software implementation.

6. REFERENCES

[2] P. Salmela, A. Happonen, A. Burian, and J. Takala, “Several approaches to fixed-point implementation of matrix inversion,” Proc. Int. Symp. Signals, Circuits and Systems, vol. 2, pp. 497–500, July 2005.

[3] G. de Matos and H. Neto, “On reconfigurable architectures for efficient matrix inversion,” Proc. Int. Conf. Field Programmable Logic and Applications, pp. 369–374, Aug. 2006.

[4] K. Turkington, K. Masselos, G. Constantinides, and P. Leong, “FPGA based acceleration of the LINPACK benchmark: A high level code transformation approach,” Proc. Int. Conf. Field Programmable Logic and Applications, pp. 375–380, Aug. 2006.

[5] X. Wang and M. Leeser, “Efficient FPGA implementation of QR decomposition using a systolic array architecture,” in Proc. 16th Int. Symp. Field Programmable Gate Arrays, 2008, p. 260.

[6] L. Zhuo and V. K. Prasanna, “High performance linear algebra operations on reconfigurable systems,” in Proc. ACM/IEEE Conf. Supercomputing, Washington, DC, USA: IEEE Computer Society, 2005, p. 2.

[7] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, “64-bit floating-point FPGA matrix multiplication,” in Proc. 13th Int. Symp. Field-Programmable Gate Arrays, 2005, pp. 86–95.

[8] J. Nocedal and S. J. Wright, Numerical Optimization. New York, USA: Springer, 1999.

[9] G. R. Morris, V. K. Prasanna, and R. D. Anderson, “A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer,” in Proc. 14th IEEE Symp. Field-Programmable Custom Computing Machines, 2006, pp. 3–12.

[10] A. Lopes and G. Constantinides, “A high throughput FPGA-based floating point conjugate gradient implementation,” to appear in Proc. Applied Reconfigurable Computing, 2008.

[11] C. C. Paige and M. A. Saunders, “Solution of sparse indefinite systems of linear equations,” SIAM J. Numer. Anal., vol. 12, no. 4, pp. 617–629, Sept. 1975.

[12] K. V. Ling, J. M. Maciejowski, and B. F. Wu, “Multiplexed model predictive control,” Proc. 16th IFAC World Congress, July 2005.

[13] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: Johns Hopkins University Press, 1996.

[14] B. Fischer, Polynomial Based Iteration Methods for Symmetric Linear Systems. Wiley-Teubner, 1996.

[15] J. J. Dongarra, “Performance of various computers using standard linear equations software.” [Online]. Available: www.netlib.org/benchmark/performance.pdf, accessed on 10/03/2008.