General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform
Matthias Christen, Olaf Schenk, Member, IEEE, and Helmar Burkhart, Member, IEEE
Abstract: We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify e.g. the matrix-matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlop/s on the desktop for large matrices. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.

Index Terms: GPGPU, graphics processing units, sparse matrix decomposition, sparse direct solvers, large-scale nonlinear optimization
I. INTRODUCTION

Graphics processing units (GPUs) have evolved into a very attractive hardware platform for general-purpose computations due to their extremely high floating-point processing performance, huge memory bandwidth, and their comparatively low cost [1]. The rapid evolution of GPUs in performance, architecture, and programmability can provide application potential beyond their primary purpose of graphics processing. High-end GPUs [2] or the STI Cell processors [3], which are integrated in the Sony PlayStation 3, typically deliver performance at least one order of magnitude higher than that of the CPU, while at the same time being equipped with up to 1 GB of GPU main memory. This commodity graphics hardware can become a cost-effective, highly parallel platform to solve scientific problems. In this paper we present an extensive matrix algorithmic performance study on GPUs using the novel NVIDIA CUDA technology platform to build general-purpose sparse matrix building blocks. Following the current trend to perform computationally intensive operations on a specialized processor rather than on the CPU, we will use a GPU as a mathematical co-processor to accelerate sparse direct linear solvers [4], [5], [6]. Our stream computing unit is based on the NVIDIA GeForce 8800, which features a scalable ultra-threaded architecture, high-performance parallel processing on 128 shader processors, and is equipped with 768 MB of on-board memory. Our primary goal is to investigate the performance acceleration of dense and sparse matrix solution kernels. These matrix linear algebra algorithms are of importance and represent fundamental kernels in many computationally intensive scientific applications
such as nonlinear optimization, computer tomography, geophysical seismic modelling, semiconductor device simulations, and the solution of partial differential equations in general. The performance of all these applications relies heavily on the availability of fast sparse matrix solution kernel routines on CPUs or GPUs.

There are many algorithms for factoring large sparse linear systems of equations. Since the early 1990s, it has been clear that exploiting cache memories is crucial for achieving high performance in sparse matrix factorization [7], [8]. The key is to group consecutive columns with identical nonzero structure together in order to exploit cache memories in sparse matrix factorizations. Multifrontal and supernodal codes have been developed that can effectively exploit the memory hierarchies of cache-based microprocessors. With the right data structure, the vast majority of floating-point operations can be performed within highly tuned BLAS 3 operations [9], such as the matrix-matrix multiplication routines, and near-peak performance can be expected on modern architectures. For a recent detailed survey on sparse matrix techniques for large linear systems of equations, the interested reader should consult [10].

A. Contributions

We map two fundamental computational kernels as general-purpose sparse matrix building blocks onto the GPU: a sparse direct linear factorization method for nonsymmetric and symmetric indefinite matrices based on the PARDISO framework [4], [5], [6], and an interior-point optimization solver for large-scale nonconvex PDE-constrained optimization [11]. Both are workhorses of physical modeling and optimization applications. We analyze their performance on NVIDIA's GeForce 8800 in realistic large-scale applications.

B. Organization

The remainder of the paper is organized as follows: In Section II we present a brief overview of related work. In Section III we investigate the parallel performance on GPUs of several dense matrix computational kernels that arise in sparse matrix factorizations. We then present our algorithmic design to parallelize sparse matrix factorizations on the GPU and discuss strategies to optimize the GPU performance in Section IV. In Section V we use the GPU to accelerate large-scale nonconvex interior-point optimization problems that arise e.g. in PDE-constrained optimization.

II. RELATED WORK

A. Scientific Computations on GPUs

There have been several GPU-parallel implementations and investigations of dense matrix kernels on emerging multiprocessing architectures such as GPUs and the STI Cell processor.
M. Christen, O. Schenk and H. Burkhart are with the Computer Science Department of the University of Basel, Basel, Switzerland.
Galoppo et al. [12] analyzed the peak performance of a cache- and bandwidth-efficient GPU solver for dense matrix decompositions. Dongarra et al. [13] propose to exploit single-precision operations whenever possible and resort to double precision at critical stages, while attempting to provide the full double-precision results, and present results for dense matrices on the IBM Cell processor. Several GPU-based algorithms for sparse matrix multiplication on emerging architectures have been proposed, e.g. in [14], [15], [16], [17]. These are all iterative methods such as conjugate gradient and multigrid solvers, as described e.g. by Göddeke et al. [14]. Our work is related in spirit to these frameworks for linear algebra kernels and nonlinear optimizations on these architectures. However, there is little research on using GPUs for large-scale sparse direct factorizations or interior-point methods for nonlinear problems, although these two algorithmic kernels represent workhorses in scientific computations.

III. BASIC LINEAR ALGEBRA KERNELS ON CPUS AND GPUS

NVIDIA provides a highly tuned library containing basic linear algebra routines, CUBLAS. CUBLAS is a high-level API designed for compatibility with the original FORTRAN subprograms. It is built solely on top of CUDA. The entire set of single-precision real BLAS routines [9] is available through CUBLAS, as well as some single-precision complex functions. We performed benchmarks for the computationally intensive routines of interest with respect to the sparse direct linear solver PARDISO [4], [5], [6], namely the matrix-matrix multiplication (sgemm), the solution of a triangular matrix equation with multiple right-hand sides (strsm), and the LU and LDL^T decompositions (sgetrf, ssytrf). Unfortunately, the factorization routines, being LAPACK rather than BLAS routines, are not part of CUBLAS. We have therefore implemented them ourselves using existing CUBLAS functions. Although not fine-tuned, this implementation yields reasonable results for large matrices. We have compared the performance results with benchmarks using the Intel Math Kernel Library (MKL) done on a dual-core 3.4 GHz Intel Pentium D CPU, which has 16 KB of L1 cache and 2 MB of L2 cache.

A. Matrix-Matrix Multiplication

Fig. 1 and Fig. 2 show the results of the 32-bit sgemm benchmark. The performance of an optimized CPU version and two versions of the CUDA implementation of the matrix-matrix multiplication A · B has been measured, where A ∈ R^{m×k} and B ∈ R^{k×n}. The two plots in Fig. 1 display the performance results of the CPU using the Intel MKL 9.1 library, whereas the plots in Fig. 2 show the GPU performance results. The first GPU performance measurement, which is displayed in the first row, includes the data transfer to the GPU before the multiplication as well as the data transfer back to the CPU system after the computation. The second measurement, found in the bottom row, omits GPU data transfers. For the benchmarks, m and n were varied, whereas k was fixed at 50 for the two plots in the left column and at 4096 for the plots in the right column. The parameters m and n vary along the horizontal and vertical axes, respectively. The color shades correspond to the performance. Note that the two columns and rows use different color scales.
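As an illustration of how the two measurement modes can be obtained, the following sketch times a single-precision multiplication on the GPU once including and once excluding the host-device transfers. It uses the modern cuBLAS v2 API rather than the original CUBLAS interface available at the time; the matrix sizes, the dummy data, and the absence of error checking are simplifications for illustration only.

/* Sketch: time C = A*B (single precision) on the GPU, with and without
 * host<->device transfers, as in the two measurement modes of Fig. 2.
 * Uses the cuBLAS v2 API; error handling is omitted for brevity. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int m = 2000, n = 2000, k = 4096;      /* illustrative sizes */
    const float alpha = 1.0f, beta = 0.0f;
    size_t szA = (size_t)m * k, szB = (size_t)k * n, szC = (size_t)m * n;

    float *hA = malloc(szA * sizeof *hA), *hB = malloc(szB * sizeof *hB),
          *hC = malloc(szC * sizeof *hC);
    for (size_t i = 0; i < szA; ++i) hA[i] = 1.0f;   /* dummy data */
    for (size_t i = 0; i < szB; ++i) hB[i] = 1.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, szA * sizeof *dA);
    cudaMalloc((void**)&dB, szB * sizeof *dB);
    cudaMalloc((void**)&dC, szC * sizeof *dC);

    cublasHandle_t h;
    cublasCreate(&h);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    /* Measurement including transfers (top row of Fig. 2). */
    cudaEventRecord(t0, 0);
    cublasSetMatrix(m, k, sizeof(float), hA, m, dA, m);
    cublasSetMatrix(k, n, sizeof(float), hB, k, dB, k);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);
    cublasGetMatrix(m, n, sizeof(float), dC, m, hC, m);
    cudaEventRecord(t1, 0); cudaEventSynchronize(t1);
    float ms_with; cudaEventElapsedTime(&ms_with, t0, t1);

    /* Measurement excluding transfers (bottom row of Fig. 2). */
    cudaEventRecord(t0, 0);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);
    cudaEventRecord(t1, 0); cudaEventSynchronize(t1);
    float ms_without; cudaEventElapsedTime(&ms_without, t0, t1);

    double flops = 2.0 * m * n * k;
    printf("with transfers:    %.1f GFLOP/s\n", flops / ms_with    / 1e6);
    printf("without transfers: %.1f GFLOP/s\n", flops / ms_without / 1e6);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}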
The plots show that for large matrices the GPU outperforms the CPU by nearly an order of magnitude: the CPU performs the multiplication at a relatively constant rate of 12 GFLOP/s, while the GPU reaches a performance of more than 100 GFLOP/s for large matrices when omitting data transfers. Unfortunately, multiplying matrices sized 200 × 200 and below has the contrary effect. The multiplication of small matrices is carried out faster by the CPU, even if no data transfer to and from the GPU system is involved. This is due to the fact that the multiplication on the GPU is carried out by many threads in parallel, which require some startup overhead.

B. Triangular Matrix Equation with Multiple Right-Hand Sides

Another important dense matrix routine for sparse direct linear solvers is the LAPACK strsm method. In this section we evaluate the GPU performance of strsm for quadratic matrices. The CUBLAS implementation achieves an impressive performance of up to 70 GFLOP/s for large matrices. Again, the CPU performs at a relatively constant rate of circa 10 GFLOP/s and hence the GPU strsm is almost an order of magnitude faster than the CPU strsm for large quadratic matrices. The plot in Fig. 3 again shows two measurement versions of the computation performance on the GPU, once without taking data transfers into account and once including the GPU up- and downloads, which entails a constant performance penalty of circa 10 GFLOP/s. Also, two versions of the CPU implementation of the solver are depicted, the upper curve being the performance of a single-precision (32-bit) solver and the lower curve that of a double-precision (64-bit) solver. A performance gain of a factor of 2 is achieved when the precision is reduced from 64 to 32 bit. The plot at the right of Fig. 3 is a zoom into the lower left corner of the plot at the left to identify the computational cross-over point. It shows that for matrix sizes as small as 150 × 150 the GPU outperforms the CPU in our hardware configuration.

C. Dense Linear Factorization Solvers

As already mentioned, sgetrf is not part of CUBLAS. The following benchmark has been done using a one-to-one translation of the original LAPACK FORTRAN code to a code using existing CUBLAS functions. As in LAPACK, both blocked and non-blocked versions have been implemented. Fig. 4 displays the time used by the constituents of the blocked LU decomposition that are computationally most intensive. Also, the total time consumed by the factorization is shown. The horizontal axis represents the block size, the vertical axis shows the time used by the decomposition. The blocked LU decomposition consists of three main computational components: the matrix-matrix multiplication (sgemm), the triangular solve (strsm), and a rank-1 update A := x yᵀ + A (sger). The figure suggests that nearly no time (in relation to the other constituents) is consumed by strsm, which is the bottom-most curve. For small blocks, many matrix-matrix multiplications are required, which then dominate the calculation. As the block size increases, fewer matrix-matrix multiplications are required, and with larger matrix blocks the performance of the multiplication also increases, as outlined in Section III-A. The time used by sger increases almost linearly with the block size. Other function calls to BLAS level 1 routines are omitted in the plot as they consume a constant amount of time.
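The structure of such a blocked LU decomposition, expressed directly in terms of CUBLAS calls, is sketched below. This is our own schematic host-side code, not the benchmarked routine: partial pivoting and error handling are omitted, the unblocked panel step is reduced to its sscal/sger core, and the modern cuBLAS v2 API is used.

/* Schematic blocked LU factorization of an n-by-n matrix stored on the GPU
 * (column-major, leading dimension lda), built from cuBLAS v2 calls only.
 * Partial pivoting is omitted for clarity; the LAPACK routine the text
 * refers to applies row interchanges in the panel step. */
#include <cublas_v2.h>

static void panel_lu_unblocked(cublasHandle_t h, int m, int nb,
                               float *dA, int lda)
{
    /* Unblocked LU of the m-by-nb panel: scale the pivot column, then apply
     * a rank-1 update (sger), the BLAS-2 part that grows with block size. */
    for (int j = 0; j < nb; ++j) {
        float piv;
        cublasGetVector(1, sizeof(float), dA + j + j * lda, 1, &piv, 1);
        float inv = 1.0f / piv, minus_one = -1.0f;
        if (j + 1 < m) {
            cublasSscal(h, m - j - 1, &inv, dA + (j + 1) + j * lda, 1);
            if (j + 1 < nb)
                cublasSger(h, m - j - 1, nb - j - 1, &minus_one,
                           dA + (j + 1) + j * lda, 1,
                           dA + j + (j + 1) * lda, lda,
                           dA + (j + 1) + (j + 1) * lda, lda);
        }
    }
}

void blocked_lu(cublasHandle_t h, int n, int nb, float *dA, int lda)
{
    const float one = 1.0f, minus_one = -1.0f;
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;

        /* 1. Factor the current panel A(j:n, j:j+jb). */
        panel_lu_unblocked(h, n - j, jb, dA + j + j * lda, lda);

        if (j + jb < n) {
            /* 2. Triangular solve for the block row of U (strsm). */
            cublasStrsm(h, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                        jb, n - j - jb, &one,
                        dA + j + j * lda, lda,
                        dA + j + (j + jb) * lda, lda);

            /* 3. Rank-jb update of the trailing submatrix (sgemm), the
             *    BLAS-3 kernel that reaches >100 GFLOP/s on the GPU. */
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                        n - j - jb, n - j - jb, jb,
                        &minus_one,
                        dA + (j + jb) + j * lda, lda,
                        dA + j + (j + jb) * lda, lda,
                        &one,
                        dA + (j + jb) + (j + jb) * lda, lda);
        }
    }
}

The sketch makes the roles of the three kernels visible: sger dominates the panel work, strsm computes the block row of U, and sgemm performs the large trailing update that benefits most from the GPU.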
Fig. 1. Performance measurements for 32-bit sgemm on the CPU for fixed k = 50 (left) and k = 4096 (right).
Fig. 2. Performance measurements for 32-bit sgemm on the GPU for fixed k = 50 (left) and k = 4096 (right). The top row includes the read/write data transfer from CPU main memory into GPU memory, whereas the bottom row assumes that all data can be stored in GPU memory.
Fig. 3. Performance of triangular solve [d/s]trsm on CPU and GPU for large (left; n < 4096) and small (right; n < 256) n × n matrices.
The shape of the curves in Fig. 4 is explained by the facts that the CUBLAS implementations of sgemm and strsm yield their best performance if the matrix sizes are divisible by 16, because of the GPU's memory organization; that for larger block sizes fewer matrix-matrix multiplications and solves are required, while the performance of the routines increases with matrix size; and that for all block sizes larger than a threshold value the rank-1 update, being a BLAS level 2 routine, dominates the calculation. In the case of factoring a 2048-by-2048 matrix, the CUBLAS implementation of sger merely reaches 1.3 GFLOP/s, whereas the performance of strsm is as high as 50 GFLOP/s. Choosing a fixed block size of 32-by-32 (Fig. 4), for matrices sized 3968 × 3968, a speed-up factor of 2.5 with respect to the optimized single-precision implementation on the CPU is reached, as depicted in Fig. 5.

D. Analysis and Computational Complexity

The performance measurements show that in general the GPU version of a linear algebra kernel performs well, i.e. outperforms the CPU, if the kernels are applied to large matrices, whereas in the case of small matrices the CPU yields better results than the GPU implementation. This fact suggests that, when using the GPU as a hardware accelerator, the CPU should be favored for computations involving small matrices. In the scatter plot in Fig. 6 the number of floating-point operations used for the matrix-matrix multiplication is plotted against the performance rate. The number of operations appears on the horizontal axis, while the performance is plotted on the vertical axis. The data has been taken from Fig. 1 and plotted in a different fashion in order to find the cross-over point where the performance of the GPU surpasses that of the CPU. There are three sets of data: the performance of the matrix-matrix multiplication on the GPU excluding and including data transfers, and the CPU's performance of sgemm. Each dot in the figure represents the sgemm performance of a matrix-matrix multiplication A · B, where A ∈ R^{m×k} and B ∈ R^{k×n} with m, n ≤ 240 and k = 120.
Fig. 4. Influence of block size for a 32-bit GPU factorization with our own CUDA-based sgetrf routine for a matrix of size n = 2000. The important components of sgetrf are: sger, sgemm and strsm.
Fig. 5. Performance of getrf on the GPU (with and without read/write data transfers) and on the CPU.
Fig. 6. Number of operations for a matrix-matrix multiplication plotted against the performance of CPU and GPU versions of the computation kernel.
The plot indicates that for our hardware and software configuration the cross-over point is at 7 × 10^6 floating-point operations if all sgemm calls are simply to be substituted by a corresponding function call involving the download of the matrices to the GPU, the multiplication on the GPU, and the retrieval of the result. If a more code-invasive approach is chosen and the data happens to be already present in GPU memory at the time of the multiplication, the lower threshold value of 2.4 × 10^6 operations can be chosen for optimal results. The cross-over point for strsm lies below the matrix size of 150 × 150. Fig. 5 suggests that the CUBLAS-enhanced version of sgetrf should only be used if the matrices become as large as 1600 × 1600. In order to further improve the performance of the dense solver, a proper CUDA kernel should be implemented and optimized.

IV. GENERAL-PURPOSE SPARSE DIRECT LINEAR SOLVERS ON THE GPU

In this section we concentrate on enhancing the parallel direct solver PARDISO [4], [5], [6] with GPU code, specifically by using the CUBLAS library. PARDISO is a sparse direct solver for large systems of linear equations, which has been developed at the University of Basel and is part of the Intel Math Kernel Library [4]. According to [18] it is currently among the fastest direct solvers available for sparse linear systems. PARDISO uses a supernodal approach to the matrix factorization. A supernode is a collection of adjacent matrix columns such that the sparsity pattern below the diagonal block is identical in each column. In order to arrive at a matrix structure partitioned into supernodes, a permutation matrix is applied to the original matrix. The motivation to use supernodes is twofold. Firstly, supernodes can be treated as dense matrices scattered into the matrix to be factored, which makes it possible to use highly optimized, dense high-performance BLAS level 3 routines. The second observation is on the algorithmic level: while eliminating variables from a sparse system it is desirable to maintain the sparsity of the system, i.e. to minimize fill-in. One strategy for sparse matrices is to use the minimum degree algorithm, which picks the variables to eliminate by the degree of the node (i.e. the number of adjacent nodes) corresponding to the variable when the matrix is viewed as the adjacency matrix of a graph. It can be shown that the minimum degree algorithm assigns the same priority to all of the rows, viewed as nodes of the graph, belonging to a supernode. Once the supernode structure is given by means of a permutation matrix, the basic structure of the supernodal factorization algorithm is as follows. The algorithm in PARDISO implements a left-looking factorization procedure. The matrices involved in the algorithm are schematically depicted in Fig. 7 and Fig. 8. In fact, the supernodes Li, Ui are sparse block matrices with, hopefully, many zero rows. By omitting all the zero rows, the blocks can be thought of as dense matrices with which level 3 linear algebra operations can be performed. The main work consists of three computationally intensive dense linear algebra operations: the matrix-matrix multiplications d/sgemm in lines 3 and 4, the LU decomposition d/sgetrf in line 6, and two triangular solves with d/strsm in lines 8 and 9 of the algorithm in Fig. 8.
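To make this structure concrete, the following schematic C fragment (our own simplification, not PARDISO's code) shows the dense kernels of one left-looking step, under the assumption that the gather/scatter between the compressed supernode panels and their global row indices has already been handled by the symbolic part of the solver.

/* Schematic left-looking supernode step, mirroring Fig. 8 but simplified.
 * We assume the caller has already gathered matching rows/columns, so the
 * update can be applied as a plain dense operation. */
#include <cblas.h>
#include <lapacke.h>

/* Apply the update of one factored supernode k to the current supernode j
 * (lines 3-4 of Fig. 8):  A_j := A_j - L_k * U_k. */
void supernode_update(int nrows, int ncols, int k,
                      const float *Lk,   /* nrows x k               */
                      const float *Uk,   /* k x ncols               */
                      float *Aj)         /* nrows x ncols, col-major */
{
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                nrows, ncols, k,
                -1.0f, Lk, nrows, Uk, k,
                 1.0f, Aj, nrows);
}

/* Factor supernode j itself (lines 6-9 of Fig. 8): LU of the ncols x ncols
 * diagonal block with partial pivoting, then a triangular solve for the
 * rows below it.  Aj is nrows x ncols with nrows >= ncols.  (Application
 * of the pivots to the remaining columns of U_j is omitted here.) */
void supernode_factor(int nrows, int ncols, float *Aj, lapack_int *ipiv)
{
    /* LU decomposition of the diagonal block (d/sgetrf, line 6). */
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, ncols, ncols, Aj, nrows, ipiv);

    /* Off-diagonal rows: L_21 := A_21 * U_11^{-1} (d/strsm, lines 8-9). */
    if (nrows > ncols)
        cblas_strsm(CblasColMajor, CblasRight, CblasUpper,
                    CblasNoTrans, CblasNonUnit,
                    nrows - ncols, ncols, 1.0f,
                    Aj, nrows,            /* U_11 in the top block */
                    Aj + ncols, nrows);   /* A_21 below it         */
}

In the real solver, the update of supernode j accumulates contributions from all previously factored supernodes to its left before the factorization step is performed.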
A. GPGPU Strategy I: Sparse Factorization using CUBLAS Kernels

As a minimally invasive approach to enhancing the sparse solver with a GPU, we identified all the computationally intensive linear algebra routines in the solver core schematically outlined in Fig. 8, and replaced them by code that switches to the GPU equivalent when the number of floating-point operations exceeds the threshold value of 7 × 10^6 determined in the previous section. This small effort alone resulted in a considerable speedup of the numerical solution process: an acceleration of up to a factor of 7 compared to the 64-bit CPU can be achieved using the 32-bit GPU on the CUDA platform.
B. GPGPU Strategy II: Mapping Linear Algebra Kernels to the GPU

The other extreme of strategy I is to adapt the entire numerical factorization code, i.e. the sparse LU decomposition, to the GPU. In this case, ideally the entire matrix would be transferred to the GPU once and be retrieved by the CPU system once the LU or LDL^T decomposition is completed. This, however, works only for matrices with relatively few non-zero elements in the triangular factors, since the GPU memory of the GeForce 8800 is currently limited to 768 MB. A solution to this problem is to develop a hybrid strategy that transfers data to the GPU only when it is currently needed and keeps it on the GPU as long as possible for reuse. Recall from Fig. 7 and Fig. 8 that a supernode is updated by the supernodes to its left, which have potentially already been loaded from the CPU. When running out of memory, it should be examined whether supernodes currently loaded into GPU memory can be truncated (in the updating process not the entire supernodes are used) or whether certain supernodes have to be discarded. Preliminary tests have shown that this strategy could further speed up the solver considerably. The implementation, however, is still work in progress.
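One possible shape of such a hybrid scheme is sketched below as a small GPU-resident cache of supernode panels: a panel is uploaded the first time it is needed and kept on the device for reuse until a memory budget forces eviction. All names, the memory budget and the LRU eviction policy are our own illustrative assumptions; the actual work-in-progress implementation may differ.

/* Sketch of a GPU-resident cache for supernode panels (GPGPU strategy II):
 * upload on first use, keep for reuse, evict (least recently used) when the
 * device memory budget is exceeded.  Illustrative only. */
#include <cuda_runtime.h>
#include <stddef.h>

#define MAX_SUPERNODES   4096
#define GPU_BUDGET_BYTES ((size_t)600 * 1024 * 1024) /* headroom below 768 MB */

typedef struct {
    float  *dptr;       /* device copy, NULL if not resident */
    size_t  bytes;
    long    last_use;   /* for LRU eviction */
} panel_slot;

static panel_slot cache[MAX_SUPERNODES];
static size_t     resident_bytes = 0;
static long       clock_tick     = 0;

static int evict_one(void)
{
    int victim = -1;
    for (int i = 0; i < MAX_SUPERNODES; ++i)
        if (cache[i].dptr &&
            (victim < 0 || cache[i].last_use < cache[victim].last_use))
            victim = i;
    if (victim < 0) return 0;
    cudaFree(cache[victim].dptr);
    resident_bytes -= cache[victim].bytes;
    cache[victim].dptr = NULL;
    return 1;
}

/* Return a device pointer to supernode 'id', uploading it if necessary. */
float *get_panel_on_gpu(int id, const float *hptr, size_t bytes)
{
    cache[id].last_use = ++clock_tick;
    if (cache[id].dptr)                       /* already resident: reuse */
        return cache[id].dptr;

    while (resident_bytes + bytes > GPU_BUDGET_BYTES && evict_one())
        ;                                     /* make room if possible */

    cudaMalloc((void**)&cache[id].dptr, bytes);
    cudaMemcpy(cache[id].dptr, hptr, bytes, cudaMemcpyHostToDevice);
    cache[id].bytes = bytes;
    resident_bytes += bytes;
    return cache[id].dptr;
}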
Fig. 7. A step in the left-looking factorization algorithm used by PARDISO to factor structurally symmetric matrices: the supernodes Lj and Uj receive updates from Lk and Uk while being factored.

Fig. 8. Pseudocode of the supernodal left-looking factorization step: the updates of Lj and Uj with d/sgemm (lines 3 and 4), the LU decomposition of Lj with partial pivoting and the application of the pivots to Uj with d/sgetrf (line 6), and the triangular solves with d/strsm (lines 8 and 9).
C. GPU/CPU Sparse Factorization for Non-Symmetric Matrices

This section gives an overview of the non-symmetric matrices that are used for the numerical experiments. Some general information about the matrices is given in Table I. Those marked with (SD) in the column "source" are from semiconductor device simulation, those labeled with (UF) are from a public sparse matrix collection [19], and all other matrices are either from automobile crash simulation (AF) or electromagnetic wave simulation (EM). The table lists the number of unknowns in the matrix, the number of nonzero elements, and the number of floating-point operations in GFlops needed to factor the matrix.
TABLE I
General information and statistics of the nonsymmetric matrices used in the numerical experiments.

name          unknowns    elements     GFlops   source
iis-para-19    155,924    8,374,204       444   (SD)
iis-para-14    155,924    8,374,204       444   (SD)
para-10        155,924    2,094,873       336   (UF)
barrier2-9     115,625    2,158,759       198   (UF)
S-1300           7,800   40,560,000       123   (AF)
747-200k       198,098    3,043,006        72   (EM)
747-400k       402,902    6,231,526       246   (EM)
In all following numerical experiments we used the following hardware and software configuration: an Intel Pentium 4 CPU with 3.40 GHz and 2 MB L2 cache, an NVIDIA GeForce 8800 GPU, and the Intel MKL library version 9.1. The computational results are shown in the left graphic of Fig. 9. The figure shows the GFlop/s rate on the 64-bit CPU, on the 32-bit CPU, and on the 32-bit CPU using GPU strategy I. An acceleration of up to a factor of 4 can be achieved using the 32-bit GPU on the CUDA platform.

D. GPU/CPU Sparse Factorization for Symmetric Indefinite Matrices

This section gives an overview of the symmetric indefinite matrices that are used for the numerical experiments; general information about the matrices is given in Table II. Those marked with (FL) in the column "source" are from fluid dynamics, those labeled with (UF) are from the public sparse matrix collection [19], and those labeled with (CP) arise in computational physics.

TABLE II
General information and statistics of the symmetric indefinite matrices used in the numerical experiments.

name             unknowns    elements     GFlops   source
stokes1           307,995   10,900,171     3,956   (FL)
stokes2           505,940    8,609,590     1,805   (FL)
ldoor             952,203   23,737,339       125   (UF)
crankseg_1         52,804    5,333,507        49   (UF)
cop300k_kkt_b     373,990    4,950,222       404   (FL)
anderson-50       125,000      500,000       882   (CP)
af_shell9         504,855    9,046,865        71   (UF)
bmw3_2            227,362    5,757,996        51   (UF)
anderson-80       512,000    2,048,000     6,780   (CP)

The computational results are shown in the right graphic of Fig. 9. The figure shows the GFlop/s rate on the 64-bit CPU, on the 32-bit CPU, and on the 32-bit CPU using GPU strategy I. An acceleration of up to a factor of 7 can be achieved using the 32-bit GPU on the CUDA platform.
V. GENERAL-PURPOSE NONLINEAR INTERIOR-POINT OPTIMIZATION ON A GPU

Nonlinear programming is the problem of optimizing a nonlinear, nonconvex objective function subject to a set of nonlinear constraints, either equalities or inequalities. The standard form is to find a local solution of the optimization problem
$$\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{s.t.} \quad c(x) = 0, \quad x \ge 0. \tag{1}$$
The nonlinear programming problem is given by an objective function $f : \mathbb{R}^n \to \mathbb{R}$ and constraint functions $c : \mathbb{R}^n \to \mathbb{R}^m$, which are both assumed to be twice continuously differentiable. For simplicity of notation we assume, without loss of generality, that all variables have only a lower bound.
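For later reference, the standard quantities used in the interior-point iteration below can be spelled out as follows (this explicit form is our addition; it is consistent with the definitions given in the text): the Lagrangian of (1), the constraint Jacobian, and the Hessian of the Lagrangian are

$$\mathcal{L}(x, \lambda) = f(x) + \lambda^{T} c(x), \qquad A_k = \nabla c(x_k), \qquad W_k = \nabla^2_{xx}\, \mathcal{L}(x_k, \lambda_k).$$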
Fig. 9. Performance in GFlop/s for sparse factorization for nonsymmetric (left) and symmetric indefinite matrices (right) on the 64-bit CPU, the 32-bit CPU, and the 32-bit GPU using GPU strategy I.

TABLE III
Size of the nonlinear programming problem for the 3D PDE-constrained optimization problem with boundary control as a function of the discretization parameter N.

Problem size                                 N = 20   N = 30   N = 40
Variables with upper bounds (N^3)             8,000   27,000   64,000
Variables with lower/upper bounds (6N^2)      2,400    5,400    9,600
Variables with equality constraints (N^3)     8,000   27,000   64,000
In the interior-point approach the bound constraints are replaced by a logarithmic barrier term, and the resulting barrier problem

$$\min_{x \in \mathbb{R}^n} \; f(x) - \mu \sum_{i=1}^{n} \ln\!\left(x^{(i)}\right) \tag{2a}$$
$$\text{s.t.} \quad c(x) = 0 \tag{2b}$$

is solved to increasingly tighter tolerances, while the barrier parameter μ is driven to zero. Eventually, a series of large-scale symmetric indefinite Karush-Kuhn-Tucker (KKT) linear systems
$$\begin{pmatrix} W_k + \delta_x I & A_k^{T} \\ A_k & -\delta_c I \end{pmatrix}
\begin{pmatrix} \Delta x_k \\ \Delta \lambda_k \end{pmatrix}
= - \begin{pmatrix} \nabla \varphi_{\mu}(x_k) + A_k \lambda_k \\ c(x_k) \end{pmatrix} \tag{3}$$

have to be solved during the optimization process. Here A_k = ∇c(x_k) denotes the gradient of the constraints and W_k denotes the Hessian of the Lagrangian function for (1) with respect to x. In the following, we will use graphics processing units to solve the series of symmetric indefinite KKT matrices that arise in interior-point optimization as implemented in IPOPT [11], which is a primal-dual interior-point software package for large-scale nonlinear programming. We will use a 64-bit CPU, a 32-bit CPU, and a 32-bit GPU version of the sparse direct solver PARDISO [5], [6], which is one of the sparse direct solvers integrated in IPOPT, to factor the KKT matrices.

As a large-scale nonlinear programming example we choose a nonlinear PDE-constrained optimization problem with Neumann boundary conditions. The domain Ω = (0,1) × (0,1) × (0,1) is represented by a three-dimensional cube, and the goal is to compute the optimal boundary control u(x) and state y(x) with respect to x = (x_1, x_2, x_3) that minimizes the objective function

$$f(y, u) = \frac{1}{2} \int_{\Omega} \bigl(y(x) - y_t(x)\bigr)^2 \, dx \; + \; \ldots$$

subject to the state equation and boundary conditions

$$\text{on } \Omega: \quad -\Delta y(x) - y(x) + y(x)^3 - y_d(x) = 0, \qquad y_d(x) = 2.7 + 2\bigl(x_1(x_1-1) + x_2(x_2-1) + x_3(x_3-1)\bigr), \tag{4}$$

$$\text{on } \partial\Omega: \quad \partial_n y(x) = u(x), \qquad 1.8 \le u(x) \le 2.5, \qquad u_d(x) = 0. \tag{5}$$
These equations represent a simplified Ginzburg-Landau model for superconductivity in the absence of internal magnetic fields, and the state y represents the wave function [20]. We use a second-order finite-difference approximation to discretize Eqn. 4 and Eqn. 5, and the size of the nonlinear programming problem as a function of the discretization parameter N is shown in Table III. Figure 10 shows the performance of our optimized GPU-parallel sparse direct solver PARDISO for various discretization sizes N in comparison to a 32-bit and a 64-bit CPU version. It is clearly visible that for larger-scale optimization problems a speedup of up to a factor of 6.5 can be achieved, while at the same time computing the correct local solution of the optimization problem.
Fig. 10. Nonlinear interior-point optimization speedup against 64-bit CPU with 32-bit CPU and 32-bit GPU using GPU strategy I.
VI. CONCLUSION

We have reported on our experience with using graphics processing units as fast co-processors for general-purpose sparse matrix building blocks. We have shown that a minimally invasive approach to instrumenting a direct solver for large sparse systems of linear equations already results in a considerable speed-up. The approach consists of identifying computationally intensive operations and replacing them by their GPU counterparts, which, compared to a CPU, typically perform better by one order of magnitude. We have applied the GPGPU-enhanced KKT solver to a nonlinear PDE-constrained optimization problem and have seen that, with increasing problem dimension, our example optimization problem achieved a speed-up of up to a factor of 6.5 compared to the original 64-bit CPU implementation. This demonstrates that current commodity GPUs are powerful yet inexpensive devices that can act as fast numerical co-processors for sparse matrix scientific applications.

REFERENCES
[1] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, "A survey of general-purpose computation on graphics hardware," Computer Graphics Forum, Blackwell Publishing, March 2007, pp. 80-113.
[2] AMD Stream Processor: http://ati.amd.com/products/streamprocessor/index.html.
[3] The Cell project at IBM Research: http://www.research.ibm.com/cell/.
[4] Intel Math Kernel Library 9.1 Sparse Solvers: http://www.intel.com/cd/software/products/asmo-na/eng/266853.htm.
[5] O. Schenk and K. Gärtner, "Solving unsymmetric sparse systems of linear equations with PARDISO," Journal of Future Generation Computer Systems, vol. 20, no. 3, pp. 475-487, 2004.
[6] O. Schenk and K. Gärtner, "On fast factorization pivoting methods for symmetric indefinite systems," Electr. Trans. Num. Anal., vol. 23, no. 1, pp. 158-179, 2006.
[7] E. Rothberg and A. Gupta, "Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations," in Supercomputing '90. ACM-IEEE, 1990.
[8] E. Ng and B. Peyton, "Block sparse Cholesky algorithms on advanced uniprocessor computers," SIAM Journal on Scientific Computing, vol. 14, pp. 1034-1056, 1993.
[9] J. Dongarra, J. DuCroz, I. Duff, and S. Hammarling, "A set of level 3 basic linear algebra subprograms," ACM Transactions on Mathematical Software, vol. 16, no. 1, pp. 1-28, 1990.
[10] T. Davis, Direct Methods for Sparse Linear Systems. SIAM, 2006.
[11] A. Wächter and L. T. Biegler, "On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming," Mathematical Programming, vol. 106, no. 1, pp. 25-57, 2006.
[12] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, "LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware," ACM/IEEE SC 2005 Conference (SC'05), 2005, p. 3.
[13] J. Kurzak and J. Dongarra, "Implementation of the mixed-precision high performance LINPACK benchmark on the CELL processor," University of Tennessee Computer Science, Tech. Rep. UT-CS-06-580, LAPACK Working Note 177, September 2006.
[14] D. Göddeke, R. Strzodka, and S. Turek, "Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations," International Journal of Parallel, Emergent and Distributed Systems, vol. 22, no. 4, pp. 221-256, Aug. 2007.
[15] J. Krüger and R. Westermann, "Linear algebra operators for GPU implementation of numerical algorithms," ACM Transactions on Graphics (TOG), vol. 22, no. 3, pp. 908-916, 2003.
[16] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the GPU: conjugate gradients and multigrid," in SIGGRAPH '05: ACM SIGGRAPH 2005 Courses. New York, NY, USA: ACM Press, 2005, p. 171.
[17] S. W. Williams, L. Oliker, R. Vuduc, K. Yelick, J. Demmel, and J. Shalf, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms," in Proceedings of Supercomputing '07, Nov. 2007.
[18] N. Gould, Y. Hu, and J. Scott, "A numerical evaluation of sparse direct solvers for the solution of large sparse, symmetric linear systems of equations," ACM Transactions on Mathematical Software (TOMS), vol. 33, no. 2, 2007.
[19] T. Davis, University of Florida Sparse Matrix Collection, University of Florida, Gainesville, http://www.cise.ufl.edu/davis/sparse/.
[20] K. Ito and K. Kunisch, "Augmented Lagrangian-SQP methods for nonlinear optimal control problems of tracking type," SIAM J. Optimization, vol. 6, pp. 96-125, 1996.
Matthias Christen received his M.S. degree in Mathematics from the University of Basel, Switzerland, in 2006. Currently he is a Ph.D. student in Computer Science at the University of Basel. His research interests are in computational science with emphasis on simulation and optimization, as well as high-performance computing.
Olaf Schenk received his M.S. degree in applied mathematics and computer science from the University of Karlsruhe, Germany, in 1996 and the Ph.D. degree in technical sciences from the Swiss Federal Institute of Technology (ETH), Zürich, Switzerland, in 2000. Since 2001 he has been a Research Associate at the Computer Science Department of the University of Basel, Switzerland. His general research interests are in high-performance computing and computational science. He is in particular interested in the solution of large-scale problems that involve information and communication technologies on high-performance computing architectures.
Helmar Burkhart has been a Computer Science Professor at the University of Basel since 1987. He received a diploma in computer science from the University of Stuttgart, Germany, and a PhD degree and Venia Legendi (Habilitation) from the Swiss Federal Institute of Technology (ETH) Zurich, Switzerland. He has held several positions such as President of the Swiss Informatics Society SI / Swiss Chapter of the ACM (1990-92), member of the expert group Swiss Priority Programme in Informatics Research (1991-96), and co-founder and board member of the SPEEDUP association. His research interests include parallel and distributed processing, web technologies, and e-learning.