
World Academy of Science, Engineering and Technology
International Journal of Mathematical and Computational Sciences
Vol:3, No:9, 2009

High Performance Computing Using Out-of-Core Sparse Direct Solvers

Mandhapati P. Raju and Siddhartha Khaitan
Abstract—In-core memory requirement is a bottleneck in solving large three-dimensional Navier-Stokes finite element formulations using sparse direct solvers. An out-of-core solution strategy is a viable alternative for reducing the in-core memory requirement while solving large-scale problems. This study evaluates the performance of various out-of-core sequential solvers based on multifrontal or supernodal techniques in the context of finite element formulations for three-dimensional problems on a Windows platform. Three different solvers, HSL_MA78, MUMPS and PARDISO, are compared. The performance of these solvers is evaluated on a 64-bit machine with 16 GB RAM for a finite element formulation of flow through a rectangular channel. It is observed that relatively large problems can be solved using the out-of-core PARDISO solver. The implementation of the Newton and modified Newton iterations is also discussed.

Keywords—Out-of-core, PARDISO, MUMPS, Newton.

Mandhapati P. Raju is currently with General Motors Inc., Warren, MI 48093 USA (phone: 586-986-1365; e-mail: [email protected]).
Siddhartha Khaitan is with Iowa State University, Ames, IA 50011 USA (e-mail: [email protected]).

I. INTRODUCTION

The use of sparse direct solvers in the context of finite element discretization of the Navier-Stokes equations for three-dimensional problems is limited by their huge memory requirement. Nevertheless, direct solvers are preferred because of their robustness. The development of sparse direct solvers based on algorithms such as the multifrontal [1] and supernodal [2] methods has significantly reduced the memory requirements compared to traditional frontal solvers [3]. The superior performance of multifrontal solvers has been demonstrated for different CFD applications [4]-[7] and also in power system simulations [8]-[10]. It has been identified [4]-[7] that the memory requirement is a bottleneck in solving large three-dimensional CFD problems. There are different viable alternatives for overcoming the huge memory requirement. One alternative is to run on a 64-bit machine having large RAM. The second alternative is to use an out-of-core solver, in which the factors are written to disk, thereby minimizing the in-core requirement. The third alternative is to use parallel solvers in a distributed computing environment, where the memory is distributed amongst the different processors. Recent efforts by the authors show that, by using a 64-bit machine with 16 GB RAM, relatively large problems can be handled in-core. However, as the problem size increases, the in-core memory requirement quickly exceeds 16 GB. Increasing the amount of RAM is very expensive. On the other hand, out-of-core solvers can handle very large problems with smaller in-core memory requirements. The disadvantage of using out-of-core solvers is that the computational time increases due to the I/O operations on the disk. In the recent past there has been a lot of research on reducing the I/O time and making out-of-core solvers efficient. The capability and performance of out-of-core solvers in the context of a finite element Navier-Stokes code are assessed in this paper. Three state-of-the-art out-of-core solvers, MUMPS, HSL_MA78 and PARDISO, are evaluated. To the best of the authors' knowledge, no such comparison of the performance of out-of-core solvers has been reported in the literature.

MUMPS [11]-[13] is a parallel direct solver with out-of-core functionality and is available in the public domain. PARDISO [2], [14]-[17] also has an out-of-core solver and is available as part of the Intel Math Kernel Library [18]. HSL_MA78 [19], an out-of-core solver, is available as part of HSL 2007, which is available free of charge to UK researchers. An evaluation version of HSL_MA78 is used in this paper.

In finite element Navier-Stokes formulations, the set of linear equations generated from the primitive-variable formulation usually has a coefficient matrix with zero diagonal entries. The penalty formulation yields non-zero diagonal entries, but it is observed that these diagonal entries are a few orders of magnitude smaller than the off-diagonal entries. Iterative solution methods fail or pose severe convergence problems for such ill-conditioned matrices. Although iterative solvers are memory efficient, the resolution of their convergence issues is not straightforward and results in a lack of robustness. The performance of a suite of iterative solvers is compared with that of the out-of-core direct solvers to demonstrate the superiority of direct solvers.

II. MATHEMATICAL FORMULATION

A benchmark rectangular channel flow problem is chosen for evaluating the out-of-core solvers. The governing equations for laminar flow inside a rectangular channel are presented below in non-dimensional form. In the three-dimensional calculations, the penalty approach is used instead of the primitive-variable formulation to reduce the memory requirements.

\frac{\partial \hat{u}}{\partial \hat{x}} + \frac{\partial \hat{v}}{\partial \hat{y}} + \frac{\partial \hat{w}}{\partial \hat{z}} = 0,   (1)

\frac{\partial}{\partial \hat{x}}(\hat{u}^2) + \frac{\partial}{\partial \hat{y}}(\hat{u}\hat{v}) + \frac{\partial}{\partial \hat{z}}(\hat{u}\hat{w}) = \lambda \frac{\partial}{\partial \hat{x}}\left(\frac{\partial \hat{u}}{\partial \hat{x}} + \frac{\partial \hat{v}}{\partial \hat{y}} + \frac{\partial \hat{w}}{\partial \hat{z}}\right) + \frac{\partial}{\partial \hat{x}}\left(\frac{2}{Re}\frac{\partial \hat{u}}{\partial \hat{x}}\right) + \frac{\partial}{\partial \hat{y}}\left(\frac{1}{Re}\left(\frac{\partial \hat{u}}{\partial \hat{y}} + \frac{\partial \hat{v}}{\partial \hat{x}}\right)\right) + \frac{\partial}{\partial \hat{z}}\left(\frac{1}{Re}\left(\frac{\partial \hat{u}}{\partial \hat{z}} + \frac{\partial \hat{w}}{\partial \hat{x}}\right)\right),   (2)

\frac{\partial}{\partial \hat{x}}(\hat{u}\hat{v}) + \frac{\partial}{\partial \hat{y}}(\hat{v}^2) + \frac{\partial}{\partial \hat{z}}(\hat{v}\hat{w}) = \lambda \frac{\partial}{\partial \hat{y}}\left(\frac{\partial \hat{u}}{\partial \hat{x}} + \frac{\partial \hat{v}}{\partial \hat{y}} + \frac{\partial \hat{w}}{\partial \hat{z}}\right) + \frac{\partial}{\partial \hat{x}}\left(\frac{1}{Re}\left(\frac{\partial \hat{u}}{\partial \hat{y}} + \frac{\partial \hat{v}}{\partial \hat{x}}\right)\right) + \frac{\partial}{\partial \hat{y}}\left(\frac{2}{Re}\frac{\partial \hat{v}}{\partial \hat{y}}\right) + \frac{\partial}{\partial \hat{z}}\left(\frac{1}{Re}\left(\frac{\partial \hat{v}}{\partial \hat{z}} + \frac{\partial \hat{w}}{\partial \hat{y}}\right)\right),   (3)

and

\frac{\partial}{\partial \hat{x}}(\hat{u}\hat{w}) + \frac{\partial}{\partial \hat{y}}(\hat{v}\hat{w}) + \frac{\partial}{\partial \hat{z}}(\hat{w}^2) = \lambda \frac{\partial}{\partial \hat{z}}\left(\frac{\partial \hat{u}}{\partial \hat{x}} + \frac{\partial \hat{v}}{\partial \hat{y}} + \frac{\partial \hat{w}}{\partial \hat{z}}\right) + \frac{\partial}{\partial \hat{x}}\left(\frac{1}{Re}\left(\frac{\partial \hat{u}}{\partial \hat{z}} + \frac{\partial \hat{w}}{\partial \hat{x}}\right)\right) + \frac{\partial}{\partial \hat{y}}\left(\frac{1}{Re}\left(\frac{\partial \hat{v}}{\partial \hat{z}} + \frac{\partial \hat{w}}{\partial \hat{y}}\right)\right) + \frac{\partial}{\partial \hat{z}}\left(\frac{2}{Re}\frac{\partial \hat{w}}{\partial \hat{z}}\right),   (4)

where \hat{u}, \hat{v}, \hat{w} are the components of velocity, Re is the bulk flow Reynolds number and \lambda is the penalty parameter. Velocities are non-dimensionalized with respect to the inlet velocity and the coordinates are non-dimensionalized with respect to the channel length.

The boundary conditions are prescribed as follows:

(1) Along the channel inlet: \hat{u} = 1; \hat{v} = 0; \hat{w} = 0.   (5)

(2) Along the channel exit: \frac{\partial \hat{u}}{\partial \hat{x}} = 0; \frac{\partial \hat{v}}{\partial \hat{x}} = 0; \frac{\partial \hat{w}}{\partial \hat{x}} = 0.   (6)

(3) Along the walls: \hat{u} = 0; \hat{v} = 0; \hat{w} = 0.   (7)

The flow Reynolds number is taken as 50 to simulate laminar flow inside the channel.
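For readers unfamiliar with the penalty formulation, the \lambda-terms in (2)-(4) stand in for the pressure gradient: in the standard penalty method (a textbook result, stated here for completeness rather than taken from this paper) the pressure is recovered from the velocity field as

\hat{p} = -\lambda \left( \frac{\partial \hat{u}}{\partial \hat{x}} + \frac{\partial \hat{v}}{\partial \hat{y}} + \frac{\partial \hat{w}}{\partial \hat{z}} \right),

so the pressure degrees of freedom are eliminated from the discrete system. This is what reduces the memory requirement relative to the primitive-variable formulation, and it is also why the assembled matrices become ill conditioned for large \lambda, as noted in the introduction.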

III. NUMERICAL FORMULATION

The Galerkin finite element method (GFEM) is used for the discretization of the above penalty-based Navier-Stokes equations. Three-dimensional brick elements are used, and the velocity components are interpolated trilinearly. The nonlinear system of equations obtained from the GFEM is solved by Newton's method. Let X^{(i)} be the vector of field unknowns available at the i-th iteration. Then the update for the (i+1)-th iteration is obtained as

X^{(i+1)} = X^{(i)} + \alpha\, \delta X^{(i)},   (8)

where \alpha is an under-relaxation factor and \delta X^{(i)} is the correction vector obtained by solving the linearized system

[J]^{(i)} \{\delta X^{(i)}\} = -\{R_X\}^{(i)}.   (9)

Here, [J]^{(i)} is the Jacobian matrix for the i-th iteration,

[J]^{(i)} = \frac{\partial \{R_X\}^{(i)}}{\partial X^{(i)}},   (10)

and \{R_X\}^{(i)} is the residual vector. Newton's iteration is continued until the infinity norm of the correction vector \delta X^{(i)} converges to a prescribed tolerance of 10^{-10}. A modified Newton method is also used in this study. For modified Newton, eq. (9) is modified as shown in eq. (11):

[J]^{(0)} \{\delta X^{(i)}\} = -\{R_X\}^{(i)}.   (11)

In the modified Newton method the Jacobian is evaluated only during the first iteration. Consequently, the Jacobian is factorized only once. For all subsequent iterations, the same Jacobian (and hence its LU factors) is used repeatedly. This algorithm is referred to as modified Newton. Since factorization is the most expensive part of the computation, using the modified Newton algorithm allows the expensive factorization step to be skipped after the first iteration.
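The difference between eqs. (9) and (11) is only in which Jacobian is factorized. The following is a minimal sketch of the two update loops, assuming user-supplied residual(X) and jacobian(X) routines (illustrative names, not from the paper; jacobian is assumed to return a SciPy sparse matrix) and using SciPy's in-core sparse LU factorization merely as a stand-in for the out-of-core solvers discussed below:

import numpy as np
from scipy.sparse.linalg import splu

def newton(X, residual, jacobian, tol=1e-10, max_iter=50, alpha=1.0):
    """Full Newton: factorize J(X^(i)) at every iteration, eqs. (8)-(9)."""
    for _ in range(max_iter):
        lu = splu(jacobian(X).tocsc())      # expensive: new factorization each step
        dX = lu.solve(-residual(X))
        X = X + alpha * dX                  # eq. (8)
        if np.linalg.norm(dX, np.inf) < tol:
            break
    return X

def modified_newton(X, residual, jacobian, tol=1e-10, max_iter=200, alpha=1.0):
    """Modified Newton: factorize J(X^(0)) once and reuse the LU factors, eq. (11)."""
    lu = splu(jacobian(X).tocsc())          # single factorization
    for _ in range(max_iter):
        dX = lu.solve(-residual(X))         # only forward/back substitution per step
        X = X + alpha * dX
        if np.linalg.norm(dX, np.inf) < tol:
            break
    return X

With a direct solver the trade-off is exactly the one described above: full Newton converges quadratically in fewer iterations, while modified Newton replaces all but one factorization with much cheaper triangular solves.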
We can see that the discretization of the governing partial differential equations (1)-(4), together with the boundary conditions (5)-(7), by the GFEM scheme results in a set of nonlinear equations. The core of solving the resulting nonlinear equations is the solution of a sparse linear system (eq. (9)), which is the most computationally intensive part of the solver, both in terms of CPU time and memory requirement. Here three different out-of-core solvers, MUMPS, HSL_MA78 and PARDISO, are implemented and compared.

To gain maximum computational efficiency, the codes are optimized at three levels.

(a) The first is at the hardware level, by using the optimized Intel MKL BLAS library, which is highly tuned for Intel processors.

(b) The second level is the choice of an efficient state-of-the-art out-of-core solver. Three different out-of-core solvers are evaluated for their performance. The efficiency of an out-of-core solver depends not only on the factorization algorithm but also on the handling of the different I/O operations; for an out-of-core solver, the I/O operations can be a bottleneck depending on how they are performed. HSL_MA78 handles I/O efficiently using the virtual memory management package HSL_OF01, which facilitates reading and writing from direct-access files. Real and integer data have their own buffers, and each buffer can be associated with more than one direct-access file.

(c) The third level is the choice of an efficient algorithm for solving the system of non-linear equations. The choice of the non-linear algorithm can affect the rate of convergence and hence the computational time. The system of non-linear equations is solved using either Newton or Picard iteration. Newton iteration is quite popular and efficient due to its quadratic convergence behavior. If the initial guess is chosen properly, Newton iteration can give convergence in a few iterations. However, its limitation is that the formation of the Jacobian matrices involving derivatives is not always straightforward to compute. In addition, the choice of the initial guess will affect the convergence behavior.
Picard iteration, however, is more robust in terms of its flexibility in choosing an initial guess. Its rate of convergence is only linear, though, and hence it results in more computational time.

The choice of modified Newton or modified Picard can further reduce the computational time significantly. In the modified Newton method, the left hand side matrix is factorized only once and the factors are reused. Since factorization is the bottleneck, avoiding the factorization in the subsequent steps reduces the computational time. The rate of convergence is then no longer quadratic but linear. Nevertheless, there are significant savings in computational time. This paper discusses only the Newton and modified Newton implementations.

IV. RESULTS AND DISCUSSION

In this paper, flow inside a three-dimensional rectangular channel is considered. Three-dimensional brick finite elements are used for generating the grid. A weak Galerkin finite element formulation is used to discretize the Navier-Stokes equations to form a large set of non-linear equations. Newton's iteration is used to generate a set of linear algebraic equations. The matrices generated from such a discretization are usually very sparse, and hence a good sparse solver is used to reduce the computational effort. It is to be noted that for three-dimensional grids, the matrices generated are less sparse than those generated from a two-dimensional grid.

Typically, an interior node in a three-dimensional grid is connected to 27 nodes, including itself. Since there are 3 dof's at each node, a typical row consists of 81 non-zero entries; in a two-dimensional grid, a typical row consists of only 27 non-zero entries. This increases the frontal size considerably. Hence, solving three-dimensional problems using direct solvers is quite challenging both in terms of computational time and memory requirements. Large problems cannot be solved on a 32-bit machine using in-core techniques [4], [7]. This paper studies the performance of out-of-core direct solvers on a 64-bit machine with 16 GB RAM. All the computations are run on a Windows machine with an Intel Xeon processor.
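The problem sizes reported in Tables I-IX follow directly from this element layout. A small sketch of the bookkeeping (the helper name is illustrative, not from the paper; the factor-of-81 estimate applies to interior rows only, so the nonzero count is an upper bound):

def channel_problem_size(nex, ney, nez, dof_per_node=3):
    """Degrees of freedom and a rough nonzero estimate for a structured brick mesh."""
    nodes = (nex + 1) * (ney + 1) * (nez + 1)   # 8-node bricks: one node per grid vertex
    ndof = dof_per_node * nodes
    nnz_per_row = 27 * dof_per_node             # interior node couples to 27 nodes -> 81 entries
    return ndof, nnz_per_row * ndof             # upper-bound estimate of matrix nonzeros

print(channel_problem_size(30, 30, 30))    # (89373, ...)  matches the #dof's in Tables I-V
print(channel_problem_size(200, 80, 40))   # (2002563, ...) the ~2 million dof case in Table VIII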
Before comparing the various solvers for their relative performance, each individual solver is tuned for its optimal performance, specifically the choice of the ordering package. Each solver has inbuilt ordering packages, whose choice can affect the performance of the solver. In addition, there are other parameters, such as the pivot tolerance, which also affect the performance of the solver.

MUMPS solver

The sequential version of the out-of-core MUMPS solver is built on a 64-bit machine. The out-of-core option is invoked by setting the value of mumps_par%ICNTL(22) to 1. MUMPS has different inbuilt ordering packages (AMD [20], QAMD [21], AMF, PORD [22]). In addition, there is a provision to link METIS [23] as an external package. The memory relaxation is taken as 100%. The MUMPS out-of-core solver is used for all the cases. Table I shows the comparison of the performance of the various ordering methods. All the cases are run on a 30x30x30 mesh. The CPU time and memory for each ordering are compared; the CPU time reported is that of the first Newton iteration. The CPU time and memory requirement for the complete in-core solution are also included in brackets for a quick comparison. It is to be noted first that the out-of-core solution is around 3-5 times slower than the in-core solution. Of all the ordering packages, METIS gives the best results. Compared to AMD, METIS results in almost one-third of the floating point operations. The computational time and memory requirements are the lowest for the METIS ordering. The nested dissection algorithm of METIS is found to generate good orderings for three-dimensional meshes. Based on this result, METIS ordering is used for all subsequent runs with the MUMPS solver.

TABLE I: COMPARISON OF ORDERING METHODS FOR THE MUMPS SOLVER (30x30x30 GRID; IN-CORE VALUES IN BRACKETS)

Ordering   #dof's   CPU time (sec)    Memory (GB), in-core arrays   Memory (GB), out-of-core files
AMD        89373    438.4 (142.8)     1.4  (4.06)                   2
QAMD       89373    446.6 (142.75)    1.4  (4.04)                   2
AMF        89373    352   (105.7)     1.34 (3.48)                   1.67
PORD       89373    309   (86.6)      1.09 (3.18)                   1.5
METIS      89373    250.3 (55.01)     0.78 (3.02)                   1.28

HSL_MA78 solver

The HSL_MA78 solver does not have any internal ordering routines; however, the HSL package provides other routines that order the finite element entries to reduce the fill-in during factorization. HSL_MC68 is generally used for efficient ordering of finite element matrices. In addition, external ordering packages can be hooked to the HSL_MA78 solver; in this paper METIS ordering is also used by hooking the METIS library to the solver. Table II shows the effect of HSL_MC68 and METIS ordering on the CPU time and memory of the HSL_MA78 solver. It is found that METIS performs better than HSL_MC68. Hence METIS is used for all subsequent runs with the HSL_MA78 solver.

TABLE II: COMPARISON OF ORDERING METHODS FOR THE HSL_MA78 SOLVER (30x30x30 GRID; IN-CORE VALUES IN BRACKETS)

Ordering    #dof's   CPU time (sec)   Memory (GB), in-core arrays   Memory (GB), out-of-core files
HSL_MC68    89373    536 (524)        1.4  (6.88)                   3.5
METIS       89373    321 (318)        0.79 (4.72)                   2

PARDISO solver

PARDISO has minimum degree (MD) and METIS ordering hooked internally within the solver, and the user has the choice to use either of the ordering techniques. Table III compares the effect of MD and METIS ordering on the performance of the PARDISO solver. It is found that METIS performs better than the MD ordering technique. Hence METIS is used for all subsequent runs with the PARDISO solver.

TABLE III: COMPARISON OF ORDERING METHODS FOR THE PARDISO SOLVER (30x30x30 GRID; IN-CORE VALUES IN BRACKETS)

Ordering   #dof's   CPU time (sec)   Memory (GB), in-core arrays   Memory (GB), out-of-core files
MD         89373    284 (162)        0.5 (2.71)                    2.8
METIS      89373    97  (59)         0.3 (1.42)                    1.25

Table IV and Table V show the time split between the different phases of the solvers for the in-core and out-of-core solutions. Interestingly, the performance of the in-core and out-of-core HSL_MA78 solver is almost the same in terms of computational time. This may be because of the efficient I/O handling in the HSL_MA78 package: it uses virtual memory management through the HSL_OF01 package for the I/O operations. This strategy is found to be very effective in developing good out-of-core solvers. Although the out-of-core HSL_MA78 performs well in comparison to its in-core counterpart, its overall computation time is much larger than that of the other solvers. The MUMPS out-of-core solver is about 4 times slower than its in-core solver. The PARDISO out-of-core solver is around 1.6 times slower than its in-core solver. Another interesting observation is that out-of-core PARDISO has a relatively large solve phase compared to out-of-core MUMPS. PARDISO has a much smaller in-core memory requirement than MUMPS or HSL_MA78.

TABLE IV: COMPARISON OF TIME SPLIT FOR THE IN-CORE SOLUTION OF THE 30x30x30 GRID (-: NOT REPORTED)

In-core solver   Matrix assembly (s)   Analysis phase (s)   Numeric phase (s)   Solve phase (s)   Total time (s)   Memory (GB)
MUMPS            4                     1.51                 53.3                0.58              59.39            3.02
PARDISO          4                     2.19                 52.6                0.47              59.26            1.42
HSL_MA78         4                     0.64                 313.14              -                 317.78           4.72

TABLE V: COMPARISON OF TIME SPLIT FOR THE OUT-OF-CORE SOLUTION OF THE 30x30x30 GRID (-: NOT REPORTED)

Out-of-core solver   Matrix assembly (s)   Analysis phase (s)   Numeric phase (s)   Solve phase (s)   Total time (s)   In-core memory (GB)   Out-of-core files (GB)
MUMPS                4                     1.6                  243.2               1.45              250.25           0.78                  1.28
PARDISO              4                     2.2                  87.76               3.07              97.03            0.3                   1.25
HSL_MA78             4                     0.64                 314                 -                 318.64           0.79                  3.174

Tables VI-VIII show the performance of the out-of-core MUMPS, HSL_MA78 and PARDISO solvers for different grid sizes. The performance of the in-core solution is also presented for relative comparison. Table IV shows that the out-of-core MUMPS solver is more than 4 times slower than the in-core solver; this shows that the out-of-core implementation of the MUMPS solver is less efficient. Its in-core memory requirement, however, is kept low. Surprisingly, the out-of-core HSL_MA78 solver is very efficient with respect to the in-core solver: the computational times of the in-core and out-of-core implementations are almost the same, and the in-core memory of the out-of-core solver is kept low. The out-of-core memory requirement is larger for the HSL_MA78 solver than for the other two solvers.

TABLE VI: PERFORMANCE OF THE MUMPS SOLVER ON DIFFERENT GRID SIZES (OOC = OUT-OF-CORE; *: NOT RUN IN-CORE)

nex   ney   nez   #dof's    In-core CPU (min)   In-core mem (GB)   OOC CPU (min)   OOC in-core mem (GB)   OOC files (GB)
50    10    10    18513     0.047               0.23               0.31            0.075                  0.11
100   10    10    36663     0.089               0.52               0.63            0.12                   0.23
200   10    10    72963     0.177               1.1                1.25            0.2                    0.465
50    20    10    35343     0.14                0.65               0.82            0.17                   0.285
100   20    10    69993     0.278               1.4                1.74            0.27                   0.612
100   20    20    133623    1.127               3.87               5.31            0.72                   1.72
100   50    20    324513    6.4                 13.42              21.39           2.54                   6.13
100   50    50    788103    *                   *                  126.1           10.2                   24.1
50    20    20    67473     0.45                1.81               2.36            0.47                   0.78
50    50    10    85833     0.495               2.11               2.72            0.45                   0.925
50    50    20    163863    2.228               5.93               8.67            1.3                    2.64
50    50    50    397953    *                   *                  42.5            5                      10.12

TABLE VII: PERFORMANCE OF THE HSL_MA78 SOLVER ON DIFFERENT GRID SIZES (OOC = OUT-OF-CORE; *: NOT RUN IN-CORE)

nex   ney   nez   #dof's    In-core CPU (min)   In-core mem (GB)   OOC CPU (min)   OOC in-core mem (GB)   OOC files (GB)
50    10    10    18513     0.18                0.24               0.193           0.14                   0.32
100   10    10    36663     0.35                0.5                0.363           0.14                   0.625
200   10    10    72963     0.59                0.79               0.612           0.14                   1.15
50    20    10    35343     0.78                0.65               0.8             0.23                   0.84
100   20    10    69993     1.06                1.53               1.074           0.23                   1.475
100   20    20    133623    4.83                3.91               3.936           0.5                    3.54
100   50    20    324513    28.9                13.94              23.52           1.56                   12.67
100   50    50    788103    *                   *                  542.6           8.1                    49.95
50    20    20    67473     2.42                1.98               2.46            0.5                    1.93
50    50    10    85833     2.65                2.38               2.69            0.5                    2.37
50    50    20    163863    11.9                7.12               11.3            1.56                   6.45
50    50    50    397953    *                   *                  358             5.8                    24

Table VIII shows the performance of the PARDISO solver. The out-of-core solver is around two times slower than the in-core solver. Overall, the PARDISO out-of-core solver is much faster than the other two solvers, and its in-core memory requirement is kept very low. For a 100x50x50 mesh, out-of-core MUMPS requires around 10 GB of in-core memory, out-of-core HSL_MA78 requires around 8 GB of in-core memory, and out-of-core PARDISO requires around 4 GB of in-core memory. Hence PARDISO can solve much finer grids than the other two solvers. The finest grid sizes chosen to solve with PARDISO in this paper are 150x75x30 and 200x80x40, which consist of around 1 million and 2 million degrees of freedom, respectively. The 150x75x30 grid requires around 5.5 GB of in-core memory and around 31 GB of out-of-core memory; it takes around 172 minutes for one Newton iteration. The 200x80x40 grid requires around 10.5 GB of in-core memory and around 75 GB of out-of-core memory; one Newton iteration takes around 16.5 hours of CPU time. Thus we observe that out-of-core PARDISO can solve very large three-dimensional problems with direct solvers on a single desktop. Both in terms of computational time and memory requirement, PARDISO is found to be the best solver.

TABLE VIII: PERFORMANCE OF THE PARDISO SOLVER ON DIFFERENT GRID SIZES (OOC = OUT-OF-CORE; *: NOT RUN IN-CORE)

nex   ney   nez   #dof's     In-core CPU (min)   In-core mem (GB)   OOC CPU (min)   OOC in-core mem (GB)   OOC files (GB)
50    10    10    18513      0.038               0.08               0.067           0.087                  0.1
100   10    10    36663      0.073               0.25               0.15            0.17                   0.22
200   10    10    72963      0.157               0.57               0.304           0.34                   0.47
50    20    10    35343      0.105               0.29               0.209           0.17                   0.27
100   20    10    69993      0.243               0.69               0.515           0.337                  0.593
100   20    20    133623     1.073               1.92               1.758           0.665                  1.693
100   50    20    324513     6.517               6.62               9.5             1.65                   6.035
100   50    50    788103     *                   *                  114.2           4.1                    24.5
50    20    20    67473      0.432               0.85               0.731           0.33                   0.763
50    50    10    85833      0.457               1.02               0.84            0.42                   0.895
50    50    20    163863     2.233               2.86               3.4             0.83                   2.6
50    50    50    397953     17.650              10.67              25.4            2.03                   10.15
150   75    30    1067268    *                   *                  171.95          5.52                   31
200   80    40    2002563    *                   *                  993             10.5                   74.8

Correlations are generated for the CPU times and memory requirements of all the solvers with respect to the grid size. Equations (12)-(14) give the correlations for the MUMPS solver, (15)-(17) for HSL_MA78 and (18)-(20) for PARDISO. In these equations, T refers to the CPU time in minutes taken by the solver for one Newton iteration; it includes the time for the generation of the matrix, the analysis phase, the factorization phase and the solve phase. M refers to the memory requirement in gigabytes; the subscripts incore and outofcore refer to the in-core and out-of-core memory requirements of the out-of-core solver. nex, ney and nez refer to the number of grid elements in the x, y and z directions respectively, n refers to the total number of degrees of freedom, and ar_1 and ar_2 refer to the grid aspect ratios nex/ney and nex/nez.

MUMPS:
T = 3.56 \times 10^{-7}\, n^{1.447}\, ar_1^{-0.127}\, ar_2^{-0.127};\quad R^2 = 0.95   (12)
M_{incore} = 8.28 \times 10^{-7}\, n^{1.214}\, ar_1^{-0.197}\, ar_2^{-0.197};\quad R^2 = 0.98   (13)
M_{outofcore} = 2.11 \times 10^{-7}\, n^{1.377}\, ar_1^{-0.127}\, ar_2^{-0.127};\quad R^2 = 0.98   (14)

HSL_MA78:
T = 3.53 \times 10^{-8}\, n^{1.707}\, ar_1^{-0.362}\, ar_2^{-0.362};\quad R^2 = 0.7   (15)
M_{incore} = 2.6 \times 10^{-4}\, n^{0.736}\, ar_1^{-0.33}\, ar_2^{-0.33};\quad R^2 = 0.98   (16)
M_{outofcore} = 3.42 \times 10^{-6}\, n^{1.219}\, ar_1^{-0.155}\, ar_2^{-0.155};\quad R^2 = 0.99   (17)

PARDISO:
T = 4.07 \times 10^{-9}\, n^{1.757}\, ar_1^{-0.174}\, ar_2^{-0.174};\quad R^2 = 0.85   (18)
M_{incore} = 3.84 \times 10^{-6}\, n^{1.02}\, ar_1^{-0.01}\, ar_2^{-0.01};\quad R^2 = 0.99   (19)
M_{outofcore} = 1.39 \times 10^{-7}\, n^{1.407}\, ar_1^{-0.112}\, ar_2^{-0.112};\quad R^2 = 0.99   (20)

The correlations give an idea of how the solver requirements vary as the grid size is modified. The exponent of n is greater than 1 in most of the correlations, indicating that the CPU time and memory requirements generally increase superlinearly with the number of degrees of freedom. For the out-of-core solvers, the exponents of n in the CPU-time correlations are similar for all three solvers, with MUMPS having the lowest, around 1.45. The CPU time and memory requirement are not only a function of the number of degrees of freedom but also a function of the grid aspect ratios. The absolute values of the exponents of the aspect ratios are larger for the HSL_MA78 solver than for MUMPS and PARDISO; this indicates that the solver performance is also a strong function of the grid distribution. The memory requirement of an out-of-core solver consists of the in-core memory requirement (for holding the frontal matrices and other working arrays) and the out-of-core memory requirement (the LU factors written to disk); correlations are presented for both. An interesting observation is that the in-core memory requirement of PARDISO is the least amongst the three solvers, varies almost linearly with the number of degrees of freedom, and is almost independent of the grid distribution. This behavior of out-of-core PARDISO makes it very well suited for solving large three-dimensional finite element problems.
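The paper does not state how the correlations were obtained; a straightforward way to reproduce fits of this form is an ordinary least-squares fit in log space. The sketch below is an illustration of that procedure (not the authors' code), using a few out-of-core PARDISO rows from Table VIII as sample data:

import numpy as np

# (nex, ney, nez, n_dof, out-of-core CPU time in minutes) taken from Table VIII
rows = [
    (100, 20, 20, 133623, 1.758),
    (100, 50, 20, 324513, 9.5),
    (100, 50, 50, 788103, 114.2),
    (50,  50, 50, 397953, 25.4),
    (150, 75, 30, 1067268, 171.95),
]

# Model: T = C * n^a * ar1^b * ar2^b  with ar1 = nex/ney, ar2 = nex/nez.
# Taking logs gives a linear problem: log T = log C + a*log n + b*(log ar1 + log ar2).
A, y = [], []
for nex, ney, nez, n, T in rows:
    ar1, ar2 = nex / ney, nex / nez
    A.append([1.0, np.log(n), np.log(ar1) + np.log(ar2)])
    y.append(np.log(T))
(logC, a, b), *_ = np.linalg.lstsq(np.array(A), np.array(y), rcond=None)
print(f"C = {np.exp(logC):.3e}, exponent of n = {a:.3f}, aspect-ratio exponent = {b:.3f}")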

[Figure 1: semi-log plot of the correction norm ||δX||∞ (10^-2 down to 10^-10, left axis) and the cumulative CPU time in seconds (0-400, right axis) against the iteration number (0-7) for the Newton and modified Newton methods.]

Fig. 1 Comparison of CPU time and residual norm for the Newton and modified Newton methods on the 30x30x30 grid

Figure 1 shows the performance of the Newton and modified Newton algorithms using out-of-core PARDISO as the linear solver for a 30x30x30 grid. No under-relaxation is used for either the Newton or the modified Newton method. Newton's method converges in 4 iterations and quadratic convergence is observed. The modified Newton method converges in 6 iterations, and linear convergence is observed. Significant savings in computational time are achieved using the modified Newton method: the Newton iterations converge in 381 seconds, whereas the modified Newton iterations converge in 128 seconds.

TABLE IX: COMPARISON OF CPU TIMES FOR NEWTON AND MODIFIED NEWTON (OUT-OF-CORE PARDISO) FOR DIFFERENT GRIDS

nex   ney   nez   #dof's     Newton CPU time (min)   Modified Newton CPU time (min)
50    10    10    18513      0.34                    0.152
100   10    10    36663      0.8                     0.327
200   10    10    72963      1.78                    0.77
50    20    10    35343      0.99                    0.405
100   20    10    69993      2.35                    0.937
100   20    20    133623     7.65                    2.53
100   50    20    324513     39.51                   11.83
100   50    50    788103     468.1                   122.5
50    20    20    67473      3.27                    1.09
50    50    10    85833      3.57                    1.247
50    50    20    163863     14.4                    4.37
50    50    50    397953     106.4                   27.98
150   75    30    1067268    693                     181.8

Table IX shows the comparison of CPU times for the Newton and modified Newton methods for different grid sizes. It is clearly observed that the implementation of the modified Newton method leads to significant savings in computational time. Further, it is observed that as the number of degrees of freedom increases, the percentage of computational savings of the modified Newton method over the Newton method increases.
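The trend noted above can be read off Table IX directly; for instance, computing the relative saving of modified Newton over full Newton for a few grids (times copied from the table):

# (n_dof, Newton CPU time (min), modified Newton CPU time (min)) from Table IX
cases = [(18513, 0.34, 0.152), (133623, 7.65, 2.53), (788103, 468.1, 122.5), (1067268, 693.0, 181.8)]

for ndof, t_newton, t_modified in cases:
    saving = 100.0 * (1.0 - t_modified / t_newton)
    print(f"{ndof:8d} dofs: saving = {saving:.0f}%")
# Savings grow from roughly 55% on the smallest grid to about 74% on the largest one.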
V. CONCLUSIONS

Three different out-of-core solvers (MUMPS, HSL_MA78, PARDISO) are evaluated for the solution of the finite element Navier-Stokes formulation of laminar flow in a rectangular channel. METIS is found to be the best choice of ordering algorithm for reducing the fill-in of the LU factors. Of the three solvers, PARDISO is found to be the best solver, with lower computational time and lower in-core and out-of-core memory requirements.

It is observed that out-of-core HSL_MA78 performs almost identically to the in-core HSL_MA78 solver; HSL_OF01 facilitates the efficient I/O operations for the HSL_MA78 solver. However, the out-of-core HSL_MA78 is much slower than the MUMPS and PARDISO out-of-core solvers. The out-of-core strategy can help in solving large three-dimensional finite element problems: out-of-core PARDISO could solve around 2 million equations resulting from a three-dimensional finite element formulation on a single desktop. Further, it is observed that the use of the modified Newton algorithm can significantly reduce the computational time as compared to the Newton algorithm.

REFERENCES

[1] T. A. Davis and I. S. Duff, “A combined unifrontal/multifrontal method for unsymmetric sparse matrices,” ACM Trans. Math. Soft., vol. 25, no. 1, 1997, pp. 1–19.
[2] O. Schenk, K. Gartner, and W. Fichtner, “Efficient Sparse LU Factorization with Left-right Looking Strategy on Shared Memory Multiprocessors,” BIT, vol. 40, no. 1, 2000, pp. 158–176.
[3] B. M. Irons, “A frontal solution scheme for finite element analysis,” Numer. Meth. Engg., vol. 2, 1970, pp. 5–32.
[4] M. P. Raju and J. S. T’ien, “Development of Direct Multifrontal Solvers for Combustion Problems,” Numerical Heat Transfer-Part B, vol. 53, 2008, pp. 1–17.
[5] M. P. Raju and J. S. T’ien, “Modelling of Candle Wick Burning with a Self-trimmed Wick,” Comb. Theory Modell., vol. 12, no. 2, 2008, pp. 367–388.
[6] M. P. Raju and J. S. T’ien, “Two-phase flow inside an externally heated axisymmetric porous wick,” vol. 11, no. 8, 2008, pp. 701–718.
[7] P. K. Gupta and K. V. Pagalthivarthi, “Application of Multifrontal and GMRES Solvers for Multisize Particulate Flow in Rotating Channels,” Prog. Comput. Fluid Dynam., vol. 7, 2007, pp. 323–336.
[8] S. Khaitan, J. McCalley, and Q. Chen, “Multifrontal solver for online power system time-domain simulation,” IEEE Transactions on Power Systems, vol. 23, no. 4, 2008, pp. 1727–1737.
[9] S. Khaitan, C. Fu, and J. D. McCalley, “Fast parallelized algorithms for online extended-term dynamic cascading analysis,” PSCE, 2009, pp. 1–7.
[10] J. McCalley and S. Khaitan, “Risk of Cascading Outages,” Final Report, PSERC Report S-26, August 2007. Available: http://www.pserc.org/docsa/Executive_Summary_Dobson_McCalley_Cascading_Outage_S-2626_PSERC_Final_Report.pdf
[11] P. R. Amestoy and I. S. Duff, “Vectorization of a multiprocessor multifrontal code,” International Journal of Supercomputer Applications, vol. 3, 1989, pp. 41–59.
[12] P. R. Amestoy, I. S. Duff, J. Koster, and J. Y. L’Excellent, “A fully asynchronous multifrontal solver using distributed dynamic scheduling,” SIAM Journal on Matrix Analysis and Applications, vol. 23, no. 1, 2001, pp. 15–41.

[13] P. R. Amestoy, I. S. Duff, and J. Y. L’Excellent, “Multifrontal parallel distributed symmetric and unsymmetric solvers,” Comput. Methods Appl. Mech. Eng., vol. 184, 2000, pp. 501–520.
[14] O. Schenk, “Scalable Parallel Sparse LU Factorization Methods on
Shared Memory Multiprocessors,” Ph.D. dissertation, ETH Zurich,
2000.
[15] O. Schenk, and K. Gartner, “Sparse Factorization with Two-Level
Scheduling in PARDISO,” in Proc. 10th SIAM conf. Parallel Processing
for Scientific Computing, Portsmouth, Virginia, March 12-14, 2001.
[16] O. Schenk, and K. Gartner, “Two-level scheduling in PARDISO:
Improved Scalability on Shared Memory Multiprocessing Systems,”
Parallel Computing, vol. 28, 2002, pp. 187-197.
[17] O. Schenk, and K. Gartner, “Solving Unsymmetric Sparse Systems of
Linear Equations with PARDISO,” Journal Future Generation
Computer Systems, vol. 20, no. 3, 2004, pp. 475-487.
[18] Intel MKL Reference Manual, Intel® Math Kernel Library (MKL), 2007.
Available: http://www.intel.com/software/products/mkl/
[19] J. A. Scott, Numerical Analysis Group Progress Report, RAL-TR-2008-
001, Rutherford Appleton Laboratory, 2008.
[20] P. R. Amestoy, T. A. Davis, and I. S. Duff, “An approximate minimum
degree ordering algorithm,” SIAM Journal on Matrix Analysis and
Applications, vol. 17, 1996, pp. 886–905.
[21] P. R. Amestoy, “Recent progress in parallel multifrontal solvers for unsymmetric sparse matrices,” in Proc. 15th World Congress on Scientific Computation, Modelling and Applied Mathematics, IMACS, Berlin, 1997.
[22] J. Schulze, “Towards a tighter coupling of bottom-up and top-down
sparse matrix ordering methods,” BIT, vol. 41, no. 4, 2001, pp. 800–841.
[23] G. Karypis, and V. Kumar, “METIS – A Software Package for
Partitioning Unstructured Graphs, Partitioning Meshes, and Computing
Fill-Reducing Orderings of Sparse Matrices – Version 4.0,” University
of Minnesota, September 1998.
