Example Programs For Cvode v4.0.0
December 7, 2018
UCRL-SM-208110
DISCLAIMER
This document was prepared as an account of work sponsored by an agency of the United
States government. Neither the United States government nor Lawrence Livermore National
Security, LLC, nor any of their employees makes any warranty, expressed or implied, or as-
sumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any
information, apparatus, product, or process disclosed, or represents that its use would not
infringe privately owned rights. Reference herein to any specific commercial product, pro-
cess, or service by trade name, trademark, manufacturer, or otherwise does not necessarily
constitute or imply its endorsement, recommendation, or favoring by the United States gov-
ernment or Lawrence Livermore National Security, LLC. The views and opinions of authors
expressed herein do not necessarily state or reflect those of the United States government or
Lawrence Livermore National Security, LLC, and shall not be used for advertising or product
endorsement purposes.
This work was performed under the auspices of the U.S. Department of Energy by Lawrence
Livermore National Laboratory under Contract DE-AC52-07NA27344.
1 Introduction
This report is intended to serve as a companion document to the User Documentation of
cvode [2]. It provides details, with listings, on the example programs supplied with the
cvode distribution package.
The cvode distribution contains examples of several types: serial C examples, parallel C
examples, serial and parallel Fortran examples, an OpenMP example, a hypre example, a CUDA example, and a RAJA example.
With the exception of "demo"-type example files, the names of all the examples distributed
with sundials are of the form [slv][PbName]_[ls]_[prec]_[p], where
[slv] identifies the solver (for cvode examples this is cv, while for fcvode examples, this is
fcv);
[ls] identifies the linear solver module used (for examples using fixed-point iteration for the
nonlinear system solver, non specifies that no linear solver was used);
[prec] indicates the cvode preconditioner module used, bp for cvbandpre or bbd for cvbb-
dpre (only if applicable, for examples using a Krylov linear solver);
[p] indicates an example using the parallel vector module nvector parallel.
• cvRoberts dns solves a chemical kinetics problem consisting of three rate equations.
This program solves the problem with the BDF method and Newton iteration, with the
sunlinsol dense linear solver, cvls interface, and a user-supplied Jacobian routine.
It also uses the rootfinding feature of cvode.
• cvRoberts dns constraints is the same as cvRoberts dns but imposes the constraint
u ≥ 0.0 for all components.
• cvRoberts dnsL is the same as cvRoberts dns but uses the LAPACK implementation
of sunlinsol lapackdense.
• cvRoberts dns uw is the same as cvRoberts dns but demonstrates the user-supplied
error weight function feature of cvode.
• cvRoberts klu is the same as cvRoberts dns but uses the klu sparse direct linear
solver, sunlinsol klu.
• cvRoberts sps is the same as cvRoberts dns but uses the superlumt sparse direct
linear solver, sunlinsol superlumt (with one thread).
• cvAdvDiff bndL is the same as cvAdvDiff bnd but uses the LAPACK implementation
of sunlinsol lapackband.
• cvDiurnal kry solves the semi-discrete form of a two-species diurnal kinetics advection-
diffusion PDE system in 2-D.
The problem is solved with the BDF/GMRES method (i.e. using the sunlinsol spgmr
linear solver and cvls interface) and the block-diagonal part of the Newton matrix as
a left preconditioner. A copy of the block-diagonal part of the Jacobian is saved and
conditionally reused within the preconditioner setup routine.
• cvDiurnal kry bp solves the same problem as cvDiurnal kry, with the BDF/GMRES
method and a banded preconditioner, generated by difference quotients, using the mod-
ule cvbandpre.
The problem is solved twice: with preconditioning on the left, then on the right.
• cvKrylovDemo ls solves the same problem as cvDiurnal kry, with the BDF method,
but with three Krylov linear solvers: sunlinsol spgmr, sunlinsol spbcgs, and sun-
linsol sptfqmr.
• cvHeat2D klu solves a discretized 2D heat equation using the klu sparse-direct linear
solver, sunlinsol klu.
• cvAdvDiff non p solves the semi-discrete form of a 1-D advection-diffusion equation.
This program solves the problem with the option for nonstiff systems, i.e. Adams
method and fixed-point iteration.
• cvAdvDiff diag p solves the same problem as cvAdvDiff non p, with the Adams
method, but with Newton iteration and the CVDiag linear solver.
• cvDiurnal kry bbd p solves the same problem as cvDiurnal kry p, with BDF and the
GMRES linear solver, using a block-diagonal matrix with banded blocks as a precon-
ditioner, generated by difference quotients, using the module cvbbdpre.
• fcvRoberts dns constraints is the same as fcvRoberts dns but imposes the
constraint u ≥ 0.0 for all components.
• fcvRoberts dnsL is the same as fcvRoberts dns but uses the Lapack implementation
of sunlinsol lapackdense.
• fcvDiag kry bbd p is the same as the fcvDiag kry p example but using the fcvbbd
module.
In the following sections, we give detailed descriptions of some (but not all) of these examples.
We also give our output files for each of these examples, but users should be cautioned that
their results may differ slightly from these. Solution values may differ within the tolerances,
and cumulative counters, such as numbers of steps or Newton iterations, may differ from one
machine environment to another by as much as 10% to 20%.
The final section of this report describes a set of tests done with the parallel version of
CVODE, using a problem based on the cvDiurnal kry/cvDiurnal kry p example.
In the descriptions below, we make frequent references to the cvode User Document [2].
All citations to specific sections (e.g. §4.2) are references to parts of that User Document,
unless explicitly stated otherwise.
Note. The examples in the cvode distribution are written in such a way as to compile and
run for any combination of configuration options during the installation of sundials (see
Appendix A in the User Guide). As a consequence, they contain portions of code that will
not be typically present in a user program. For example, all C example programs make use of
the variables SUNDIALS EXTENDED PRECISION and SUNDIALS DOUBLE PRECISION to test if the
solver libraries were built in extended or double precision, and use the appropriate conversion
specifiers in printf functions.
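For instance, the precision-dependent output code follows a pattern like the sketch below (the print statement itself is illustrative and not taken from any particular example):

  #include <stdio.h>
  #include <sundials/sundials_types.h>   /* realtype and the precision macros */

  static void print_state(realtype t, realtype y1)
  {
  #if defined(SUNDIALS_EXTENDED_PRECISION)
    printf("At t = %0.4Le    y1 = %14.6Le\n", t, y1);  /* realtype is long double */
  #elif defined(SUNDIALS_DOUBLE_PRECISION)
    printf("At t = %0.4e    y1 = %14.6e\n", t, y1);    /* realtype is double */
  #else
    printf("At t = %0.4e    y1 = %14.6e\n", t, y1);    /* single precision: float is promoted */
  #endif
  }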
2 Serial example problems
2.1 A dense example: cvRoberts dns
As an initial illustration of the use of the cvode package for the integration of IVP ODEs,
we give a sample program called cvRoberts dns.c. It uses the cvode linear solver inter-
face cvls with dense matrix and linear solver modules (sunmatrix dense and sunlin-
sol dense) and the nvector serial module (which provides a serial implementation of
nvector) in the solution of a 3-species chemical kinetics problem.
The problem consists of the following three rate equations:
ẏ1 = −0.04·y1 + 10^4·y2·y3
ẏ2 = 0.04·y1 − 10^4·y2·y3 − 3·10^7·y2^2                                  (1)
ẏ3 = 3·10^7·y2^2
on the interval t ∈ [0, 4·10^10], with initial conditions y1(0) = 1.0, y2(0) = y3(0) = 0.0.
While integrating the system, we also use the rootfinding feature to find the points at which
y1 = 10^−4 or at which y3 = 0.01.
For this example we give a rather detailed explanation of the parts of the program and their
interaction with cvode.
Following the initial comment block, this program has a number of #include lines, which
allow access to useful items in cvode header files. The sundials types.h file provides the
definition of the type realtype (see §4.2 for details). For now, it suffices to read realtype as
double. The cvode.h file provides prototypes for the cvode functions to be called (excluding
the linear solver selection function), and also a number of constants that are to be used in
setting input arguments and testing the return value of CVode. The sunlinsol dense.h file is
the header file for the dense implementation of the sunlinsol module and includes definitions
of the SUNLinearSolver type. Similarly, the sunmatrix dense.h file is the header file for the
dense implementation of the sunmatrix module, including definitions of the SUNMatrix type
as well as macros and functions to access matrix components. We have explicitly included
sunmatrix dense.h, but this is not necessary because it is included by sunlinsol dense.h.
The nvector serial.h file is the header file for the serial implementation of the nvector
module and includes definitions of the N Vector type, a macro to access vector components,
and prototypes for the serial implementation specific machine environment memory allocation
and freeing functions.
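The resulting include block is short; a sketch of it, using the header paths as installed by sundials, is:

  #include <stdio.h>
  #include <cvode/cvode.h>                /* CVode functions and constants                  */
  #include <nvector/nvector_serial.h>     /* serial N_Vector type, macros, and constructors */
  #include <sunmatrix/sunmatrix_dense.h>  /* dense SUNMatrix type and accessor macros       */
  #include <sunlinsol/sunlinsol_dense.h>  /* dense SUNLinearSolver implementation           */
  #include <sundials/sundials_types.h>    /* definition of realtype                         */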
This program includes two user-defined accessor macros, Ith and IJth, that are useful in
writing the problem functions in a form closely matching the mathematical description of the
ODE system, i.e. with components numbered from 1 instead of from 0. The Ith macro is used
to access components of a vector of type N Vector with a serial implementation. It is defined
using the nvector serial accessor macro NV Ith S which numbers components starting
with 0. The IJth macro is used to access elements of a dense matrix of type SUNMatrix. It is
similarly defined using the sunmatrix dense accessor macro SM ELEMENT D which numbers
matrix rows and columns starting with 0. The macro NV Ith S is fully described in §6.2. The
macro SM ELEMENT D is fully described in §7.2.
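In the source, the two macros are defined along the following lines (a sketch; the shift by one converts the 1-based indices used in the problem description to the 0-based indices expected by NV Ith S and SM ELEMENT D):

  #define Ith(v,i)    NV_Ith_S(v,i-1)          /* i-th vector component,   i = 1..NEQ   */
  #define IJth(A,i,j) SM_ELEMENT_D(A,i-1,j-1)  /* (i,j)-th matrix element, i,j = 1..NEQ */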
Next, the program includes some problem-specific constants, which are isolated to this
early location to make it easy to change them as needed. The program prologue ends with
prototypes of four private helper functions and the three user-supplied functions that are
called by cvode.
The main program begins with some dimensions and type declarations, including use
of the generic types N Vector, SUNMatrix and SUNLinearSolver. The next several lines
allocate memory for the y and abstol vectors using N VNew Serial with a length argument
of NEQ (= 3). The lines following that load the initial values of the dependent variable
vector into y and the absolute tolerances into abstol using the Ith macro.
The calls to N VNew Serial, and also later calls to CVode*** functions, make use of a
private function, check flag, which examines the return value and prints a message if there
was a failure. The check flag function was written to be used for any serial sundials
application.
The call to CVodeCreate creates the cvode solver memory block, specifying the CV BDF
integration method with CV NEWTON iteration. Its return value is a pointer to that memory
block for this problem. In the case of failure, the return value is NULL. This pointer must be
passed in the remaining calls to cvode functions.
The call to CVodeInit allocates and initializes the solver memory block. Its arguments
include the name of the C function f defining the right-hand side function f (t, y), and the
initial values of t and y. The call to CVodeSVtolerances specifies a vector of absolute
tolerances, and includes the value of the relative tolerance reltol and the absolute tolerance
vector abstol. See §4.5.1 and §4.5.2 for full details of these calls.
The call to CVodeRootInit specifies that a rootfinding problem is to be solved along with
the integration of the ODE system, that the root functions are specified in the function g,
and that there are two such functions. Specifically, they are set to y1 − 0.0001 and y3 − 0.01,
respectively. See §4.5.5 for a detailed description of this call.
The call to SUNDenseMatrix (see §7.2) creates a NEQ×NEQ dense sunmatrix object to use
within the Newton solve in cvode. The following call to SUNLinSol Dense (see §8.5) creates
the dense sunlinsol object that will perform the linear solves within the Newton method.
These are attached to the cvls linear solver interface with the call to CVodeSetLinearSolver
(see §4.5.3), and the subsequent call to CVodeSetJacFn (see §4.5.7) specifies the analytic
Jacobian supplied by the user-supplied function Jac.
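Collecting these calls, a condensed sketch of the setup sequence is shown below; error checking through check flag is omitted, and the one-argument form of CVodeCreate is assumed here.

  void *cvode_mem;  SUNMatrix A;  SUNLinearSolver LS;  int flag;

  cvode_mem = CVodeCreate(CV_BDF);                      /* BDF integration method             */
  flag = CVodeInit(cvode_mem, f, T0, y);                /* attach f(t,y) and initial data     */
  flag = CVodeSVtolerances(cvode_mem, reltol, abstol);  /* scalar reltol, vector abstol       */
  flag = CVodeRootInit(cvode_mem, 2, g);                /* two root functions, defined in g   */

  A  = SUNDenseMatrix(NEQ, NEQ);                        /* template Jacobian matrix           */
  LS = SUNLinSol_Dense(y, A);                           /* dense linear solver object         */
  flag = CVodeSetLinearSolver(cvode_mem, LS, A);        /* attach matrix and solver to cvls   */
  flag = CVodeSetJacFn(cvode_mem, Jac);                 /* user-supplied dense Jacobian       */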
The actual solution of the ODE initial value problem is accomplished in the loop over
values of the output time tout. In each pass of the loop, the program calls CVode in the
CV NORMAL mode, meaning that the integrator is to take steps until it overshoots tout and
then interpolate to t = tout, putting the computed value of y(tout) into y.
The return value in this case is CV SUCCESS. However, if CVode finds a root before reaching
the next value of tout, it returns CV ROOT RETURN and stores the root location in t and the
solution there in y. In either case, the program prints t and y. In the case of a root, it calls
CVodeGetRootInfo to get a length-2 array rootsfound of bits showing which root function
was found to have a root. If CVode returned any negative value (indicating a failure), the
program breaks out of the loop. In the case of a CV SUCCESS return, the value of tout is
advanced (multiplied by 10) and a counter (iout) is advanced, so that the loop can be ended
when that counter reaches the preset number of output times, NOUT = 12. See §4.5.6 for full
details of the call to CVode.
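A compressed sketch of this loop, with the example's printing reduced to plain printf calls and double-precision realtype assumed, reads:

  int iout = 0, flag;
  realtype t, tout = T1;                    /* T1: first output time (problem constant) */
  while (iout < NOUT) {
    flag = CVode(cvode_mem, tout, y, &t, CV_NORMAL);
    printf("At t = %g   y = %g  %g  %g\n", t, Ith(y,1), Ith(y,2), Ith(y,3));
    if (flag == CV_ROOT_RETURN) {           /* a root was located at t < tout */
      int rootsfound[2];
      CVodeGetRootInfo(cvode_mem, rootsfound);
      printf("    rootsfound[] = %d  %d\n", rootsfound[0], rootsfound[1]);
    }
    if (flag < 0) break;                    /* integration failure */
    if (flag == CV_SUCCESS) { iout++; tout *= RCONST(10.0); }
  }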
Finally, the main program calls PrintFinalStats to get and print all of the relevant
statistical quantities. It then calls NV Destroy to free the vectors y and abstol, CVodeFree
to free the cvode memory block, SUNLinSolFree to free the linear solver memory, and
SUNMatDestroy to free the matrix A.
The function PrintFinalStats used here is actually suitable for general use in appli-
cations of cvode to any problem with a direct linear solver. It calls various CVodeGet***
functions to obtain the relevant counters, and then prints them. Specifically, these are: the
cumulative number of steps (nst), the number of f evaluations (nfe) (excluding those for
difference-quotient Jacobian evaluations), the number of matrix factorizations (nsetups), the
number of f evaluations for Jacobian evaluations (nfeLS = 0 here), the number of Jacobian
evaluations (nje), the number of nonlinear (Newton) iterations (nni), the number of nonlin-
ear convergence failures (ncfn), the number of local error test failures (netf), and the number
of g (root function) evaluations (nge). These optional outputs are described in §4.5.9.
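A sketch of the retrieval is given below; the first group are core cvode optional-output functions, while the getter names used for the linear-solver counters nje and nfeLS are the cvls ones we assume here.

  long int nst, nfe, nsetups, netf, nni, ncfn, nje, nfeLS, nge;
  CVodeGetNumSteps(cvode_mem, &nst);
  CVodeGetNumRhsEvals(cvode_mem, &nfe);
  CVodeGetNumLinSolvSetups(cvode_mem, &nsetups);
  CVodeGetNumErrTestFails(cvode_mem, &netf);
  CVodeGetNumNonlinSolvIters(cvode_mem, &nni);
  CVodeGetNumNonlinSolvConvFails(cvode_mem, &ncfn);
  CVodeGetNumJacEvals(cvode_mem, &nje);       /* cvls counter (name assumed) */
  CVodeGetNumLinRhsEvals(cvode_mem, &nfeLS);  /* cvls counter (name assumed) */
  CVodeGetNumGEvals(cvode_mem, &nge);
  printf("nst = %ld  nfe = %ld  nsetups = %ld  nfeLS = %ld  nje = %ld\n", nst, nfe, nsetups, nfeLS, nje);
  printf("nni = %ld  ncfn = %ld  netf = %ld  nge = %ld\n", nni, ncfn, netf, nge);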
The function f is a straightforward expression of the ODEs. It uses the user-defined
macro Ith to extract the components of y and to load the components of ydot. See §4.6.1
for a detailed specification of f.
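A sketch of f for this problem (double-precision realtype assumed) is:

  static int f(realtype t, N_Vector y, N_Vector ydot, void *user_data)
  {
    realtype y1 = Ith(y,1), y2 = Ith(y,2), y3 = Ith(y,3);

    realtype yd1 = RCONST(-0.04)*y1 + RCONST(1.0e4)*y2*y3;
    realtype yd3 = RCONST(3.0e7)*y2*y2;

    Ith(ydot,1) = yd1;
    Ith(ydot,3) = yd3;
    Ith(ydot,2) = -yd1 - yd3;   /* the three rates in (1) sum to zero */

    return(0);
  }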
Similarly, the function g defines the two functions, g0 and g1 , whose roots are to be found.
See §4.6.4 for a detailed description of the g function.
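A corresponding sketch of g is:

  static int g(realtype t, N_Vector y, realtype *gout, void *user_data)
  {
    gout[0] = Ith(y,1) - RCONST(0.0001);   /* g0 = y1 - 1e-4 */
    gout[1] = Ith(y,3) - RCONST(0.01);     /* g1 = y3 - 0.01 */
    return(0);
  }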
The function Jac sets the nonzero elements of the Jacobian as a dense matrix. (Zero
elements need not be set because J is preset to zero.) It uses the user-defined macro IJth to
reference the elements of a dense matrix of type sunmatrix. Here the problem size is small,
so we need not worry about the inefficiency of using NV Ith S and SM ELEMENT D to access
N Vector and sunmatrix dense elements. Note that in this example, Jac only accesses the
y and J arguments. See §4.6.5 for a detailed description of the Jac function.
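A sketch of Jac, obtained by differentiating the right-hand sides in (1), is:

  static int Jac(realtype t, N_Vector y, N_Vector fy, SUNMatrix J, void *user_data,
                 N_Vector tmp1, N_Vector tmp2, N_Vector tmp3)
  {
    realtype y2 = Ith(y,2), y3 = Ith(y,3);

    IJth(J,1,1) = RCONST(-0.04);
    IJth(J,1,2) = RCONST(1.0e4)*y3;
    IJth(J,1,3) = RCONST(1.0e4)*y2;
    IJth(J,2,1) = RCONST(0.04);
    IJth(J,2,2) = RCONST(-1.0e4)*y3 - RCONST(6.0e7)*y2;
    IJth(J,2,3) = RCONST(-1.0e4)*y2;
    IJth(J,3,2) = RCONST(6.0e7)*y2;      /* all remaining entries stay zero */

    return(0);
  }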
The output generated by cvRoberts dns is shown below. It shows the output values at
the 12 preset values of tout. It also shows the two root locations found, first at a root of g1 ,
and then at a root of g0 .
Final Statistics :
nst = 542 nfe = 754 nsetups = 107 nfeLS = 0 nje = 11
nni = 751 ncfn = 0 netf = 22 nge = 570
on a rectangle, with zero Dirichlet boundary conditions. The PDE is discretized with standard
central finite differences on a (MX+2) × (MY+2) mesh, giving an ODE system of size MX*MY.
The discrete value vij approximates v at x = i∆x, y = j∆y. The ODEs are
dv_ij/dt = f_ij = (v_{i−1,j} − 2v_ij + v_{i+1,j})/(∆x)^2 + .5 (v_{i+1,j} − v_{i−1,j})/(2∆x) + (v_{i,j−1} − 2v_ij + v_{i,j+1})/(∆y)^2 ,      (3)
where 1 ≤ i ≤MX and 1 ≤ j ≤MY. The boundary conditions are imposed by taking vij = 0
above if i = 0 or MX+1, or if j = 0 or MY+1. If we set u_{(j−1)+(i−1)·MY} = v_ij, so that the ODE
system is u̇ = f (u), then the system Jacobian J = ∂f /∂u is a band matrix with upper and
lower half-bandwidths both equal to MY. In the example, we take MX = 10 and MY = 5.
The cvAdvDiff bnd.c program includes files sunmatrix band.h and sunlinsol band.h
in order to use the sunlinsol band linear solver. The sunmatrix band.h file contains the
definition of the banded sunmatrix type, and the SM COLUMN B and SM COLUMN ELEMENT B
macros for accessing banded matrix elements (see §7.3). The sunlinsol band.h file con-
tains the definition of the banded sunlinsol type. We note that we have explicitly included
sunmatrix band.h, but this is not necessary because it is included by sunlinsol band.h.
The file nvector serial.h is included for the definition of the serial N Vector type.
The include lines at the top of the file are followed by definitions of problem constants
which include the x and y mesh dimensions, MX and MY, the number of equations NEQ, the
scalar absolute tolerance ATOL, the initial time T0, and the initial output time T1.
Spatial discretization of the PDE naturally produces an ODE system in which equations
are numbered by mesh coordinates (i, j). The user-defined macro IJth isolates the translation
from the mathematical two-dimensional index to the one-dimensional N Vector index and
allows the user to write clean, readable code to access components of the dependent variable.
The NV DATA S macro returns the component array for a given N Vector, and this array is
passed to IJth in order to do the actual N Vector access.
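A sketch consistent with the index formula quoted above is:

  /* IJth maps the 2-D mesh index (i,j), 1 <= i <= MX, 1 <= j <= MY,
     onto the 1-D component array vdata obtained from NV_DATA_S(u). */
  #define IJth(vdata,i,j) (vdata[(j-1) + (i-1)*MY])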
The type UserData is a pointer to a structure containing problem data used in the f
and Jac functions. This structure is allocated and initialized at the beginning of main. The
pointer to it, called data, is passed to CVodeSetUserData, and as a result it will be passed
back to the f and Jac functions each time they are called. The use of the data pointer
eliminates the need for global program data.
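A sketch of this arrangement is given below; the structure holds at least the three difference coefficients referred to later, though the distributed source may carry additional fields.

  typedef struct {
    realtype hdcoef;   /* (dx)^{-2}: horizontal diffusion coefficient */
    realtype hacoef;   /* .5/(2 dx): horizontal advection coefficient */
    realtype vdcoef;   /* (dy)^{-2}: vertical diffusion coefficient   */
  } *UserData;

  /* in main(), after dx and dy have been computed from the mesh constants: */
  UserData data = (UserData) malloc(sizeof *data);
  data->hdcoef = RCONST(1.0)/(dx*dx);
  data->hacoef = RCONST(0.5)/(RCONST(2.0)*dx);
  data->vdcoef = RCONST(1.0)/(dy*dy);
  flag = CVodeSetUserData(cvode_mem, data);   /* handed back to f and Jac as user_data */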
The main program is straightforward. The CVodeCreate call specifies the CV BDF method
with a CV NEWTON iteration. Following the CVodeInit call, the call to CVodeSStolerances in-
dicates scalar relative and absolute tolerances, and values reltol and abstol are passed. The
call to SUNBandMatrix (see §7.3) creates a banded sunmatrix Jacobian template, and speci-
fies that both half-bandwidths of the Jacobian are equal to MY. The calls to SUNLinSol_Band
(see §8.6) and CVodeSetLinearSolver (see §4.5.3) attach the sunlinsol band linear solver
to the cvls interface. The call to CVodeSetJacFn (see §4.5.7) specifies that a user-supplied
Jacobian function Jac is to be used.
The actual solution of the problem is performed by the call to CVode within the loop over
the output times tout. The max-norm of the solution vector (from a call to N VMaxNorm)
and the cumulative number of time steps (from a call to CVodeGetNumSteps) are printed at
each output time. Finally, the calls to PrintFinalStats, N VDestroy, and CVodeFree print
statistics and free problem memory.
Following the main program in the cvAdvDiff bnd.c file are definitions of seven functions:
f, Jac, SetIC, PrintHeader, PrintOutput, PrintFinalStats, and check flag. The last
five functions are called only from within the cvAdvDiff bnd.c file. The SetIC function sets
the initial dependent variable vector; PrintHeader prints the heading of the output page;
PrintOutput prints a line of solution output; PrintFinalStats gets and prints statistics at
the end of the run; and check flag aids in checking return values. The statistics printed
include counters such as the total number of steps (nst), f evaluations (excluding those
for Jacobian evaluations) (nfe), LU decompositions (nsetups), f evaluations for difference-
quotient Jacobians (nfeLS = 0 here), Jacobian evaluations (nje), and nonlinear iterations
(nni). These optional outputs are described in §4.5.9. Note that PrintFinalStats is suitable
for general use in applications of cvode to any problem with a direct linear solver.
The f function implements the central difference approximation (3) with u identically zero
on the boundary. The constant coefficients (∆x)^−2, .5(2∆x)^−1, and (∆y)^−2 are computed
only once at the beginning of main, and stored in the locations data->hdcoef, data->hacoef,
and data->vdcoef, respectively. When f receives the data pointer (renamed user data
here), it pulls these values out of storage into the local variables hordc, horac, and verdc.
It then uses these to construct the diffusion and advection terms, which are combined to form
udot. Note the extra lines setting out-of-bounds values of u to zero.
The Jac function is an expression of the derivatives
∂f_ij/∂v_ij = −2[(∆x)^−2 + (∆y)^−2] ,
∂f_ij/∂v_{i±1,j} = (∆x)^−2 ± .5(2∆x)^−1 ,    ∂f_ij/∂v_{i,j±1} = (∆y)^−2 .
This function loads the Jacobian by columns, and like f it makes use of the preset coefficients
in data. It loops over the mesh points (i,j). For each such mesh point, the one-dimensional
index k = j-1 + (i-1)*MY is computed and the kth column of the Jacobian matrix J is set.
The row index k′ of each component f_{i′,j′} that depends on v_{i,j} must be identified in order
to load the corresponding element. The elements are loaded with the SM COLUMN ELEMENT B
macro. Note that the formula for the global index k implies that decreasing (increasing)
i by 1 corresponds to decreasing (increasing) k by MY, while decreasing (increasing) j by 1
corresponds to decreasing (increasing) k by 1. These statements are reflected in the arguments
to SM COLUMN ELEMENT B. The first argument passed to the SM COLUMN ELEMENT B macro is a
pointer to the diagonal element in the column to be accessed. This pointer is obtained via
a call to the SM COLUMN B macro and is stored in kthCol in the Jac function. When setting
the components of J we must be careful not to index out of bounds. The guards (i != 1)
etc. in front of the calls to SM COLUMN ELEMENT B prevent illegal indexing. See §4.6.5 for a
detailed description of the Jac function.
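The column-loading logic just described can be summarized by the following sketch of the body of Jac (variable declarations and the retrieval of data from user data abbreviated):

  realtype hordc = data->hdcoef, horac = data->hacoef, verdc = data->vdcoef;
  realtype *kthCol;
  sunindextype i, j, k;

  for (j = 1; j <= MY; j++) {
    for (i = 1; i <= MX; i++) {
      k = j-1 + (i-1)*MY;                 /* column index of v_ij in J                     */
      kthCol = SM_COLUMN_B(J, k);         /* pointer to column k, anchored at its diagonal */
      SM_COLUMN_ELEMENT_B(kthCol, k, k) = RCONST(-2.0)*(verdc + hordc);
      if (i != 1)  SM_COLUMN_ELEMENT_B(kthCol, k-MY, k) = hordc + horac;  /* row of f_{i-1,j} */
      if (i != MX) SM_COLUMN_ELEMENT_B(kthCol, k+MY, k) = hordc - horac;  /* row of f_{i+1,j} */
      if (j != 1)  SM_COLUMN_ELEMENT_B(kthCol, k-1,  k) = verdc;          /* row of f_{i,j-1} */
      if (j != MY) SM_COLUMN_ELEMENT_B(kthCol, k+1,  k) = verdc;          /* row of f_{i,j+1} */
    }
  }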
The output generated by cvAdvDiff bnd is shown below.
cvAdvDiff bnd sample output
At t = 1.00      max.norm(u) = 6.556853e-05   nst = 142
Final Statistics :
nst = 142 nfe = 173 nsetups = 23 nfeLS = 0 nje = 3
nni = 170 ncfn = 0 netf = 3
The spatial domain is 0 ≤ x ≤ 20, 30 ≤ y ≤ 50 (in km). The various constants and
parameters are: Kh = 4.0·10^−6, V = 10^−3, Kv = 10^−8 exp(y/5), q1 = 1.63·10^−16, q2 =
4.66·10^−16, c3 = 3.7·10^16, and the diurnal rate constants are defined as:
q_i(t) = exp[−a_i / sin ωt]  for sin ωt > 0 ,    q_i(t) = 0  for sin ωt ≤ 0    (i = 3, 4) ,
where ω = π/43200, a3 = 22.62, a4 = 7.601. The time interval of integration is [0, 86400],
representing 24 hours measured in seconds.
Homogeneous Neumann boundary conditions are imposed on each boundary, and the
initial conditions are
c1(x, y, 0) = 10^6 α(x)β(y) ,    c2(x, y, 0) = 10^12 α(x)β(y) ,
α(x) = 1 − (0.1x − 1)^2 + (0.1x − 1)^4/2 ,                              (6)
β(y) = 1 − (0.1y − 4)^2 + (0.1y − 4)^4/2 .
For this example, the equations (4) are discretized spatially with standard central finite
differences on a 10 × 10 mesh, giving an ODE system of size 200.
Among the initial #include lines in this case are lines to include sunlinsol spgmr.h and
sundials math.h. The first contains constants and function prototypes associated with the
sunlinsol spgmr module, including the values of the pretype argument to SUNLinSol SPGMR.
The inclusion of sundials math.h is done to access the SUNSQR macro for the square of a
realtype number.
The main program calls CVodeCreate, specifying the CV BDF method and CV NEWTON iter-
ation, then calls CVodeInit, and then calls CVodeSStolerances to specify the scalar tolerances.
It calls SUNLinSol SPGMR to create the spgmr linear solver with left preconditioning, and the
default value (indicated by a zero argument) for maxl. It then calls CVodeSetLinearSolver
(see §4.5.3) to attach this linear solver to the cvls interface. The call to CVodeSetJacTimes
specifies a user-supplied function for Jacobian-vector products (the NULL argument speci-
fies that no Jacobian-vector setup routine is needed). Next, user-supplied preconditioner
setup and solve functions, Precond and PSolve, are specified. See §4.5.7 for details on the
CVodeSetPreconditioner function.
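A condensed sketch of this setup (error checks omitted; jtv, Precond, and PSolve denote the user routines described here, with jtv a placeholder name for the Jacobian-vector routine) is:

  SUNLinearSolver LS;  int flag;

  LS = SUNLinSol_SPGMR(y, PREC_LEFT, 0);              /* GMRES, left preconditioning, default maxl */
  flag = CVodeSetLinearSolver(cvode_mem, LS, NULL);   /* matrix-free: no SUNMatrix is supplied     */
  flag = CVodeSetJacTimes(cvode_mem, NULL, jtv);      /* user Jv routine, no Jv setup routine      */
  flag = CVodeSetPreconditioner(cvode_mem, Precond, PSolve);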
For a sequence of tout values, CVode is called in the CV NORMAL mode, sampled output
is printed, and the return value is tested for error conditions. After that, PrintFinalStats
is called to get and print final statistics, and memory is freed by calls to N VDestroy,
FreeUserData, and CVodeFree. The printed statistics include various counters, such as the
total numbers of steps (nst), of f evaluations (excluding those for Jv product evaluations)
(nfe), of f evaluations for Jv evaluations (nfeLS), of nonlinear iterations (nni), of linear
(Krylov) iterations (nli), of preconditioner setups (nsetups), of preconditioner evaluations
(npe), and of preconditioner solves (nps), among others. Also printed are the lengths of the
problem-dependent real and integer workspaces used by the main integrator CVode, denoted
lenrw and leniw, and those used by cvls, denoted lenrwLS and leniwLS. All of these op-
tional outputs are described in §4.5.9. The PrintFinalStats function is suitable for general
use in applications of cvode to any problem with an iterative linear solver.
Mathematically, the dependent variable has three dimensions: species number, x mesh
point, and y mesh point. But in nvector serial, a vector of type N Vector works with a
one-dimensional contiguous array of data components. The macro IJKth isolates the transla-
tion from three dimensions to one. Its use results in clearer code and makes it easy to change
the underlying layout of the three-dimensional data. Here the problem size is 200, so we use
the NV DATA S macro for efficient N Vector access. The NV DATA S macro gives a pointer to
the first component of an N Vector which we pass to the IJKth macro to do an N Vector
access.
The preconditioner used here is the block-diagonal part of the true Newton matrix. It
is generated and factored in the Precond routine (see §4.6.9) and backsolved in the PSolve
routine (see §4.6.8). Its diagonal blocks are 2 × 2 matrices that include the interaction
Jacobian elements and the diagonal contribution of the diffusion Jacobian elements. The
block-diagonal part of the Jacobian itself, Jbd , is saved in separate storage each time it
is generated, on calls to Precond with jok == SUNFALSE. On calls with jok == SUNTRUE,
signifying that saved Jacobian data can be reused, the preconditioner P = I − γJbd is formed
from the saved matrix Jbd and factored. (A call to Precond with jok == SUNTRUE can only
occur after a prior call with jok == SUNFALSE.) The Precond routine must also set the value
of jcur, i.e. *jcurPtr, to SUNTRUE when Jbd is re-evaluated, and SUNFALSE otherwise, to
inform cvls of the status of Jacobian data.
We need to take a brief detour to explain one last important aspect of this program.
While the generic sunlinsol dense linear solver module serves as the interface to dense
matrix solves for the main sundials solvers, the underlying algebraic operations operate on
dense matrices with realtype ** as the underlying dense matrix type. To avoid the extra
layer of function calls and dense matrix and linear solver data structures, cvDiurnal kry.c
uses underlying small dense functions for all operations on the 2 × 2 preconditioner blocks.
Thus it includes sundials dense.h, and calls the small dense matrix functions newDenseMat,
newIndexArray, denseCopy, denseScale, denseAddIdentity, denseGETRF, and denseGETRS.
The macro IJth defined near the top of the file is used to access individual elements in each
preconditioner block, numbered from 1. The underlying dense algebra functions are available
for cvode user programs generally.
In addition to the functions called by cvode, cvDiurnal kry.c includes definitions of
several private functions. These are: AllocUserData to allocate space for Jbd , P , and the
pivot arrays; InitUserData to load problem constants in the data block; FreeUserData to
free that block; SetInitialProfiles to load the initial values in y; PrintOutput to retrieve
and print selected solution values and statistics; PrintFinalStats to print statistics; and
check flag to check return values for error conditions.
The output generated by cvDiurnal kry.c is shown below. Note that the number of
preconditioner evaluations, npe, is much smaller than the number of preconditioner setups,
nsetups, as a result of the Jacobian re-use scheme.
c1 (bot.left/middle/top rt.) = -1.624e-12   -1.151e-10   -2.246e-12
c2 (bot.left/middle/top rt.) =  3.334e+11    6.669e+11    4.120e+11
Final Statistics ..
3 Parallel example problems
3.1 A nonstiff example: cvAdvDiff non p
This problem begins with a simple diffusion-advection equation for u = u(t, x)
∂u/∂t = ∂^2u/∂x^2 + 0.5 ∂u/∂x                                            (7)
for 0 ≤ t ≤ 5, 0 ≤ x ≤ 2, and subject to homogeneous Dirichlet boundary conditions and
initial values given by
A system of MX ODEs is obtained by discretizing the x-axis with MX+2 grid points and
replacing the first and second order spatial derivatives with their central difference approxi-
mations. Since the value of u is constant at the two endpoints, the semi-discrete equations
for those points can be eliminated. With ui as the approximation to u(t, xi ), xi = i(∆x), and
∆x = 2/(MX+1), the resulting system of ODEs, u̇ = f (t, u), can now be written:
u̇_i = (u_{i+1} − 2u_i + u_{i−1})/(∆x)^2 + 0.5 (u_{i+1} − u_{i−1})/(2∆x) .      (9)
This equation holds for i = 1, 2, . . . , MX, with the understanding that u0 = uMX+1 = 0.
In the parallel processing environment, we may think of the several processors as being
laid out on a straight line with each processor to compute its contiguous subset of the solution
vector. Consequently the computation of the right hand side of Eq. (9) requires that each
interior processor must pass the first component of its block of the solution vector to its left-
hand neighbor, acquire the last component of that neighbor’s block, pass the last component
of its block of the solution vector to its right-hand neighbor, and acquire the first component of
that neighbor’s block. If the processor is the first (0th) or last processor, then communication
to the left or right (respectively) is not required.
This problem uses the Adams (non-stiff) integration formula and fixed-point nonlinear
solver. It is unrealistically simple, but serves to illustrate use of the parallel version of
CVODE.
The cvAdvDiff non p.c file begins with #include declarations for various required header
files, including lines for nvector parallel to access the parallel N Vector type and related
macros, and for mpi.h to access MPI types and constants. Following that are definitions
of problem constants and a data block for communication with the f routine. That block
includes the number of PEs, the index of the local PE, and the MPI communicator.
The main program begins with MPI calls to initialize MPI and to set multi-processor
environment parameters npes (number of PEs) and my pe (local PE index). The local vector
length is set according to npes and the problem size NEQ (which may or may not be a multiple
of npes). The value my base is the base value for computing global indices (from 1 to NEQ)
for the local vectors. The solution vector u is created with a call to N VNew Parallel and
loaded with a call to SetIC. The calls to CVodeCreate, CVodeInit, and CVodeSStolerances
specify a cvode solution with the nonstiff method and scalar tolerances. The call to
CVodeSetUserData ensures that the pointer data is passed to the f routine whenever it
is called. A heading is printed (if on processor 0). In a loop over tout values, CVode is called,
and the return value checked for errors. The max-norm of the solution and the total number
of time steps so far are printed at each output point. Finally, some statistical counters are
printed, memory is freed, and MPI is finalized.
The SetIC routine uses the last two arguments passed to it to compute the set of global
indices (my base+1 to my base+my length) corresponding to the local part of the solution
vector u, and then to load the corresponding initial values. The PrintFinalStats routine
uses CVodeGet*** calls to get various counters, and then prints these. The counters are: nst
(number of steps), nfe (number of f evaluations), nni (number of nonlinear iterations), netf
(number of error test failures), and ncfn (number of nonlinear convergence failures). This
routine is suitable for general use with cvode applications to nonstiff problems.
The f function is an implementation of Eq. (9), but preceded by communication opera-
tions appropriate for the parallel setting. It copies the local vector u into a larger array z,
shifted by 1 to allow for the storage of immediate neighbor components. The first and last
components of u are sent to neighboring processors with MPI Send calls, and the immedi-
ate neighbor solution values are received from the neighbor processors with MPI Recv calls,
except that zero is loaded into z[0] or z[my length+1] instead if at the actual boundary.
Then the central difference expressions are easily formed from the z array, and loaded into
the data array of the udot vector.
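A condensed sketch of this communication pattern is given below; it assumes double-precision realtype (hence MPI DOUBLE), the data block fields described earlier, and the problem constants XMAX and NEQ defined at the top of the file.

  #include <stdlib.h>
  #include <mpi.h>
  #include <nvector/nvector_parallel.h>

  typedef struct { int npes, my_pe; MPI_Comm comm; } *UserData;   /* fields as described above */

  static int f(realtype t, N_Vector u, N_Vector udot, void *user_data)
  {
    UserData data = (UserData) user_data;
    int npes = data->npes, my_pe = data->my_pe;
    MPI_Comm comm = data->comm;
    MPI_Status status;
    sunindextype i, N = NV_LOCLENGTH_P(u);               /* local vector length          */
    realtype *udata  = N_VGetArrayPointer(u);
    realtype *dudata = N_VGetArrayPointer(udot);
    realtype *z = (realtype *) malloc((N+2)*sizeof(realtype));
    realtype dx = XMAX/((realtype)(NEQ+1));              /* mesh width used in Eq. (9)   */
    realtype hordc = RCONST(1.0)/(dx*dx);
    realtype horac = RCONST(0.5)/(RCONST(2.0)*dx);

    for (i = 0; i < N; i++) z[i+1] = udata[i];           /* interior of the work array   */

    /* one-element messages to/from the immediate neighbors, as described in the text */
    if (my_pe != 0)      MPI_Send(&udata[0],   1, MPI_DOUBLE, my_pe-1, 0, comm);
    if (my_pe != npes-1) MPI_Send(&udata[N-1], 1, MPI_DOUBLE, my_pe+1, 0, comm);
    if (my_pe != 0)      MPI_Recv(&z[0],   1, MPI_DOUBLE, my_pe-1, 0, comm, &status);
    else                 z[0]   = RCONST(0.0);           /* physical left boundary       */
    if (my_pe != npes-1) MPI_Recv(&z[N+1], 1, MPI_DOUBLE, my_pe+1, 0, comm, &status);
    else                 z[N+1] = RCONST(0.0);           /* physical right boundary      */

    for (i = 1; i <= N; i++)                             /* central differences, Eq. (9) */
      dudata[i-1] = hordc*(z[i+1] - RCONST(2.0)*z[i] + z[i-1]) + horac*(z[i+1] - z[i-1]);

    free(z);
    return(0);
  }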
The cvAdvDiff non p.c file includes a routine check flag that checks the return values
from calls in main. This routine was written to be used by any parallel sundials application.
The output below is for cvAdvDiff non p with MX = 10 and four processors. Varying
the number of processors will alter the output, only because of roundoff-level differences in
various vector operations. The fairly high value of ncfn indicates that this problem is on the
borderline of being stiff.
Number of PEs = 2
Final Statistics :
3.2 A user preconditioner example: cvDiurnal kry p
As an example of using cvode with the Krylov linear solver sunlinsol spgmr, cvls linear
solver interface, and the parallel MPI nvector parallel module, we describe a test problem
based on the system PDEs given above for the cvDiurnal kry example. As before, we
discretize the PDE system with central differencing, to obtain an ODE system u̇ = f (t, u)
representing (4). But in this case, the discrete solution vector is distributed over many
processors. Specifically, we may think of the processors as being laid out in a rectangle, and
each processor being assigned a subgrid of size MXSUB×MYSUB of the x − y grid. If there are
NPEX processors in the x direction and NPEY processors in the y direction, then the overall grid
size is MX×MY with MX=NPEX×MXSUB and MY=NPEY×MYSUB, and the size of the ODE system
is 2·MX·MY.
To compute f in this setting, the processors pass and receive information as follows. The
solution components for the bottom row of grid points in the current processor are passed
to the processor below it and the solution for the top row of grid points is received from
the processor below the current processor. The solution for the top row of grid points for
the current processor is sent to the processor above the current processor, while the solution
for the bottom row of grid points is received from that processor by the current processor.
Similarly the solution for the first column of grid points is sent from the current processor to
the processor to its left and the last column of grid points is received from that processor by
the current processor. The communication for the solution at the right edge of the processor
is similar. If this is the last processor in a particular direction, then message passing and
receiving are bypassed for that direction.
This code is intended to provide a more realistic example than that in cvAdvDiff non p,
and to provide a template for a stiff ODE system arising from a PDE system. The solution
method is BDF with Newton iteration and spgmr. The left preconditioner is the block-
diagonal part of the Newton matrix, with 2 × 2 blocks, and the corresponding diagonal
blocks of the Jacobian are saved each time the preconditioner is generated, for re-use later
under certain conditions.
The organization of the cvDiurnal kry p program deserves some comments. The right-
hand side routine f calls two other routines: ucomm, which carries out inter-processor commu-
nication; and fcalc, which operates on local data only and contains the actual calculation of
f (t, u). The ucomm function in turn calls three routines which do, respectively, non-blocking
receive operations, blocking send operations, and receive-waiting. All three use MPI, and
transmit data from the local u vector into a local working array uext, an extended copy of
u. The fcalc function copies u into uext, so that the calculation of f (t, u) can be done
conveniently by operations on uext only. Most other features of cvDiurnal kry p.c are the
same as in cvDiurnal kry.c, except for extra logic involved with distributed vectors.
The following is a sample output from cvDiurnal kry p, for four processors (in a 2 × 2
array) with a 5 × 5 subgrid on each. The output will vary slightly if the number of processors
is changed.
t = 1.44e+04     no. steps = 251   order = 5   stepsize = 3.77e+02
At bottom left:  c1, c2 = 6.659e+06   2.582e+11
At top right:    c1, c2 = 7.301e+06   2.833e+11
Final Statistics :
3.3 A CVBBDPRE preconditioner example: cvDiurnal kry bbd p
In this example, cvDiurnal kry bbd p, we solve the same problem as in cvDiurnal kry p
above, but instead of supplying the preconditioner, we use the cvbbdpre module, which
generates and uses a band-block-diagonal preconditioner. The half-bandwidths of the Jaco-
bian block on each processor are both equal to 2·MXSUB, and that is the value supplied as
mudq and mldq in the call to CVBBDPrecInit. But in order to reduce storage and computa-
tion costs for preconditioning, we supply the values mukeep = mlkeep = 2 (= NVARS) as the
half-bandwidths of the retained band matrix blocks. This means that the Jacobian elements
are computed with a difference quotient scheme using the true bandwidth of the block, but
only a narrow band matrix (bandwidth 5) is kept as the preconditioner.
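A sketch of the corresponding call (with dqrely = 0 selecting the default difference increment, and MXSUB, MYSUB, and flocal as described here) is:

  sunindextype local_N = 2*MXSUB*MYSUB;        /* local ODE system size (NVARS = 2)            */
  sunindextype mudq = 2*MXSUB, mldq = 2*MXSUB; /* bandwidths used for the difference quotients */
  sunindextype mukeep = 2, mlkeep = 2;         /* bandwidths of the retained blocks            */
  int flag;

  flag = CVBBDPrecInit(cvode_mem, local_N, mudq, mldq, mukeep, mlkeep,
                       RCONST(0.0), flocal, NULL);   /* local f routine; no separate comm routine */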
As in cvDiurnal kry p.c, the f routine in cvDiurnal kry bbd p.c simply calls a com-
munication routine, fucomm, and then a strictly computational routine, flocal. However,
the call to CVBBDPrecInit specifies the pair of routines to be called as ucomm and flocal,
where ucomm is NULL. This is because each call by the solver to ucomm is preceded by a call
to f with the same (t,u) arguments, and therefore the communication needed for flocal in
the solver's calls to it has already been done.
In cvDiurnal kry bbd p.c, the problem is solved twice — first with preconditioning on
the left, and then on the right. Thus prior to the second solution, calls are made to reset the
initial values (SetInitialProfiles), the main solver memory (CVodeReInit), the cvbbd-
pre memory (CVBBDPrecReInit), as well as the preconditioner type (SUNLinSol SPGMRSetPrecType).
Sample output from cvDiurnal kry bbd p follows, again using 5 × 5 subgrids on a 2 × 2
processor grid. The performance of the preconditioner, as measured by the number of Krylov
iterations per Newton iteration, nli/nni, is very close to that of cvDiurnal kry p when
preconditioning is on the left, but slightly poorer when it is on the right.
At bottom left:  c1, c2 = 1.404e+04   3.387e+11
At top right:    c1, c2 = 1.561e+04   3.765e+11
Final Statistics :
-------------------------------------------------------------------
t = 2.16e+04     no. steps = 249   order = 5   stepsize = 4.31e+02
At bottom left:  c1, c2 = 2.665e+07   2.993e+11
At top right:    c1, c2 = 2.931e+07   3.313e+11
Final Statistics :
4 hypre example problems
4.1 A nonstiff example: cvAdvDiff non ph
This example is the same as cvAdvDiff non p, except that it uses the hypre vector type instead of
the sundials native parallel vector implementation. The outputs from the two examples are
identical. In the following, we point out only the differences between the two. Familiarity
with the hypre library [1] is helpful.
We use the hypre IJ vector interface to allocate the template vector and create parallel
partitioning:
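That listing is not reproduced here; a minimal sketch of the calls, assuming the standard hypre IJ interface and a contiguous global index range [my base, my base + local N − 1] on each process, is:

  HYPRE_IJVector Uij;
  HYPRE_IJVectorCreate(comm, my_base, my_base + local_N - 1, &Uij);  /* this PE's global index range    */
  HYPRE_IJVectorSetObjectType(Uij, HYPRE_PARCSR);                    /* ParCSR-compatible parallel type */
  HYPRE_IJVectorInitialize(Uij);                                     /* ready for SetValues calls       */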
The initialize call means that vector elements are ready to be set using the IJ interface. We
choose an initial condition vector x0 = x(t0 ) as the template vector and we set its values in
the SetIC(...) function. We complete the hypre vector assembly by:
HYPRE_IJVectorAssemble(Uij);
HYPRE_IJVectorGetObject(Uij, (void**) &Upar);
The assemble call is collective and it makes the hypre vector ready to use. This sets the handle
Upar to the actual hypre vector. The handle is then passed to the N_VMake function, which
creates the template N_Vector as a wrapper around the hypre vector. All other vectors
in the computation are created by cloning the template vector. The template vector does
not own the underlying hypre vector, and it is the user’s responsibility to destroy it using
a HYPRE_IJVectorDestroy(Uij) call after the template vector has been destroyed. This
function will destroy both the hypre vector and its IJ interface.
To access individual elements of solution vectors u and udot in the residual function, the
user needs to extract the hypre vector first by calling N_VGetVector_ParHyp, and then use
hypre methods from that point on.
Notes
• At this point, interfaces to hypre solvers and preconditioners are not available. They
will be provided in subsequent sundials releases. The interface to the hypre vector is
included in this release mainly for testing purposes and as a preview of functionality to
come.
5 CUDA example problems
5.1 An unpreconditioned Krylov example: cvAdvDiff kry cuda
The example program cvAdvDiff kry cuda.cu solves the same 2-D advection-diffusion equa-
tion as in Section 2.2, but instead of using a banded direct solver, it uses an unpreconditioned
Krylov solver. Here we only highlight the differences between the two examples.
The cvAdvDiff kry cuda.cu program includes the file sunlinsol spgmr.h in order to use
the spgmr Krylov linear solver. File cvode.h provides the prototypes for CVodeSetLinearSolver,
which sets the iterative linear solver for cvode, and CVodeSetJacTimes, which sets the
pointer to the user-supplied Jacobian-vector product function. The file nvector cuda.h is
included for the definition of the cuda N Vector type. The prototype vector is created using
the N VNew Cuda function.
In order to get good performance and avoid moving data between host and device at
every iteration, it is recommended that the user evaluate the model on the device. In this example,
the model right-hand side and Jacobian-vector product are implemented as cuda kernels fKernel
and jtvKernel, respectively. The user-provided C functions f and jtv, which are called directly
by cvode, set the thread partitioning and launch their respective cuda kernels. Vector data on
the device is accessed using the N VGetDeviceArrayPointer Cuda function.
The output generated by cvAdvDiff kry cuda is shown below.
Final Statistics ..
6 RAJA example problems
6.1 An unpreconditioned Krylov example: cvAdvDiff kry raja
The example program cvAdvDiff kry raja.cu solves the same 2-D advection-diffusion equa-
tion as in Sections 2.2 and 5.1.
The file nvector raja.h contains the definition of the raja N Vector type, and RAJA.hpp
contains the definition of the raja forall loops. The prototype vector in the main body of the program
is created using the N VNew Raja function.
In order to get good performance and avoid moving data between host and device at ev-
ery iteration, it is recommended that the user evaluate the model on the device. In this example, the user-
supplied model right-hand side and Jacobian-vector product functions, f and jtv, operate on
the device data. Vector data on the device is accessed using the N VGetDeviceArrayPointer Raja
function. Looping over vector components is implemented using raja forall loops.
The output generated by cvAdvDiff kry raja is shown below.
Final Statistics ..
7 Fortran example problems
The Fortran example problem programs supplied with the cvode package are all written
in standard Fortran77 and use double precision arithmetic. Before running any of these
examples, the user should make sure that the Fortran data types for real and integer
variables appropriately match the C types. See §5.2.2 in the cvode User Document for
details.
c2 (bot.left/middle/top rt.) = 0.338035E+12   0.502929E+12   0.375096E+12
Final statistics :
The source file for this problem begins with MPI calls to initialize MPI and to get the
number of processors and local processor index. FNVINITP is called to initialize the MPI-
parallel nvector module, while FSUNSPGMRINIT and FSUNSPGMRSETGSTYPE are called to
initialize the spgmr sunlinsol module. Following the call to FCVMALLOC, the linear solver
and preconditioner are attached to cvode with calls to FCVLSINIT and FCVBBDINIT. In a
loop over TOUT values, it calls FCVODE and prints the step and f evaluation counters. After
that, it computes and prints the maximum global error, and all the relevant performance
counters. Those specific to cvbbdpre are obtained by a call to FCVBBDOPT. To prepare for
the second run, the program calls FCVREINIT, FCVBBDREINIT, and FSUNSPGMRSETPRECTYPE,
in addition to resetting the initial conditions. Finally, it frees memory and terminates MPI.
Notice that in the FCVFUN routine, the local processor index MYPE and the local vector size
NLOCAL are used to form the global index values needed to evaluate the right-hand side of
Eq. (10).
The following is a sample output from fcvDiag kry bbd p, with NPES = 4. As expected,
the performance is identical for left vs right preconditioning.
NEQ = 20
parameter alpha = 10.000
ydot_i = -alpha * i * y_i   (i = 1,...,NEQ)
RTOL, ATOL = 0.1E-04   0.1E-09
Method is BDF/NEWTON/SPGMR
Preconditioner is band-block-diagonal, using CVBBDPRE
Number of processors = 2
Preconditioning on left
Final statistics :
In CVBBDPRE:
real/int local workspace = 100  60
number of g evals. = 12
------------------------------------------------------------
Preconditioning on right
Final statistics :
8 Parallel tests
The stiff example problem cvDiurnal kry described above, or rather its parallel version
cvDiurnal kry p, has been modified and expanded to form a test problem for the parallel
version of cvode. This work was largely carried out by M. Wittman and reported in [3].
To start with, in order to add realistic complexity to the solution, the initial profile for
this problem was altered to include a rather steep front in the vertical direction. Specifically,
the function β(y) in Eq. (6) has been replaced by:
This function rises from about .5 to about 1.0 over a y interval of about .2 (i.e. 1/100 of
the total span in y). This vertical variation, together with the horizontal advection and
diffusion in the problem, demands a fairly fine spatial mesh to achieve acceptable resolution.
In addition, an alternate choice of differencing is used in order to control spurious oscilla-
tions resulting from the horizontal advection. In place of central differencing for that term,
a biased upwind approximation is applied to each of the terms ∂ci /∂x, namely:
∂c/∂x|_{x_j} ≈ [ (3/2) c_{j+1} − c_j − (1/2) c_{j−1} ] / (2∆x) .              (12)
With this modified form of the problem, we performed tests similar to those described
above for the example. Here we fix the subgrid dimensions at MXSUB = MYSUB = 50, so that
the local (per-processor) problem size is 5000, while the processor array dimensions, NPEX
and NPEY, are varied. In one (typical) sequence of tests, we fix NPEY = 8 (for a vertical mesh
size of MY = 400), and set NPEX = 8 (MX = 400), NPEX = 16 (MX = 800), and NPEX = 32 (MX
= 1600). Thus the largest problem size N is 2 · 400 · 1600 = 1, 280, 000. For these tests, we
also raise the maximum Krylov dimension, maxl, to 10 (from its default value of 5).
For each of the three test cases, the test program was run on a Cray-T3D (256 processors)
with each of three different message-passing libraries: MPICH, EPCC MPI, and the Cray
SHMEM library (denoted MPICH, EPCC, and SHMEM in Table 1).
The following table gives the run time and selected performance counters for these 9 runs.
In all cases, the solutions agreed well with each other, showing expected small variations with
grid size. In the table, M-P denotes the message-passing library, RT is the reported run time
in CPU seconds, nst is the number of time steps, nfe is the number of f evaluations, nni is
the number of nonlinear (Newton) iterations, nli is the number of linear (Krylov) iterations,
and npe is the number of evaluations of the preconditioner.
Some of the results were as expected, and some were surprising. For a given mesh size,
variations in performance counts were small or absent, except for moderate (but still accept-
able) variations for SHMEM in the smallest case. The increase in costs with mesh size can
be attributed to a decline in the quality of the preconditioner, which neglects most of the
spatial coupling. The preconditioner quality can be inferred from the ratio nli/nni, which
is the average number of Krylov iterations per Newton iteration. The most interesting (and
unexpected) result is the variation of run time with library: SHMEM is the most efficient,
NPEX M-P RT nst nfe nni nli npe
8 MPICH 436. 1391 9907 1512 8392 24
8 EPCC 355. 1391 9907 1512 8392 24
8 SHMEM 349. 1999 10,326 2096 8227 34
16 MPICH 676. 2513 14,159 2583 11,573 42
16 EPCC 494. 2513 14,159 2583 11,573 42
16 SHMEM 471. 2513 14,160 2581 11,576 42
32 MPICH 1367. 2536 20,153 2696 17,454 43
32 EPCC 737. 2536 20,153 2696 17,454 43
32 SHMEM 695. 2536 20,121 2694 17,424 43
Table 1: Parallel cvode test results vs problem size and message-passing library
but EPCC is a very close second, and MPICH loses considerable efficiency by comparison, as
the problem size grows. This means that the highly portable MPI version of cvode, with an
appropriate choice of MPI implementation, is fully competitive with the Cray-specific version
using the SHMEM library. While the overall costs do not represent a well-scaled parallel
algorithm (because of the preconditioner choice), the cost per function evaluation is quite
flat for EPCC and SHMEM, at .033 to .037 CPU seconds (for MPICH it ranges from .044 to .068).
For tests that demonstrate speedup from parallelism, we consider runs with fixed problem
size: MX = 800, MY = 400. Here we also fix the vertical subgrid dimension at MYSUB = 50 and
the vertical processor array dimension at NPEY = 8, but vary the corresponding horizontal
sizes. We take NPEX = 8, 16, and 32, with MXSUB = 100, 50, and 25, respectively. The runs for
the three cases and three message-passing libraries all show very good agreement in solution
values and performance counts. The run times for EPCC are 947, 494, and 278, showing
speedups of 1.92 and 1.78 as the number of processors is doubled (twice). For the SHMEM
runs, the times were slightly lower and the ratios were 1.98 and 1.91. For MPICH, consistent
with the earlier runs, the run times were considerably higher, and in fact show speedup ratios
of only 1.54 and 1.03.
References
[1] R. Falgout and U. M. Yang. hypre User's Manual. Technical report, LLNL, 2015.
[2] A. C. Hindmarsh and R. Serban. User Documentation for CVODE v4.0.0. Technical
Report UCRL-SM-208108, LLNL, 2018.
[3] M. R. Wittman. Testing of PVODE, a Parallel ODE Solver. Technical Report UCRL-
ID-125562, LLNL, August 1996.