PETSc/TAO Users Manual
Argonne National Laboratory
www.anl.gov
DOCUMENT AVAILABILITY
Online Access: U.S. Department of Energy (DOE) reports produced after 1991 and a
growing number of pre-1991 documents are available free at OSTI.GOV
(http://www.osti.gov/), a service of the US Dept. of Energy’s Office of Scientific and
Technical Information.
Reports not in digital format may be purchased by the public from the
National Technical Information Service (NTIS):
U.S. Department of Commerce
National Technical Information Service
5301 Shawnee Rd
Alexandria, VA 22312
www.ntis.gov
Phone: (800) 553-NTIS (6847) or (703) 605-6000
Fax: (703) 605-6900
Email: [email protected]
Reports not in digital format are available to DOE and DOE contractors from the
Office of Scientific and Technical Information (OSTI):
U.S. Department of Energy
Office of Scientific and Technical Information
P.O. Box 62
Oak Ridge, TN 37831-0062
www.osti.gov
Phone: (865) 576-8401
Fax: (865) 576-5728
Email: [email protected]
Disclaimer
This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States
Government nor any agency thereof, nor UChicago Argonne, LLC, nor any of their employees or officers, makes any warranty, express or
implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus,
product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific
commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its
endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of document
authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof, Argonne
National Laboratory, or UChicago Argonne, LLC.
ANL-21/39 Rev 3.19
Prepared by
S. Balay (1), S. Abhyankar (1,2), M. F. Adams (3), S. Benson (1), J. Brown (1,10), P. Brune (1), K. Buschelman (1),
E. M. Constantinescu (1), L. Dalcin (4), A. Dener (1), V. Eijkhout (6), J. Faibussowitsch (1,18), W. D. Gropp (1,18),
V. Hapla (8), T. Isaac (1,14), P. Jolivet (12,22), D. Karpeev (1), D. Kaushik (1), M. G. Knepley (1,9), F. Kong (1,11),
S. Kruger (15), D. A. May (7,21), L. Curfman McInnes (1), R. Tran Mills (1), L. Mitchell (13,20), T. Munson (1),
J. E. Roman (16), K. Rupp (1,19), P. Sanan (1,8), J. Sarich (1), B. F. Smith (1,17), S. Zampini (4), H. Zhang (1,5),
H. Zhang (1), and J. Zhang (1)

Affiliations:
1. Mathematics and Computer Science Division, Argonne National Laboratory
2. Electricity Infrastructure and Buildings Division, Pacific Northwest National Laboratory
3. Computational Research, Lawrence Berkeley National Laboratory
4. Extreme Computing Research Center, King Abdullah University of Science and Technology
5. Department of Computer Science, Illinois Institute of Technology
6. Texas Advanced Computing Center, University of Texas at Austin
7. Department of Earth Sciences, University of Oxford
8. Institute of Geophysics, ETH Zurich
9. Department of Computer Science and Engineering, University at Buffalo
10. Department of Computer Science, University of Colorado, Boulder
11. Computational Frameworks, Idaho National Laboratory
12. Sorbonne Université, CNRS, LIP6
13. NVIDIA Corporation
14. College of Computing, Georgia Tech
15. Tech-X Corporation
16. DSIC, Universitat Politècnica de València
17. Flatiron Institute, Simons Foundation
18. University of Illinois, Urbana-Champaign
19. Institute for Microelectronics, TU Wien
20. Department of Computer Science, Durham University
21. Scripps Institution of Oceanography, University of California, San Diego
22. Toulouse INP, CNRS, Institute of Computer Science Research
June 1, 2023
This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S.
Department of Energy, under Contract DE-AC02-06CH11357.
CONTENTS

1 Introduction to PETSc
    1.1 About This Manual
    1.2 Getting Started
        1.2.1 Suggested Reading
        1.2.2 Running PETSc Programs
        1.2.3 Writing PETSc Programs
        1.2.4 Simple PETSc Examples
    1.3 Parallel and GPU Programming
        1.3.1 MPI Parallelism
        1.3.2 CPU SIMD parallelism
        1.3.3 CPU OpenMP parallelism
        1.3.4 GPU kernel parallelism
        1.3.5 GPU stream parallelism
    1.4 Compiling and Running Programs
    1.5 Profiling Programs
    1.6 Writing C/C++ or Fortran Applications
    1.7 PETSc’s Object-Oriented Design
        1.7.1 User Callbacks
    1.8 Directory Structure
Bibliography
CHAPTER ONE

INTRODUCTION TO PETSC
This manual describes the use of the Portable, Extensible Toolkit for Scientific Computation (PETSc) and
the Toolkit for Advanced Optimization (TAO) for the numerical solution of partial differential equations and
related problems on high-performance computers. PETSc/TAO is a suite of data structures and routines
that provide the building blocks for the implementation of large-scale application codes on parallel (and
serial) computers. PETSc uses the MPI standard for all distributed memory communication.
PETSc/TAO includes a large suite of parallel linear solvers, nonlinear solvers, time integrators, and optimization
methods that may be used in application codes written in Fortran, C, C++, and Python (via petsc4py; see
Getting Started). PETSc provides many of the mechanisms needed within parallel application codes, such
as parallel matrix and vector assembly routines. The library is organized hierarchically, enabling users to
employ the level of abstraction that is most appropriate for a particular problem. By using techniques of
object-oriented programming, PETSc provides enormous flexibility for users.
PETSc is a sophisticated set of software tools; as such, for some users it initially has a much steeper learning
curve than packages such as MATLAB or a simple subroutine library. In particular, for individuals without
some computer science background, experience programming in C, C++, Python, or Fortran, or experience
using a debugger such as gdb or lldb, it may require a significant amount of time to take full advantage of
the features that enable efficient software use. However, the power of the PETSc design and the algorithms
it incorporates may make the efficient implementation of many application codes simpler than “rolling them”
yourself.
• For many tasks a package such as MATLAB is often the best tool; PETSc is not intended for the
classes of problems for which effective MATLAB code can be written.
• There are several packages (listed on https://petsc.org/), built on PETSc, that may satisfy your needs
without requiring you to use PETSc directly. We recommend reviewing these packages’ functionality before
starting to code directly with PETSc.
• PETSc can be used to provide an “MPI parallel linear solver” in an otherwise sequential or OpenMP-parallel
code. This approach cannot provide extremely large improvements in the application time by utilizing large
numbers of MPI processes, but it can still improve performance. Certainly not all parts of a previously
sequential code need be parallelized, but the matrix generation portion must be parallelized to expect true
scalability to large numbers of MPI processes. See PCMPI for details on how to utilize the PETSc MPI
linear solver server.
Since PETSc is under continued development, small changes in usage and calling sequences of routines will
occur. PETSc has been supported for twenty-five years; see mailing list information on our website for
information on contacting support.
PETSc consists of a collection of classes, which are discussed in detail in later parts of the manual (The
Solvers in PETSc/TAO and Additional Information). The important PETSc classes include
• index sets (IS), for indexing into vectors, renumbering, permuting, etc;
• vectors (Vec); Vectors and Parallel Data
• matrices (Mat) (generally sparse); Matrices
• Krylov subspace methods (KSP); KSP: Linear System Solvers
• preconditioners, including multigrid, block solvers, patch solvers, and sparse direct solvers (PC);
• nonlinear solvers (SNES); SNES: Nonlinear Solvers
• timesteppers for solving time-dependent (nonlinear) PDEs, including support for differential algebraic
equations, and the computation of adjoints (sensitivities/gradients of the solutions) (TS); TS: Scalable
ODE and DAE Solvers
• scalable optimization algorithms, including a rich set of gradient-based optimizers, Newton-based
optimizers, and optimization with constraints (Tao). TAO: Optimization Solvers
• code for managing interactions between mesh data structures and vectors, matrices, and solvers (DM);
DM Basics
Each class consists of an abstract interface (simply a set of calling sequences; an abstract base class in C++)
and an implementation for each algorithm and data structure. This design enables easy comparison and use of
different algorithms (for example, to experiment with different Krylov subspace methods, preconditioners, or
truncated Newton methods). Hence, PETSc provides a rich environment for modeling scientific applications
as well as for rapid algorithm design and prototyping.
The classes enable easy customization and extension of both algorithms and implementations. This approach
promotes code reuse and flexibility, and also separates the issues of parallelism from the choice of algorithms.
The PETSc infrastructure creates a foundation for building large-scale applications.
It is useful to consider the interrelationships among different pieces of PETSc. Numerical Libraries in
PETSc is a diagram of some of these pieces. The figure illustrates the library’s hierarchical organization,
which enables users to employ the solvers that are most appropriate for a particular problem.
Additional Information describes a variety of useful information, including profiling, the options database,
viewers, error handling, and some details of PETSc design.
PETSc has evolved to become quite a comprehensive package, and therefore this manual can be rather
intimidating for new users. Bear in mind that PETSc can be used efficiently before one understands all
of the material presented here. Furthermore, the definitive reference for any PETSc function is always the
online manual page. Manual pages for all PETSc functions can be accessed here. The manual pages provide
hyperlinked indices (organized by both concept and routine name) to the tutorial examples and enable easy
movement among related topics.
Visual Studio Code, Eclipse, Emacs, and Vim users may find their development environment’s options for
searching in the source code useful for exploring the PETSc source code. Details of these features are
provided in Developer Environments.
The complete PETSc distribution, manual pages, and additional information are available via the PETSc
home page. The PETSc home page also contains details regarding installation, new features and changes in
recent versions of PETSc, machines that we currently support, and a frequently asked questions (FAQ) list.
Note to Fortran Programmers: In most of the manual, the examples and calling sequences are given for
the C/C++ family of programming languages. However, Fortran programmers can use all of the functionality
of PETSc from Fortran, with only minor differences in the user interface. PETSc for Fortran Users provides
a discussion of the differences between using PETSc from Fortran and C, as well as several complete Fortran
examples.
Note to Python Programmers: To program with PETSc in Python you need to enable Python bindings
(i.e., petsc4py) with the configure option --with-petsc4py=1. See the PETSc installation guide for more
details.
Before using PETSc, the user must first set the environmental variable $PETSC_DIR, indicating the full
path of the PETSc home directory. For example, under the Unix bash shell a command of the form
$ export PETSC_DIR=$HOME/petsc
can be placed in the user’s .bashrc or other startup file. In addition, the user may need to set the
environment variable $PETSC_ARCH to specify a particular configuration of the PETSc libraries. Note that
$PETSC_ARCH is just a name selected by the installer to refer to the libraries compiled for a particular
set of compiler options and machine type. Using different values of $PETSC_ARCH allows one to switch
between several different sets (say debug and optimized) of libraries easily. To determine if you need to set
$PETSC_ARCH, look in the directory indicated by $PETSC_DIR; if there are subdirectories beginning with
arch, then those subdirectories give the possible values for $PETSC_ARCH.
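For example, on a machine with one debug and one optimized build, the listing and selection might look as follows (the arch- names here are illustrative; use whatever subdirectories appear under your own $PETSC_DIR):

$ ls $PETSC_DIR | grep ^arch
arch-linux-c-debug
arch-linux-c-opt
$ export PETSC_ARCH=arch-linux-c-debug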
See handson to immediately jump in and run PETSc code.
All PETSc programs use the MPI (Message Passing Interface) standard for message-passing communication
[For94]. Thus, to execute PETSc programs, users must know the procedure for beginning MPI jobs on their
selected computer system(s). For instance, when using the MPICH implementation of MPI and many others,
the following command initiates a program that uses eight processors:
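(the launcher name is system-dependent and petsc_program_name below is a placeholder for your executable)

$ mpiexec -n 8 ./petsc_program_name petsc_options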
PETSc also comes with a script that automatically uses the correct mpiexec for your configuration.
All PETSc-compliant programs support the use of the -help option as well as the -version option.
Certain options are supported by all PETSc programs. We list a few particularly useful ones below; a
complete list can be obtained by running any PETSc program with the option -help.
• -log_view - summarize the program’s performance (see Profiling)
• -fp_trap - stop on floating-point exceptions; for example, divide by zero
• -malloc_dump - enable memory tracing; dump list of unfreed memory at conclusion of the run, see
Detecting Memory Allocation Problems,
• -malloc_debug - enable memory debugging (by default this is activated for the debugging version
of PETSc), see Detecting Memory Allocation Problems,
• -start_in_debugger [noxterm,gdb,lldb] [-display name] - start all processes in the
debugger. See Debugging for more information on debugging PETSc programs.
• -on_error_attach_debugger [noxterm,gdb,lldb] [-display name] - start debugger only
on encountering an error
• -info - print a great deal of information about what the program is doing as it runs
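Most PETSc programs begin by calling PetscInitialize(); a sketch of its C/C++ calling sequence is shown below (see the PetscInitialize() manual page for the authoritative prototype):

PetscInitialize(int *argc, char ***argv, char *file, char *help);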
which initializes PETSc and MPI. The arguments argc and argv are the command line arguments delivered
in all C and C++ programs. The argument file optionally indicates an alternative name for the PETSc
options file, .petscrc, which resides by default in the user’s home directory. Runtime Options provides
details regarding this file and the PETSc options database, which can be used for runtime customization.
The final argument, help, is an optional character string that will be printed if the program is run with the
-help option. In Fortran the initialization command has the form
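A sketch of the Fortran form (the exact binding may vary slightly between PETSc versions; see PETSc for Fortran Users) is

call PetscInitialize(file, ierr)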
where the file argument is optional. PetscInitialize() automatically calls MPI_Init() if MPI has
not been previously initialized. In certain circumstances in which MPI needs to be initialized directly
(or is initialized by some other library), the user can first call MPI_Init() (or have the other library
do it), and then call PetscInitialize(). By default, PetscInitialize() sets the PETSc “world”
communicator PETSC_COMM_WORLD to MPI_COMM_WORLD.
For those not familiar with MPI, a communicator is a way of indicating a collection of processes that will
be involved together in a calculation or communication. Communicators have the variable type MPI_Comm.
In most cases users can employ the communicator PETSC_COMM_WORLD to indicate all processes in a given
run and PETSC_COMM_SELF to indicate a single process.
MPI provides routines for generating new communicators consisting of subsets of processors, though most
users rarely need to use these. The book Using MPI, by Lusk, Gropp, and Skjellum [GLS94] provides an
excellent introduction to the concepts in MPI. See also the MPI homepage. Note that PETSc users need
not program much message passing directly with MPI, but they must be familiar with the basic concepts of
message passing and distributed memory computing.
All PETSc programs should call PetscFinalize() as their final (or nearly final) statement. This routine
handles options to be processed at the conclusion of the program and calls MPI_Finalize() if
PetscInitialize() began MPI. If MPI was initiated externally from PETSc (by either the user or another software
package), the user is responsible for calling MPI_Finalize().
Error Checking
Most PETSc functions return a PetscErrorCode, which is an integer indicating whether an error has
occurred during the call. The error code is set to be nonzero if an error has been detected; otherwise, it is
zero. For the C/C++ interface, the error variable is the routine’s return value, while for the Fortran version,
each PETSc routine has as its final argument an integer error variable.
One should always check these routine values as given below in the C/C++ and Fortran formats, respectively:
PetscCall(PetscFunction(Args));
or
! within the main program
PetscCallA(PetscFunction(Args,ierr))
These macros check the returned error code and if it is nonzero they call the PETSc error handler and then
return from the function with the error code. PetscCallA() calls abort after calling the error handler
because it is not possible to return from a Fortran main program. The above macros should be used in
all subroutines to enable a complete error traceback. See Error Checking for more details on PETSc error
handling.
To help the user start using PETSc immediately, we begin with a simple uniprocessor example that solves
the one-dimensional Laplacian problem with finite differences. This sequential code, which can be found
in $PETSC_DIR/src/ksp/ksp/tutorials/ex1.c, illustrates the solution of a linear system with KSP,
the interface to the preconditioners, Krylov subspace methods, and direct linear solvers of PETSc. Following
the code we highlight a few of the most important parts of this example.
/*
Include "petscksp.h" so that we can use KSP solvers. Note that this file
automatically includes:
petscsys.h - base PETSc routines petscvec.h - vectors
petscmat.h - matrices petscpc.h - preconditioners
petscis.h - index sets
petscviewer.h - viewers
*/
#include <petscksp.h>
PetscFunctionBeginUser;
PetscCall(PetscInitialize(&argc, &args, (char *)0, help));
PetscCallMPI(MPI_Comm_size(PETSC_COMM_WORLD, &size));
PetscCheck(size == 1, PETSC_COMM_WORLD, PETSC_ERR_WRONG_MPI_SIZE, "This is a uniprocessor example only!");
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Compute the matrix and right-hand-side vector that define
the linear system, Ax = b.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create vectors. Note that we form 1 vector from scratch and
then duplicate as needed.
*/
PetscCall(VecCreate(PETSC_COMM_SELF, &x));
PetscCall(PetscObjectSetName((PetscObject)x, "Solution"));
PetscCall(VecSetSizes(x, PETSC_DECIDE, n));
PetscCall(VecSetFromOptions(x));
PetscCall(VecDuplicate(x, &b));
PetscCall(VecDuplicate(x, &u));
/*
Create matrix. When using MatCreate(), the matrix format can
be specified at runtime.
*/
/*
Assemble matrix
*/
value[0] = -1.0;
value[1] = 2.0;
value[2] = -1.0;
for (i = 1; i < n - 1; i++) {
col[0] = i - 1;
col[1] = i;
col[2] = i + 1;
PetscCall(MatSetValues(A, 1, &i, 3, col, value, INSERT_VALUES));
}
i = n - 1;
/*
Set exact solution; then compute right-hand-side vector.
*/
PetscCall(VecSet(u, 1.0));
PetscCall(MatMult(A, u, b));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create the linear solver and set various options
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(KSPCreate(PETSC_COMM_SELF, &ksp));
/*
Set operators. Here the matrix that defines the linear system
also serves as the matrix that defines the preconditioner.
*/
PetscCall(KSPSetOperators(ksp, A, A));
/*
Set linear solver defaults for this problem (optional).
- By extracting the KSP and PC contexts from the KSP context,
we can then directly call any KSP and PC routines to set
various options.
- The following four statements are optional; all of these
parameters could alternatively be specified at runtime via
KSPSetFromOptions();
*/
PetscCall(KSPGetPC(ksp, &pc));
PetscCall(PCSetType(pc, PCJACOBI));
PetscCall(KSPSetTolerances(ksp, 1.e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT));
/*
Set runtime options, e.g.,
-ksp_type <type> -pc_type <type> -ksp_monitor -ksp_rtol <rtol>
These options will override those specified above as long as
KSPSetFromOptions() is called _after_ any other customization
routines.
*/
PetscCall(KSPSetFromOptions(ksp));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Solve the linear system
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(KSPSolve(ksp, b, x));
/*
View solver info; we could instead use the option -ksp_view to
print this info to the screen at the conclusion of KSPSolve().
*/
PetscCall(KSPView(ksp, PETSC_VIEWER_STDOUT_SELF));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Check the solution and clean up
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(VecAXPY(x, -1.0, u));
PetscCall(VecNorm(x, NORM_2, &norm));
PetscCall(KSPGetIterationNumber(ksp, &its));
PetscCall(PetscPrintf(PETSC_COMM_SELF, "Norm of error %g, Iterations %" PetscInt_FMT "\n", (double)norm, its));
/* check that KSP automatically handles the fact that the new nonzero values in the matrix are propagated to the KSP solver */
PetscCall(MatShift(A, 2.0));
PetscCall(KSPSolve(ksp, b, x));
/*
Free work space. All PETSc objects should be destroyed when they
are no longer needed.
*/
PetscCall(VecDestroy(&x));
PetscCall(VecDestroy(&u));
PetscCall(VecDestroy(&b));
PetscCall(MatDestroy(&A));
PetscCall(KSPDestroy(&ksp));
/*
Always call PetscFinalize() before exiting a program. This routine
- finalizes the PETSc libraries as well as MPI
- provides summary and diagnostic information if certain runtime
options are chosen (e.g., -log_view).
*/
PetscCall(PetscFinalize());
return 0;
}
Include Files
The C/C++ include files for PETSc should be used via statements such as
#include <petscksp.h>
where petscksp.h is the include file for the linear solver library. Each PETSc program must specify an
include file that corresponds to the highest level PETSc objects needed within the program; all of the required
lower level include files are automatically included within the higher level files. For example, petscksp.h
includes petscmat.h (matrices), petscvec.h (vectors), and petscsys.h (base PETSc file). The
PETSc include files are located in the directory $PETSC_DIR/include. See Modules and Include Files for
a discussion of PETSc include files in Fortran programs.
As shown in Simple PETSc Examples, the user can input control data at run time using the options database.
In this example the command PetscOptionsGetInt(NULL,NULL,"-n",&n,&flg); checks whether
the user has provided a command line option to set the value of n, the problem dimension. If so, the variable
n is set accordingly; otherwise, n remains unchanged. A complete description of the options database may
be found in Runtime Options.
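A minimal sketch of this usage (the variable names n and flg follow the example above):

PetscInt  n   = 10;            /* default value used when -n is not given */
PetscBool flg = PETSC_FALSE;   /* set to PETSC_TRUE if the option was found */
PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, &flg));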
Vectors
One creates a new parallel or sequential vector, x, of global dimension M with the commands
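(a minimal sketch of the creation calls; see the Vec manual pages for the exact signatures)

VecCreate(MPI_Comm comm, Vec *x);
VecSetSizes(Vec x, PetscInt m, PetscInt M);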
where comm denotes the MPI communicator and m is the optional local size which may be PETSC_DECIDE.
The type of storage for the vector may be set with calls to either VecSetType() or VecSetFromOptions().
Additional vectors of the same type can be formed with
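(sketch; see VecDuplicate() for the exact signature)

VecDuplicate(Vec old, Vec *new);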
The commands
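(sketched with their standard signatures; see the manual pages for the exact prototypes)

VecSet(Vec x, PetscScalar value);
VecSetValues(Vec x, PetscInt n, PetscInt *indices, PetscScalar *values, INSERT_VALUES);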
respectively set all the components of a vector to a particular scalar value and assign a different value
to each component. More detailed information about PETSc vectors, including their basic operations,
scattering/gathering, index sets, and distributed arrays, is discussed in Chapter Vectors and Parallel Data.
Note the use of the PETSc variable type PetscScalar in this example. The PetscScalar is simply
defined to be double in C/C++ (or correspondingly double precision in Fortran) for versions of
PETSc that have not been compiled for use with complex numbers. The PetscScalar data type enables
identical code to be used when the PETSc libraries have been compiled for use with complex numbers.
Numbers discusses the use of complex numbers in PETSc programs.
Matrices
Usage of PETSc matrices and vectors is similar. The user can create a new parallel or sequential matrix, A,
which has M global rows and N global columns, with the routines
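(a minimal sketch; see MatCreate() and MatSetSizes() for the exact signatures)

MatCreate(MPI_Comm comm, Mat *A);
MatSetSizes(Mat A, PETSC_DECIDE, PETSC_DECIDE, PetscInt M, PetscInt N);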
where the matrix format can be specified at runtime via the options database. The user could alternatively
specify each process's number of local rows and columns using m and n.
Generally one then sets the “type” of the matrix, with, for example,
MatSetType(A,MATAIJ);
This causes the matrix A to use the compressed sparse row storage format to store the matrix entries. See
MatType for a list of all matrix types. Values can then be set with the command
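(sketch; see MatSetValues() for the exact prototype)

MatSetValues(Mat A, PetscInt m, PetscInt *im, PetscInt n, PetscInt *in, PetscScalar *values, INSERT_VALUES);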
After all elements have been inserted into the matrix, it must be processed with the pair of commands
MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);
Matrices discusses various matrix formats as well as the details of some basic matrix manipulation routines.
Linear Solvers
After creating the matrix and vectors that define a linear system, Ax = b, the user can then use KSP to
solve the system with the following sequence of commands:
KSPCreate(MPI_Comm comm,KSP *ksp);
KSPSetOperators(KSP ksp,Mat Amat,Mat Pmat);
KSPSetFromOptions(KSP ksp);
KSPSolve(KSP ksp,Vec b,Vec x);
KSPDestroy(KSP *ksp);
The user first creates the KSP context and sets the operators associated with the system (matrix that defines
the linear system, Amat and matrix from which the preconditioner is constructed, Pmat). The user then sets
various options for customized solution, solves the linear system, and finally destroys the KSP context. We
emphasize the command KSPSetFromOptions(), which enables the user to customize the linear solution
method at runtime by using the options database, which is discussed in Runtime Options. Through this
database, the user not only can select an iterative method and preconditioner, but also can prescribe the
convergence tolerance, set various monitoring routines, etc. (see, e.g., Profiling Programs).
KSP: Linear System Solvers describes in detail the KSP package, including the PC and KSP packages for
preconditioners and Krylov subspace methods.
Nonlinear Solvers
Most PDE problems of interest are inherently nonlinear. PETSc provides an interface to tackle the nonlinear
problems directly called SNES. SNES: Nonlinear Solvers describes the nonlinear solvers in detail. We highly
recommend most PETSc users work directly with SNES, rather than using PETSc for the linear problem
and writing their own nonlinear solver.
Error Checking
As noted above PETSc functions return a PetscErrorCode, which is an integer indicating whether an
error has occurred during the call. Below, we indicate a traceback generated by error detection within a
sample PETSc program. The error occurred on line 3618 of the file $PETSC_DIR/src/mat/impls/aij/seq/aij.c
and was caused by trying to allocate too large an array in memory. The routine was called
in the program ex3.c on line 66. See Error Checking for details regarding error checking when using the
PETSc Fortran interface.
$ cd $PETSC_DIR/src/ksp/ksp/tutorials
$ make ex3
$ mpiexec -n 1 ./ex3 -m 100000
[0]PETSC ERROR: --------------------- Error Message --------------------------------
[0]PETSC ERROR: Out of memory. This could be due to allocating
The debug version of the PETSc libraries performs a great deal of checking for memory corruption
(writing outside of array bounds, etc.). The macro CHKMEMQ can be called anywhere in the code to
check the current status of the memory for corruption. By putting several (or many) of these macros into
your code you can usually easily track down in what small segment of your code the corruption has occurred.
One can also use Valgrind to track down memory errors; see the FAQ.
For complete error handling, calls to MPI functions should be made with
PetscCallMPI(MPI_Function(Args)). In the main Fortran program the calls should be
PetscCallMPIA(MPI_Function(Args)).
PETSc has a small number of C/C++ only macros that do not explicitly return error codes. These are used
in the style
XXXBegin(Args);
other code
XXXEnd();
Since PETSc uses the message-passing model for parallel programming and employs MPI for all interprocessor
communication, the user is free to employ MPI routines as needed throughout an application code. However,
by default the user is shielded from many of the details of message passing within PETSc, since these are
hidden within parallel objects, such as vectors, matrices, and solvers. In addition, PETSc provides tools
such as generalized vector scatters/gathers to assist in the management of parallel data.
Recall that the user must specify a communicator upon creation of any PETSc object (such as a vector,
matrix, or solver) to indicate the processors over which the object is to be distributed. For example, as
mentioned above, some commands for matrix, vector, and linear solver creation are:
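(sketched; see the individual manual pages for the exact signatures)

MatCreate(MPI_Comm comm, Mat *A);
VecCreate(MPI_Comm comm, Vec *x);
KSPCreate(MPI_Comm comm, KSP *ksp);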
The creation routines are collective over all processors in the communicator; thus, all processors in the
communicator must call the creation routine. In addition, if a sequence of collective routines is being used,
they must be called in the same order on each processor.
The next example, given below, illustrates the solution of a linear system in parallel. This code, corresponding
to KSP Tutorial ex2, handles the two-dimensional Laplacian discretized with finite differences, where the
linear system is again solved with KSP. The code performs the same tasks as the sequential version within
Simple PETSc Examples. Note that the user interface for initiating the program, creating vectors and
matrices, and solving the linear system is exactly the same for the uniprocessor and multiprocessor examples.
The primary difference between the examples in Simple PETSc Examples and here is that each processor
forms only its local part of the matrix and vectors in the parallel case.
/*
Include "petscksp.h" so that we can use KSP solvers.
*/
#include <petscksp.h>
PetscFunctionBeginUser;
PetscCall(PetscInitialize(&argc, &args, (char *)0, help));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-m", &m, NULL));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Compute the matrix and right-hand-side vector that define
the linear system, Ax = b.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create parallel matrix, specifying only its global dimensions.
When using MatCreate(), the matrix format can be specified at
runtime. Also, the parallel partitioning of the matrix is
determined by PETSc at runtime.
*/
/*
Currently, all PETSc parallel matrix formats are partitioned by
contiguous chunks of rows across the processors. Determine which
rows of the matrix are locally owned.
*/
PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
/*
Set matrix elements for the 2-D, five-point stencil in parallel.
Note: this uses the less common natural ordering that orders first
all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
instead of J = I +- m as you might expect. The more standard ordering
would first do all variables for y = h, then y = 2h etc.
*/
for (Ii = Istart; Ii < Iend; Ii++) {
v = -1.0;
i = Ii / n;
j = Ii - i * n;
if (i > 0) {
J = Ii - n;
PetscCall(MatSetValues(A, 1, &Ii, 1, &J, &v, ADD_VALUES));
}
if (i < m - 1) {
J = Ii + n;
PetscCall(MatSetValues(A, 1, &Ii, 1, &J, &v, ADD_VALUES));
}
if (j > 0) {
J = Ii - 1;
PetscCall(MatSetValues(A, 1, &Ii, 1, &J, &v, ADD_VALUES));
}
if (j < n - 1) {
J = Ii + 1;
PetscCall(MatSetValues(A, 1, &Ii, 1, &J, &v, ADD_VALUES));
}
v = 4.0;
PetscCall(MatSetValues(A, 1, &Ii, 1, &Ii, &v, ADD_VALUES));
}
/*
Assemble matrix, using the 2-step process:
MatAssemblyBegin(), MatAssemblyEnd()
Computations can be done while messages are in transition
by placing code between these two statements.
*/
PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
/*
Create parallel vectors.
- We form 1 vector from scratch and then duplicate as needed.
- When using VecCreate(), VecSetSizes and VecSetFromOptions()
in this example, we specify only the
vector's global dimension; the parallel partitioning is determined
at runtime.
- When solving a linear system, the vectors and matrices MUST
be partitioned accordingly. PETSc automatically generates
appropriately partitioned matrices and vectors.
*/
/*
Set exact solution; then compute right-hand-side vector.
By default we use an exact solution of a vector with all
elements of 1.0;
*/
PetscCall(VecSet(u, 1.0));
PetscCall(MatMult(A, u, b));
/*
View the exact solution vector if desired
*/
flg = PETSC_FALSE;
PetscCall(PetscOptionsGetBool(NULL, NULL, "-view_exact_sol", &flg, NULL));
if (flg) PetscCall(VecView(u, PETSC_VIEWER_STDOUT_WORLD));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create the linear solver and set various options
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
/*
Set operators. Here the matrix that defines the linear system
also serves as the preconditioning matrix.
*/
PetscCall(KSPSetOperators(ksp, A, A));
/*
Set linear solver defaults for this problem (optional).
- By extracting the KSP and PC contexts from the KSP context,
we can then directly call any KSP and PC routines to set
various options.
- The following two statements are optional; all of these
parameters could alternatively be specified at runtime via
KSPSetFromOptions(). All of these defaults can be
overridden at runtime, as indicated below.
*/
PetscCall(KSPSetTolerances(ksp, 1.e-2 / ((m + 1) * (n + 1)), 1.e-50, PETSC_DEFAULT, PETSC_DEFAULT));
/*
Set runtime options, e.g.,
-ksp_type <type> -pc_type <type> -ksp_monitor -ksp_rtol <rtol>
These options will override those specified above as long as
KSPSetFromOptions() is called _after_ any other customization routines.
*/
PetscCall(KSPSetFromOptions(ksp));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Solve the linear system
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(KSPSolve(ksp, b, x));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Check the solution and clean up
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(VecAXPY(x, -1.0, u));
PetscCall(VecNorm(x, NORM_2, &norm));
PetscCall(KSPGetIterationNumber(ksp, &its));
/*
Print convergence information. PetscPrintf() produces a single
print statement from all processes that share a communicator.
An alternative is PetscFPrintf(), which prints to a file.
*/
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Norm of error %g iterations %" PetscInt_FMT "\n", (double)norm, its));
/*
Free work space. All PETSc objects should be destroyed when they
are no longer needed.
*/
PetscCall(KSPDestroy(&ksp));
PetscCall(VecDestroy(&u));
PetscCall(VecDestroy(&x));
PetscCall(VecDestroy(&b));
PetscCall(MatDestroy(&A));
/*
Always call PetscFinalize() before exiting a program. This routine
- finalizes the PETSc libraries as well as MPI
- provides summary and diagnostic information if certain runtime
options are chosen (e.g., -log_view).
*/
PetscCall(PetscFinalize());
return 0;
}
SIMD parallelism occurs most commonly in the Intel advanced vector extensions (AVX)
(https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) families of instructions. It may be automatically
used by the optimizing compiler, or in low-level libraries that PETSc uses such as BLAS (see BLIS,
https://github.com/flame/blis), or, rarely, directly in PETSc C/C++ code, as in MatMult_SeqSELL
(https://petsc.org/main/src/mat/impls/sell/seq/sell.c.html#MatMult_SeqSELL).
OpenMP parallelism is thread parallelism. Multiple threads (independent streams of instructions) process
data and perform computations on different parts of memory that is shared (accessible) to all of the threads.
The OpenMP model is most often based on inserting pragmas into code indicating that a series of instructions
(often within a loop) can be run in parallel. This is also called a fork-join model of parallelism, since much of
the code remains sequential and only the computationally expensive parts in the ‘parallel region’ are parallel.
OpenMP thus makes it relatively easy to add some degree of parallelism to a conventional sequential code
in a shared memory environment.
POSIX threads (pthreads) is a library that may be called from C/C++. The library contains routines
to create, join, and remove threads plus manage communications and synchronizations between threads.
Pthreads is rarely used directly in numerical libraries and applications. Sometimes OpenMP is implemented
on top of pthreads.
If one adds OpenMP parallelism to an MPI code one must make sure not to over-subscribe the hardware
resources. For example, if MPI already has one rank per hardware core then using four OpenMP threads
per MPI rank will slow the code down since now one core will need to switch back and forth between four
OpenMP threads. There are limited practical advantages to a combined MPI and OpenMP model in PETSc,
but it is possible.
For application codes that use certain external packages, including BLAS/LAPACK, SuperLU_DIST,
MUMPS, MKL, and SuiteSparse, one can build PETSc and these packages to take advantage of OpenMP
by using the configure option --with-openmp. The number of OpenMP threads used in the application
can be controlled with the PETSc command line option -omp_num_threads <num> or the environmental
variable OMP_NUM_THREADS. Running a PETSc program with -omp_view will display the number of
threads being used. The default number is often absurdly high for the given hardware so we recommend
always setting it appropriately. Users can also put OpenMP pragmas into their own code. However since
standard PETSc is not thread-safe, they should not, in general, call PETSc routines from inside the parallel
regions.
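For example, a run that combines two MPI ranks with four OpenMP threads per rank might look as follows (./app is a placeholder for your own executable built against an OpenMP-enabled PETSc):

$ mpiexec -n 2 ./app -omp_num_threads 4 -omp_view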
PETSc MPI-based linear solvers may be accessed from a sequential or OpenMP program with the PCMPI
solver wrapper; see Using a MPI parallel linear solver from a non-MPI program.
There is an OpenMP thread-safe subset of PETSc that may be configured using --with-threadsafety
[--with-openmp or --download-concurrencykit]. KSP Tutorial ex61f demonstrates how this may
be used with OpenMP. In this mode one may have individual OpenMP threads that each manage their own
(sequential) PETSc objects (each thread can interact only with its own objects). This is useful when one
has many small systems (or sets of ODEs) that must be integrated in an “embarrassingly parallel” fashion
on multicore systems.
See also:
Edward A. Lee, The Problem with Threads, Technical Report No. UCB/EECS-2006-1, January 10, 2006.
GPUs offer at least two levels of clearly defined parallelism. Kernel level parallelism is much like SIMD
parallelism applied to loops; many different “iterations” of the loop index run on different hardware but
in “lock-step” at the same time. PETSc utilizes this parallelism with three similar, but slightly different
models:
• CUDA, which is provided by NVIDIA and runs on NVIDIA GPUs
• HIP, provided by AMD, which can, in theory, run on both AMD and NVIDIA GPUs
• and Kokkos, an open-source package that provides a slightly higher level programming model to utilize
GPU kernels.
To utilize this, one configures PETSc with either --with-cuda or --with-hip and, if Kokkos is to be used,
also --with-kokkos --with-kokkos-kernels.
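For example, a configure invocation for an NVIDIA system using Kokkos might look like the following (paths and additional options omitted; consult the installation guide for your machine):

$ ./configure --with-cuda --with-kokkos --with-kokkos-kernels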
In the GPU programming model that PETSc uses the GPU memory is distinct from the CPU memory.
This means that data that resides on the CPU memory must be copied to the GPU (often this copy is
done automatically by the libraries and the user does not need to manage it) if one wishes to use the GPU
computational power on it. This memory copy is slow compared to the GPU speed hence it is crucial to
minimize these copies. This often translates to trying to do almost all the computation on the GPU and not
constantly switching between computations on the CPU and the GPU on the same data.
PETSc utilizes GPUs by providing vector and matrix classes (Vec and Mat) that are specifically written to
run fast on the GPU. However, since it is difficult to write an entire PETSc code that runs only on the GPU
one can also access and work with (for example, put entries into) the vectors and matrices on the CPU.
The vector and matrix classes include VECCUDA, MATAIJCUSPARSE, VECKOKKOS, MATAIJKOKKOS, and VECHIP
(matrices are not yet supported by PETSc with HIP).
More details on using GPUs from PETSc will follow in this document.
Incomplete
The output below illustrates compiling and running a PETSc program using MPICH on a macOS laptop.
Note that different machines will have compilation commands as determined by the configuration process.
See Writing C/C++ or Fortran Applications for a discussion about how to compile your PETSc programs.
Users who are experiencing difficulties linking PETSc programs should refer to the FAQ.
$ cd $PETSC_DIR/src/ksp/ksp/tutorials
$ make ex2
/Users/patrick/petsc/arch-darwin-double-debug/bin/mpicc -o ex2.o -c -g3 -I/Users/patrick/petsc/include -I/Users/patrick/petsc/arch-darwin-double-debug/include ...
/bin/rm -f ex2.o
$ $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1 ./ex2
Norm of error 0.000156044 iterations 6
$ $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 2 ./ex2
Norm of error 0.000411674 iterations 7
The option -log_view activates printing of a performance summary, including times, floating point
operation (flop) rates, and message-passing activity. Profiling provides details about profiling, including
interpretation of the output data below. This particular example involves the solution of a linear system on
one processor using GMRES and ILU. The low floating point operation (flop) rates in this example are due
to the fact that the code solved a tiny system. We include this example merely to demonstrate the ease of
extracting performance information.
$ $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1 ./ex1 -n 1000 -pc_type ilu -ksp_type gmres -ksp_rtol 1.e-7 -log_view
...
------------------------------------------------------------------------------------------------------------------------
Object Type          Creations   Destructions     Memory  Descendants' Mem.
Vector                       8              8      76224         0.
Matrix                       2              2     134212         0.
Krylov Solver                1              1      18400         0.
Preconditioner               1              1       1032         0.
Index Set                    3              3      10328         0.
Viewer                       1              0          0         0.
========================================================================================================================
...
The examples throughout the library demonstrate the software usage and can serve as templates for
developing custom applications. We suggest that new PETSc users examine programs in the directories
$PETSC_DIR/src/<library>/tutorials where <library> denotes any of the PETSc libraries
(listed in the following section), such as SNES or KSP or TS. The manual pages located at
https://petsc.org/release/documentation/ provide links (organized by both routine names and concepts)
to the tutorial examples.
To develop an application program that uses PETSc, we suggest the following:
• Download and install PETSc.
• For completely new applications
1. Make a directory for your source code: for example, mkdir $HOME/application
2. Change to that directory; for example, cd $HOME/application
3. Copy an example in the directory that corresponds to the problems of interest into your directory,
for example, cp $PETSC_DIR/src/snes/tutorials/ex19.c ex19.c
4. Select an application build process. The PETSC_DIR (and PETSC_ARCH if the
--prefix=directoryname option was not used when configuring PETSc) environmental
variable(s) must be set for any of these approaches.
– make (recommended). Copy $PETSC_DIR/share/petsc/Makefile.user or
$PETSC_DIR/share/petsc/Makefile.basic.user to your directory, for example, cp
$PETSC_DIR/share/petsc/Makefile.user makefile
Examine the comments in your makefile
Makefile.user uses the pkg-config tool and is the recommended approach.
Use make ex19 to compile your program
– CMake. Copy $PETSC_DIR/share/petsc/CMakeLists.txt to your directory, for example, cp
$PETSC_DIR/share/petsc/CMakeLists.txt CMakeLists.txt
Edit CMakeLists.txt, read the comments on usage and change the name of application from
ex1 to your application executable name.
5. Run the program, for example, ./ex19
6. Start to modify the program for developing your application.
• For adding PETSc to an existing application
1. Start with a working version of your code that you build and run to confirm that it works.
2. Upgrade your build process. The PETSC_DIR (and PETSC_ARCH if the
--prefix=directoryname option was not used when configuring PETSc) environmental
variable(s) must be set for any of these approaches.
– Using make. Update the application makefile to add the appropriate PETSc include
directories and libraries.
∗ Recommended approach. Examine the comments in
$PETSC_DIR/share/petsc/Makefile.user and transfer selected portions of that file
to your makefile.
∗ Minimalist. Add the line
include ${PETSC_DIR}/lib/petsc/conf/variables
to the bottom of your makefile. This will provide a set of PETSc specific
make variables you may use in your makefile. See the comments in the file
$PETSC_DIR/share/petsc/Makefile.basic.user for details on the usage.
∗ Simple, but hands the build process over to PETSc’s control. Add the lines
include ${PETSC_DIR}/lib/petsc/conf/variables
include ${PETSC_DIR}/lib/petsc/conf/rules
All the libraries listed need to be linked into your executable and the include directories
and flags need to be passed to the compiler(s). Usually this is done by setting
LDFLAGS=<list of library flags and libraries> and CFLAGS=<list of -I and other flags>
and FFLAGS=<list of -I and other flags> etc. in your makefile.
– Using CMake. Update the application CMakeLists.txt by examining the code and comments
in $PETSC_DIR/share/petsc/CMakeLists.txt
3. Rebuild your application and ensure it still runs correctly.
4. Add a PetscInitialize() near the beginning of your code and PetscFinalize() near the
end with appropriate include commands (and use commands in Fortran)
5. Rebuild your application and ensure it still runs correctly.
6. Slowly start utilizing PETSc functionality in your code, ensure that your code continues to build
and run correctly.
Though PETSc has a large API, conceptually it is rather simple. There are three abstract basic data objects
(classes): index sets (IS), vectors (Vec), and matrices (Mat), plus a larger number of abstract algorithm
objects (classes), including preconditioners (PC), Krylov solvers (KSP), and so forth.
Let Object represent any of these objects. Objects are created with
Object obj;
ObjectCreate(MPI_Comm, &obj);
The object is empty and little can be done with it. A particular implementation of the class is associated
with the object by setting the object’s “type”, where type is merely a string name of an implementation
class using
Object obj;
ObjectSetType(obj,"Name");
Some objects support subclasses which are specializations of the type. These are set with
Object obj;
ObjectNameSetType(obj,"SubName");
For example:
TS obj;
TSCreate(PETSC_COMM_WORLD,&obj);
TSSetType(obj,TSARKIMEX);
TSARKIMEXSetType(obj,TSARKIMEX3);
The abstract class TS can embody any ODE/DAE integrator scheme. This example creates an additive
Runge-Kutta ODE/DAE IMEX integrator, whose type name is TSARKIMEX, using a third-order scheme with
an L-stable implicit part, whose subtype name is TSARKIMEX3.
In order to allow PETSc objects to be runtime configurable, PETSc objects provide a universal way of
selecting types (classes) and subtypes at runtime, from what is referred to as the “options database”. The
code above can be replaced with
TS obj;
TSCreate(PETSC_COMM_WORLD,&obj);
TSSetFromOptions(obj);
now both the type and subtype can be conveniently set from the command line
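For example (./app is a placeholder for your executable; the option names follow the lower-case convention described below):

$ ./app -ts_type arkimex -ts_arkimex_type 3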
The object’s type (implementation class) or subclass can also be changed at any time simply by calling
TSSetType() again (though in order to override command line options the call to TSSetType() must be
made _after_ TSSetFromOptions()). For example:
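A sketch of this override pattern (TSBEULER, backward Euler, is used here simply as an illustrative type):

TS obj;
TSCreate(PETSC_COMM_WORLD,&obj);
TSSetFromOptions(obj);
TSSetType(obj,TSBEULER);   /* overrides whatever -ts_type selected */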
Since the later call always overrides the earlier one, this second form is rarely, if ever, used, as it is
less flexible than configuring the solver through command line settings.
The standard methods on an object are of the general form
ObjectSetXXX(obj,...);
ObjectGetXXX(obj,...);
ObjectYYY(obj,...);
For example
TSSetRHSFunction(obj,...)
Particular types and subtypes of objects may have their own methods, which are given in the form
ObjectNameSetXXX(obj,...);
ObjectNameGetXXX(obj,...);
ObjectNameYYY(obj,...);
and
ObjectNameSubNameSetXXX(obj,...);
ObjectNameSubNameGetXXX(obj,...);
ObjectNameSubNameYYY(obj,...);
where Name and SubName are the type and subtype names (for example, TSARKIMEX and 3 above). Most
“set” operations have options database versions with the same names in lower case, separated by underscores,
and with the “set” removed. For example,
$ ./app -ksp_gmres_restart 30
There is a special subset of type-specific methods that are ignored if the type does not match the function
name. These are usually setter functions that control some aspect specific to the subtype. For example,
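a sketch using the GMRES restart setter (the call is simply ignored if the KSP type is not GMRES):

KSP ksp;
KSPCreate(PETSC_COMM_WORLD,&ksp);
KSPSetType(ksp,KSPGMRES);
KSPGMRESSetRestart(ksp,30);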
These allow cleaner code, since it does not need a multitude of if statements to avoid calling inactive methods.
That is, one does not need to write code like
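(a sketch of the guarded call that is unnecessary; PetscObjectTypeCompare() is the standard type query)

PetscBool isgmres;
PetscObjectTypeCompare((PetscObject)ksp,KSPGMRES,&isgmres);
if (isgmres) KSPGMRESSetRestart(ksp,30);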
There are many “get” routines that give one temporary access to the internal data of an object. They are
used in the style
XXX xxx;
ObjectGetXXX(obj,&xxx);
// use xxx
ObjectRestoreXXX(obj,&xxx);
Objects obtained with a “get” routine should be returned with a “restore” routine, generally within the same
function. Objects obtained with a “create” routine should be freed with a “destroy” routine.
There may be variants of the “get” routines that give more limited access to the obtained object. For
example,
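a read-only variant of the array access routines is a typical case (sketch; x is an existing Vec):

const PetscScalar *a;
VecGetArrayRead(x,&a);
/* read, but do not modify, the entries through a */
VecRestoreArrayRead(x,&a);

Objects of any class may be displayed or saved with the viewer routines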
ObjectView(obj,PetscViewer viewer);
ObjectViewFromOptions(obj,...);
where PetscViewer is an abstract object that can represent standard output, an ASCII or binary file, a
graphical window, etc. The second variant allows the user to delay until runtime the decision of what viewer
and format to use to view the object or if to view the object at all.
Objects are destroyed with
ObjectDestroy(&obj)
In many situations the user may also wish to override or provide custom functionality. This is handled via
callbacks which the library will call at the appropriate time. The most general callback is provided by
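a call of the following general form (written here in the generic Object style used above; actual routine names and argument orders vary by class)

ObjectSetCallback(obj, callbackfunction, ctx, callbackdestroy);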
where callbackfunction() is what is used by the library, ctx is an optional data-structure (array,
struct, PETSc object) that is used by callbackfunction() and callbackdestroy(void *ctx) is
an optional function that will be called when obj is destroyed. The use of the callbackdestroy() allows
users to “set and forget” data structures that will not be needed elsewhere but still need to be cleaned up
when no longer needed. Here is an example of the use of a full-fledged callback
TS ts;
TSMonitorLGCtx *ctx;
TSMonitorLGCtxCreate(..., &ctx);
TSMonitorSet(ts, TSMonitorLGTimeStep, ctx, (PetscErrorCode(*)(void **))TSMonitorLGCtxDestroy);
TSSolve(ts);
Occasionally routines to set callback functions take additional data objects that will be used by the object
but are not context data for the function. For example,
SNES obj;
Vec r;
void *ctx;
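/* sketch: FormFunction is a hypothetical user-provided residual routine */
SNESSetFunction(obj, r, FormFunction, ctx);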
The r vector is an optional argument provided by the user which will be used as work-space by SNES. Note
that this callback does not provide a way for the user to have the ctx destroyed when the SNES object is
destroyed; the user must ensure that it is freed at an appropriate time. There is no single uniform pattern to
the various ways PETSc accepts callback functions in different places in the code.
See Tao use of PETSc and callbacks for a cartoon on the use of callbacks in Tao.
We conclude this introduction with an overview of the organization of the PETSc software. The root
directory of PETSc contains the following directories:
• doc (only in the tarball distribution of PETSc; not the git repository) - All documentation for PETSc.
The file manual.pdf contains the hyperlinked users manual, suitable for printing or on-screen viewing.
Includes the subdirectory manualpages (on-line manual pages).
• lib/petsc/conf - Base PETSc configuration files that define the standard make variables and rules
used by PETSc
• include - All include files for PETSc that are visible to the user.
• include/petsc/finclude - PETSc Fortran include files.
• include/petsc/private - Private PETSc include files that should not need to be used by
application programmers.
• share - Some small test matrices in data files
• src - The source code for all PETSc libraries, which currently includes
– vec - vectors,
∗ is - index sets,
– mat - matrices,
– ksp - complete linear equations solvers,
∗ ksp - Krylov subspace accelerators,
∗ pc - preconditioners,
– snes - nonlinear solvers
– ts - ODE solvers and timestepping,
– dm - data management between meshes and solvers, vectors, and matrices,
– sys - general system-related routines,
∗ logging - PETSc logging and profiling routines,
∗ classes - low-level classes
· draw - simple graphics,
· viewer - mechanism for printing and visualizing PETSc objects,
· bag - mechanism for saving and loading from disk user data stored in C structs.
· random - random number generators.
Each PETSc source code library directory has the following subdirectories:
• tutorials - Programs designed to teach users about PETSc.
These codes can serve as templates for the design of custom applications.
• tests - Programs designed for thorough testing of PETSc. As
such, these codes are not intended for examination by users.
• interface - Provides the abstract base classes for the objects. Code here does not know about
particular implementations and does not actually perform operations on the underlying numerical
data.
• impls - Source code for one or more implementations of the class for particular data structures or
algorithms.
• utils - Utility routines. Source here may know about the implementations, but ideally will not know
about implementations for other components.
CHAPTER TWO

VECTORS AND PARALLEL DATA
Vectors (denoted by Vec) are used to store discrete PDE solutions, right-hand sides for linear systems, etc.
Users can create and manipulate entries in vectors directly with a basic, low-level interface or they can use
the PETSc DM objects to connect actions on vectors to the type of discretization and grid that they are
working with. These higher level interfaces handle much of the details of the interactions with vectors and
hence are preferred in most situations. This chapter is organized as follows:
• Creating Vectors
– User managed
– DMDA - Creating vectors for structured grids
– DMStag - Creating vectors for staggered grids
– DMPLEX - Creating vectors for unstructured grids
– DMNETWORK - Creating vectors for networks
• Setting vector values
– For generic vectors
– DMDA - Setting vector values
– DMSTAG - Setting vector values
– DMPLEX - Setting vector values
– DMNETWORK - Setting vector values
• Basic Vector Operations
• Local/global vectors and communicating between vectors
• Low-level Vector Communication
– Local to global mappings
– Global Vectors with locations for ghost values
• Application Orderings
PETSc provides many ways to create vectors: the most basic, where the user is responsible for managing
the parallel distribution of the vector entries, and a variety of higher-level approaches, based on DM, for
classes of problems such as structured grids, staggered grids, unstructured grids, networks, and particles.
The most basic way to create a vector with a local size of m and a global size of M is with the commands
sketched below, which automatically generate the appropriate vector type (sequential or parallel) over all processes in comm.
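A minimal sketch of this creation sequence (standard PETSc calls; see their manual pages for full details):
VecCreate(MPI_Comm comm, Vec *v);
VecSetSizes(Vec v, PetscInt m, PetscInt M);
VecSetFromOptions(Vec v);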
The option -vec_type <type> can be used in conjunction with VecSetFromOptions() to specify the
use of a particular type of vector. For example, for NVIDIA GPU CUDA use cuda. The GPU based vectors
allow one to set values on either the CPU or GPU but do their computations on the GPU.
We emphasize that all processes in comm must call the vector creation routines, since these routines are
collective over all processes in the communicator. If you are not familiar with MPI communicators, see the
discussion in Writing PETSc Programs. In addition, if a sequence of creation routines is used, they
must be called in the same order on each process in the communicator.
Instead of, or before, calling VecSetFromOptions(), one can set the vector type directly, as sketched below.
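A sketch of the call (the VecType argument names the implementation, e.g. VECSEQ, VECMPI, VECCUDA, VECKOKKOS):
VecSetType(Vec v, VecType vec_type);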
One can create vectors whose entries are stored on GPUs using a shorthand helper routine,
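for example (a sketch; VecCreateMPICUDA() is the CUDA shorthand, and the other GPU back ends have analogous routines):
VecCreateMPICUDA(MPI_Comm comm, PetscInt n, PetscInt N, Vec *v);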
There are shorthand creation routines for almost all vector types; we recommend using the more verbose
form because it allows selecting CPU or GPU execution at runtime.
For applications running in parallel that involve multi-dimensional structured grids, unstructured grids,
networks, etc it is cumbersome and complicated for users to explicitly manage the needed local and global
sizes of the vectors. Hence PETSc provides a powerful abstract object called the DM to help manage the
vectors and matrices needed for such applications. Parallel vectors can be created easily with
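A sketch of the DM-based creation call:
DMCreateGlobalVector(DM dm, Vec *v);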
The DM object (see DMDA - Creating vectors for structured grids, DMStag - Creating vectors for staggered
grids, and DMPlex: Unstructured Grids for more details on DM for structured, staggered structured, and
unstructured grids) manages creating the correctly sized parallel vectors efficiently. One controls
the type of vector that DM creates by calling
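A sketch of the call:
DMSetVecType(DM dm, VecType vt);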
or by calling DMSetFromOptions(DM dm) and using the option -dm_vec_type <standard or cuda
or kokkos etc>
Each DM type is suitable for a family of problems. The first of these, DMDA, is intended for use with logically
structured rectangular grids when communication of nonlocal data is needed before certain local computations
can occur. PETSc distributed arrays are designed only for the case in which data can be thought of as being
stored in a standard multidimensional array; thus, DMDAs are not intended for parallelizing unstructured
grid problems, etc.
For example, a typical situation one encounters in solving PDEs in parallel is that, to evaluate a local function,
f(x), each process requires its local portion of the vector x as well as its ghost points (the bordering portions
of the vector that are owned by neighboring processes). Figure Ghost Points for Two Stencil Types on the
Seventh Process illustrates the ghost points for the seventh process of a two-dimensional, structured parallel
grid. Each box represents a process; the ghost points for the seventh process’s local part of a parallel array
are shown in gray.
Fig. 2.1: Ghost Points for Two Stencil Types on the Seventh Process
The DMDA object contains parallel data layout information and communication information and is used to
create vectors and matrices with the proper layout.
One creates a distributed array communication data structure in two dimensions with the command
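A sketch of the call, with argument names matching the discussion below (see the DMDACreate2d() manual page for the exact signature):
DMDACreate2d(MPI_Comm comm, DMBoundaryType xperiod, DMBoundaryType yperiod, DMDAStencilType st,
             PetscInt M, PetscInt N, PetscInt m, PetscInt n, PetscInt dof, PetscInt s,
             PetscInt *lx, PetscInt *ly, DM *da);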
The arguments M and N indicate the global numbers of grid points in each direction, while m and n denote
the process partition in each direction; m*n must equal the number of processes in the MPI communicator,
comm. Instead of specifying the process layout, one may use PETSC_DECIDE for m and n so that PETSc
will select the partition. The type of periodicity of the array is specified by xperiod and yperiod,
which can be DM_BOUNDARY_NONE (no periodicity), DM_BOUNDARY_PERIODIC (periodic in that direction),
DM_BOUNDARY_TWIST (periodic in that direction, but identified in reverse order), DM_BOUNDARY_GHOSTED
, or DM_BOUNDARY_MIRROR. The argument dof indicates the number of degrees of freedom at each array
point, and s is the stencil width (i.e., the width of the ghost point region). The optional arrays lx and ly
may contain the number of nodes along the x and y axis for each cell, i.e. the dimension of lx is m and the
dimension of ly is n; alternately, NULL may be passed in.
Two types of distributed array communication data structures can be created, as specified by
st. Star-type stencils that radiate outward only in the coordinate directions are indicated by
DMDA_STENCIL_STAR, while box-type stencils are specified by DMDA_STENCIL_BOX. For example, for
the two-dimensional case, DMDA_STENCIL_STAR with width 1 corresponds to the standard 5-point stencil,
while DMDA_STENCIL_BOX with width 1 denotes the standard 9-point stencil. In both instances the ghost
points are identical, the only difference being that with star-type stencils certain ghost points are ignored,
decreasing substantially the number of messages sent. Note that the DMDA_STENCIL_STAR stencils can
save interprocess communication in two and three dimensions.
These DMDA stencils have nothing directly to do with a specific finite difference stencil one might chose to
use for a discretization; they only ensure that the correct values are in place for application of a user-defined
finite difference stencil (or any other discretization technique).
The commands for creating distributed array communication data structures in one and three dimensions
are analogous:
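Sketches of the one- and three-dimensional calls (again see the manual pages for the exact signatures):
DMDACreate1d(MPI_Comm comm, DMBoundaryType xperiod, PetscInt M, PetscInt dof, PetscInt s,
             PetscInt *lx, DM *da);
DMDACreate3d(MPI_Comm comm, DMBoundaryType xperiod, DMBoundaryType yperiod, DMBoundaryType zperiod,
             DMDAStencilType st, PetscInt M, PetscInt N, PetscInt P, PetscInt m, PetscInt n, PetscInt p,
             PetscInt dof, PetscInt s, PetscInt *lx, PetscInt *ly, PetscInt *lz, DM *da);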
The routines to create distributed arrays are collective, so that all processes in the communicator comm must
call the same creation routines in the same order.
For structured grids with staggered data (living on elements, faces, edges, and/or vertices), the DMSTAG
object is available. It behaves much like DMDA. See DMSTAG: Staggered, Structured Grid for discussion of
creating vectors with DMSTAG.
See DMPlex: Unstructured Grids for discussion of creating vectors with DMPLEX.
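One can examine (view) a vector with VecView(); a sketch of the standard call, where the viewer argument selects where and how the vector is displayed:
VecView(Vec x, PetscViewer v);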
To print the vector to the screen, one can use the viewer PETSC_VIEWER_STDOUT_WORLD, which ensures
that parallel vectors are printed correctly to stdout. To display the vector in an X-window, one can use
the default X-windows viewer PETSC_VIEWER_DRAW_WORLD, or one can create a viewer with the routine
PetscViewerDrawOpen(). A variety of viewers are discussed further in Viewers: Looking at PETSc
Objects.
To create a new vector of the same format as an existing vector, one uses
To create several new vectors of the same format as an existing vector, one uses
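Sketches of the two duplication calls:
VecDuplicate(Vec old, Vec *new);
VecDuplicateVecs(Vec old, PetscInt n, Vec **new);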
This routine creates an array of pointers to vectors. The two routines are very useful because they allow
one to write library code that does not depend on the particular format of the vectors being used. Instead,
the subroutines can automatically correctly create work vectors based on the specified existing vector. As
discussed in Duplicating Multiple Vectors, the Fortran interface for VecDuplicateVecs() differs slightly.
When a vector is no longer needed, it should be destroyed with the command
VecDestroy(Vec *x);
Note that the Fortran interface for VecDestroyVecs() differs slightly, as described in Duplicating Multiple
Vectors.
It is also possible to create vectors that use an array provided by the user, rather than having PETSc
internally allocate the array space. Such vectors can be created with routines such as
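the following (a sketch; bs is the block size, often 1):
VecCreateSeqWithArray(MPI_Comm comm, PetscInt bs, PetscInt n, const PetscScalar array[], Vec *V);
VecCreateMPIWithArray(MPI_Comm comm, PetscInt bs, PetscInt n, PetscInt N, const PetscScalar array[], Vec *V);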
For GPU vectors the array pointer should be a GPU memory location.
Note that here one must provide the value n; it cannot be PETSC_DECIDE, and the user is responsible for
providing enough space in the array: n*sizeof(PetscScalar) bytes.
One can assign a single value to all components of a vector with the command
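A sketch of the call:
VecSet(Vec x, PetscScalar value);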
Assigning values to individual components of the vector is more complicated, in order to make it possible to
write efficient parallel code. Assigning a set of components on a CPU is a two-step process: one first calls
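A sketch of the insertion call (the argument names match the description that follows):
VecSetValues(Vec x, PetscInt n, const PetscInt indices[], const PetscScalar values[], INSERT_VALUES);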
any number of times on any or all of the processes. The argument n gives the number of components
being set in this insertion. The integer array indices contains the global component indices, and values
is the array of values to be inserted. Any process can set any components of the vector; PETSc ensures
that they are automatically stored in the correct location. Once all of the values have been inserted with
VecSetValues(), one must call
VecAssemblyBegin(Vec x);
followed by
VecAssemblyEnd(Vec x);
to perform any needed message passing of nonlocal components. In order to allow the overlap of communi-
cation and calculation, the user’s code can perform any series of other actions between these two calls while
the messages are in transition.
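Values may also be added to existing entries, rather than inserted, by calling VecSetValues() with ADD_VALUES as the final argument, for example (a sketch):
VecSetValues(Vec x, PetscInt n, const PetscInt indices[], const PetscScalar values[], ADD_VALUES);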
Again one must call the assembly routines VecAssemblyBegin() and VecAssemblyEnd() after all of
the values have been added. Note that addition and insertion calls to VecSetValues() cannot be mixed.
Instead, one must add and insert vector elements in phases, with intervening calls to the assembly routines.
This phased assembly procedure overcomes the nondeterministic behavior that would occur if two different
processes generated values for the same location, with one process adding while the other is inserting its value.
(In this case the addition and insertion actions could be performed in either order, thus resulting in different
values at the particular location. Since PETSc does not allow the simultaneous use of INSERT_VALUES
and ADD_VALUES this nondeterministic behavior will not occur in PETSc.)
You can call VecGetValues() to pull local values from a vector (but not off-process values).
For vectors obtained with DMCreateGlobalVector() one can use VecSetValuesLocal() to set values
into a global vector using the local (ghosted) indexing of the vector entries.
It is also possible to interact directly with the arrays that the vector values are stored in. The routine
VecGetArray() returns a pointer to the elements local to the process:
When access to the array is no longer needed, the user should call
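A sketch of the access/restore pair (the read-only variants VecGetArrayRead()/VecRestoreArrayRead() are analogous):
VecGetArray(Vec v, PetscScalar **array);
VecRestoreArray(Vec v, PetscScalar **array);
The following excerpt shows a typical usage: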
PetscFunctionBeginUser;
/*
Get pointers to vector data.
- For default PETSc vectors, VecGetArray() returns a pointer to
the data array. Otherwise, the routine is implementation dependent.
- You MUST call VecRestoreArray() when you no longer need access to
the array.
*/
PetscCall(VecGetArrayRead(x, &xx));
PetscCall(VecGetArray(f, &ff));
/* Compute function */
/* Restore vectors */
PetscCall(VecRestoreArrayRead(x, &xx));
PetscCall(VecRestoreArray(f, &ff));
PetscFunctionReturn(PETSC_SUCCESS);
Minor differences exist in the Fortran interface for VecGetArray() and VecRestoreArray(), as dis-
cussed in Routines that Return Fortran Allocatable Arrays. It is important to note that VecGetArray()
and VecRestoreArray() do not copy the vector elements; they merely give users direct access to the
vector elements. Thus, these routines require essentially no time to call and can be used efficiently.
For GPU vectors one can access either the values on the CPU as described above or one can call, for example,
PetscCall(VecCUDAGetArrayRead(xlocal, &xarray));
PetscCall(VecCUDAGetArrayWrite(f, &farray));
if (rank) xstartshift = 1;
else xstartshift = 0;
if (rank != size - 1) xendshift = 1;
else xendshift = 0;
PetscCall(VecGetOwnershipRange(f, &fstart, NULL));
PetscCall(VecGetLocalSize(x, &lsize));
// clang-format off
try {
thrust::for_each(
thrust::make_zip_iterator(
thrust::make_tuple(
thrust::device_ptr<PetscScalar>(farray),
thrust::device_ptr<const PetscScalar>(xarray + xstartshift),
thrust::device_ptr<const PetscScalar>(xarray + xstartshift + 1),
thrust::device_ptr<const PetscScalar>(xarray + xstartshift - 1),
thrust::counting_iterator<int>(fstart),
thrust::constant_iterator<int>(Mx),
thrust::constant_iterator<PetscScalar>(hx))),
thrust::make_zip_iterator(
thrust::make_tuple(
thrust::device_ptr<PetscScalar>(farray + lsize),
thrust::device_ptr<const PetscScalar>(xarray + lsize - xendshift),
thrust::device_ptr<const PetscScalar>(xarray + lsize - xendshift + 1),
thrust::device_ptr<const PetscScalar>(xarray + lsize - xendshift - 1),
thrust::counting_iterator<int>(fstart) + lsize,
thrust::constant_iterator<int>(Mx),
thrust::constant_iterator<PetscScalar>(hx))),
ApplyStencil());
  } catch (...) {
    /* report or handle any failure thrown by Thrust (error handling abbreviated in this excerpt) */
  }
or
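a sketch of the type-agnostic accessor, assuming the illustrative names x, xarray, and mtype (VecGetArrayAndMemType() also reports, via a PetscMemType, where the returned array lives):
PetscCall(VecGetArrayAndMemType(x, &xarray, &mtype));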
which, in the first case, returns a GPU memory address and, in the second case, returns either a CPU or GPU
memory address depending on the type of the vector. For usage with GPUs one can then launch a GPU
kernel function that accesses the vector’s memory. When computing on GPUs, VecSetValues() is not used!
One always accesses the vector’s arrays and passes them to the GPU code.
It can also be convenient to treat the vector’s entries as a Kokkos view. One first creates Kokkos vectors and
then calls VecGetKokkosView() (and, when finished, VecRestoreKokkosView()) to obtain a Kokkos view of the entries.
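The ownership range discussed next can be obtained with the standard call (a sketch):
VecGetOwnershipRange(Vec vec, PetscInt *start, PetscInt *end);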
The argument start indicates the first component owned by the local process, while end specifies one
more than the last owned by the local process. This command is useful, for instance, in assembling parallel
vectors.
The number of elements stored locally can be accessed with
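A sketch of the call:
VecGetLocalSize(Vec v, PetscInt *size);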
PETSc provides an easy way to set values into the DMDA vectors and access them using the natural grid
indexing. This is done with the routines
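A sketch (see the DMDAVecGetArray() manual page for the exact signature; the Restore call must follow the Get call):
DMDAVecGetArray(DM da, Vec l, void *array);
DMDAVecRestoreArray(DM da, Vec l, void *array);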
where array is a multidimensional C array with the same dimension as da, and
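the DOF variants (again a sketch):
DMDAVecGetArrayDOF(DM da, Vec l, void *array);
DMDAVecRestoreArrayDOF(DM da, Vec l, void *array);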
where array is a multidimensional C array with one more dimension than da. The vector l can be either
a global vector or a local vector. The array is accessed using the usual global indexing on the entire grid,
but the user may only refer to the local and ghost entries of this array as all other entries are undefined. For
example, for a scalar problem in two dimensions one could use
PetscScalar **f,**u;
...
DMDAVecGetArrayRead(DM da,Vec local,&u);
DMDAVecGetArray(DM da,Vec global,&f);
...
f[i][j] = u[i][j] - ...
...
DMDAVecRestoreArrayRead(DM da,Vec local,&u);
DMDAVecRestoreArray(DM da,Vec global,&f);
PetscFunctionBeginUser;
PetscCall(DMGetLocalVector(da, &xlocal));
/*
Scatter ghost points to local vector, using the 2-step process
DMGlobalToLocalBegin(), DMGlobalToLocalEnd().
By placing code between these two statements, computations can
be done while messages are in transition.
*/
PetscCall(DMGlobalToLocalBegin(da, x, INSERT_VALUES, xlocal));
PetscCall(DMGlobalToLocalEnd(da, x, INSERT_VALUES, xlocal));
/*
Get pointers to vector data.
- The vector xlocal includes ghost points; the vectors x and f do
NOT include ghost points.
- Using DMDAVecGetArray() allows accessing the values using global ordering
*/
PetscCall(DMDAVecGetArrayRead(da, xlocal, (void *)&xx));
PetscCall(DMDAVecGetArray(da, f, &ff));
PetscCall(DMDAVecGetArrayRead(da, user->F, (void *)&FF));
/*
Get local grid boundaries (for 1-dimensional DMDA):
xs, xm - starting grid index, width of local grid (no ghost points)
*/
PetscCall(DMDAGetCorners(da, &xs, NULL, NULL, &xm, NULL, NULL));
PetscCall(DMDAGetInfo(da, NULL, &M, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
/*
   Set function values for boundary points; define local interior grid point range
*/
/*
Compute function over locally owned part of the grid (interior points only)
*/
d = 1.0 / (user->h * user->h);
for (i = xs; i < xs + xm; i++) ff[i] = d * (xx[i - 1] - 2.0 * xx[i] + xx[i + 1]) + xx[i] * xx[i] - FF[i];
/*
Restore vectors
*/
PetscCall(DMDAVecRestoreArrayRead(da, xlocal, (void *)&xx));
PetscCall(DMDAVecRestoreArray(da, f, &ff));
PetscCall(DMDAVecRestoreArrayRead(da, user->F, (void *)&FF));
PetscCall(DMRestoreLocalVector(da, &xlocal));
PetscFunctionReturn(PETSC_SUCCESS);
The recommended approach for multi-component PDEs is to declare a struct representing the fields defined
at each node of the grid, e.g.
typedef struct {
PetscScalar u,v,omega,temperature;
} Node;
Node **f,**u;
DMDAVecGetArray(DM da,Vec local,&u);
DMDAVecGetArray(DM da,Vec global,&f);
...
f[i][j].omega = ...
...
DMDAVecRestoreArray(DM da,Vec local,&u);
DMDAVecRestoreArray(DM da,Vec global,&f);
The DMDAVecGetArray routines are also provided for GPU access with CUDA, HIP, and Kokkos. For
example,
where *XX* can contain any number of *. This allows one to write very natural Kokkos multi-dimensional
parallel for kernels that act on the local portion of DMDA vectors.
PetscFunctionBeginUser;
PetscCall(DMGetLocalVector(da, &xl));
PetscCall(DMGlobalToLocal(da, x, INSERT_VALUES, xl));
d = 1.0 / (user->h * user->h);
PetscCall(DMDAGetInfo(da, NULL, &M, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
  else R(i) = d * (X(i - 1) - 2.0 * X(i) + X(i + 1)) + X(i) * X(i) - F(i); /* interior */
});
PetscCall(DMDAVecRestoreKokkosOffsetView(da, xl, &X));
PetscCall(DMDAVecRestoreKokkosOffsetViewWrite(da, r, &R));
PetscCall(DMDAVecRestoreKokkosOffsetView(da, user->F, &F));
PetscCall(DMRestoreLocalVector(da, &xl));
PetscFunctionReturn(PETSC_SUCCESS);
The global indices of the lower left corner of the local portion of vectors obtained from DMDA as well as the
local array size can be obtained with the commands
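Sketches of the two calls (pass NULL for values that are not needed):
DMDAGetCorners(DM da, PetscInt *x, PetscInt *y, PetscInt *z, PetscInt *m, PetscInt *n, PetscInt *p);
DMDAGetGhostCorners(DM da, PetscInt *x, PetscInt *y, PetscInt *z, PetscInt *m, PetscInt *n, PetscInt *p);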
These values can then be used as loop bounds for local function evaluations as demonstrated in the function
examples above.
The first version excludes any ghost points, while the second version includes them. The routine
DMDAGetGhostCorners() deals with the fact that subarrays along boundaries of the problem domain have ghost
points only on their interior edges, but not on their boundary edges.
When either type of stencil is used, DMDA_STENCIL_STAR or DMDA_STENCIL_BOX, the local vec-
tors (with the ghost points) represent rectangular arrays, including the extra corner elements in the
DMDA_STENCIL_STAR case. This configuration provides simple access to the elements by employing two- (or
three-) dimensional indexing. The only difference between the two cases is that when DMDA_STENCIL_STAR
is used, the extra corner components are not scattered between the processes and thus contain undefined
For structured grids with staggered data (living on elements, faces, edges, and/or vertices), the DMStag
object is available. It behaves much like DMDA; see the DMSTAG manual page for more information.
static PetscErrorCode UpdateVelocity_2d(const Ctx *ctx, Vec velocity, Vec stress, Vec buoyancy)
{
Vec velocity_local, stress_local, buoyancy_local;
PetscInt ex, ey, startx, starty, nx, ny;
PetscInt slot_coord_next, slot_coord_element, slot_coord_prev;
PetscInt slot_vx_left, slot_vy_down, slot_buoyancy_down, slot_buoyancy_left;
PetscFunctionBeginUser;
PetscCall(DMGetLocalVector(ctx->dm_buoyancy, &buoyancy_local));
PetscCall(DMGlobalToLocal(ctx->dm_buoyancy, buoyancy, INSERT_VALUES, buoyancy_local));
PetscCall(DMGetLocalVector(ctx->dm_stress, &stress_local));
PetscCall(DMGlobalToLocal(ctx->dm_stress, stress, INSERT_VALUES, stress_local));
PetscCall(DMStagVecGetArrayRead(ctx->dm_stress, stress_local, (void *)&arr_stress));
/* Update x-velocity */
See DMPlex: Unstructured Grids for discussion on setting vector values with DMPLEX.
As listed in the table, we have chosen certain basic vector operations to support within the PETSc vector
library. These operations were selected because they often arise in application codes. The NormType
argument to VecNorm() is one of NORM_1, NORM_2, or NORM_INFINITY. The 1-norm is $\sum_i |x_i|$, the
2-norm is $(\sum_i x_i^2)^{1/2}$ and the infinity norm is $\max_i |x_i|$.
In addition to VecDot() and VecMDot() and VecNorm(), PETSc provides split phase versions of these
that allow several independent inner products and/or norms to share the same communication (thus im-
proving parallel efficiency). For example, one may have code such as
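A sketch of the kind of code meant (two inner products and two norms, hence four separate reductions; the variable names are illustrative):
VecDot(x, y, &dot);
VecMDot(x, nv, z, vals);
VecNorm(x, NORM_2, &norm2);
VecNorm(x, NORM_1, &norm1);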
This code works fine, but it performs four separate parallel communication operations. Instead, one can
write
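the same computation with the split-phase routines (again a sketch with illustrative names):
VecDotBegin(x, y, &dot);
VecMDotBegin(x, nv, z, vals);
VecNormBegin(x, NORM_2, &norm2);
VecNormBegin(x, NORM_1, &norm1);
VecDotEnd(x, y, &dot);
VecMDotEnd(x, nv, z, vals);
VecNormEnd(x, NORM_2, &norm2);
VecNormEnd(x, NORM_1, &norm1);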
With this code, the communication is delayed until the first call to VecxxxEnd(), at which point a single MPI
reduction is used to communicate all the required values. It is required that the calls to the VecxxxEnd()
are performed in the same order as the calls to the VecxxxBegin(); however, if you mistakenly make the
calls in the wrong order, PETSc will generate an error informing you of this. There are additional routines
VecTDotBegin() and VecTDotEnd(), VecMTDotBegin(), VecMTDotEnd().
For GPU vectors (like CUDA), the numerical computations will, by default, run on the GPU. Any scalar
output, like the result of a VecDot(), is placed in CPU memory.
Many PDE problems require the use of ghost (or halo) values in each MPI rank or even more general parallel
communication of vector values. These values are needed in order to perform function evaluation on that
rank. The exact structure of the ghost values needed depends on the type of grid being used. DM provides a
uniform API for communicating the needed values. We introduce the concept in detail for DMDA.
Each DM object defines the layout of two vectors: a distributed global vector and a local vector that includes
room for the appropriate ghost points. The DM object provides information about the size and layout of
these vectors, but does not internally allocate any associated storage space for field values. Instead, the user
can create vector objects that use the DM layout information with the routines
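Sketches of the two creation calls:
DMCreateGlobalVector(DM dm, Vec *g);
DMCreateLocalVector(DM dm, Vec *l);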
These vectors will generally serve as the building blocks for local and global PDE solutions, etc. If additional
vectors with such layout information are needed in a code, they can be obtained by duplicating l or g via
VecDuplicate() or VecDuplicateVecs().
We emphasize that a distributed array provides the information needed to communicate the ghost value
information between processes. In most cases, several different vectors can share the same communication
information (or, in other words, can share a given DM). The design of the DM object makes this easy, as each
DM operation may operate on vectors of the appropriate size, as obtained via DMCreateLocalVector()
and DMCreateGlobalVector() or as produced by VecDuplicate().
At certain stages of many applications, there is a need to work on a local portion of the vector that includes
the ghost points. This may be done by scattering a global vector into its local parts by using the two-stage
commands
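A sketch of the two-stage scatter:
DMGlobalToLocalBegin(DM da, Vec g, InsertMode iora, Vec l);
DMGlobalToLocalEnd(DM da, Vec g, InsertMode iora, Vec l);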
which allow the overlap of communication and computation. Since the global and local vectors, given by
g and l, respectively, must be compatible with the distributed array, da, they should be generated by
DMCreateGlobalVector() and DMCreateLocalVector() (or be duplicates of such a vector obtained
via VecDuplicate()). The InsertMode can be either ADD_VALUES or INSERT_VALUES.
One can scatter the local vectors into the distributed global vector with the command
or the commands
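Sketches of the one-stage and two-stage forms:
DMLocalToGlobal(DM da, Vec l, InsertMode mode, Vec g);
DMLocalToGlobalBegin(DM da, Vec l, InsertMode mode, Vec g);
DMLocalToGlobalEnd(DM da, Vec l, InsertMode mode, Vec g);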
In general this is used with an InsertMode of ADD_VALUES, because if one wishes to insert values into
the global vector they should just access the global vector directly and put in the values.
A third type of distributed array scatter is from a local vector (including ghost points that contain irrelevant
values) to a local vector with correct ghost point values. This scatter may be done with the commands
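A sketch of the local-to-local scatter:
DMLocalToLocalBegin(DM da, Vec l1, InsertMode iora, Vec l2);
DMLocalToLocalEnd(DM da, Vec l1, InsertMode iora, Vec l2);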
Since both local vectors, l1 and l2, must be compatible with the distributed array, da, they should be gen-
erated by DMCreateLocalVector() (or be duplicates of such vectors obtained via VecDuplicate()).
The InsertMode can be either ADD_VALUES or INSERT_VALUES.
In most applications the local ghosted vectors are only needed during user “function evaluations”. PETSc
provides an easy, light-weight (requiring essentially no CPU time) way to obtain these work vectors and
return them when they are no longer needed. This is done with the routines
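Sketches of the get/restore pair:
DMGetLocalVector(DM da, Vec *l);
DMRestoreLocalVector(DM da, Vec *l);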
Most users of PETSc who can utilize a DM will not need the lower-level routines discussed in the
rest of this section and should skip ahead to Matrices.
To facilitate creating general vector scatters and gathers used, for example, in updating ghost points for
problems for which no DM currently exists, PETSc employs the concept of an index set, via the IS class. An index
set, which is a generalization of a set of integer indices, is used to define scatters, gathers, and similar
operations on vectors and matrices. In fact, much of the underlying code that implements DMGlobalToLocal
communication is built on the infrastructure discussed below.
The following command creates an index set based on a list of integers:
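A sketch of the call (the argument names match the description below):
ISCreateGeneral(MPI_Comm comm, PetscInt n, const PetscInt indices[], PetscCopyMode mode, IS *is);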
When mode is PETSC_COPY_VALUES, this routine copies the n indices passed to it by the integer array
indices. Thus, the user should be sure to free the integer array indices when it is no longer needed,
perhaps directly after the call to ISCreateGeneral(). The communicator, comm, should consist of all
processes that will be using the IS.
Another standard index set is defined by a starting point (first) and a stride (step), and can be created
with the command
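A sketch of the call:
ISCreateStride(MPI_Comm comm, PetscInt n, PetscInt first, PetscInt step, IS *is);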
Index sets can be destroyed with the command
ISDestroy(IS *is);
On rare occasions the user may need to access information directly from an index set. Several commands
assist in this process:
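For example (a sketch of a few of them):
ISGetIndices(IS is, const PetscInt *indices[]);
ISGetSize(IS is, PetscInt *size);
ISGetLocalSize(IS is, PetscInt *size);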
The function ISGetIndices() returns a pointer to a list of the indices in the index set. For certain index
sets, this may be a temporary array of indices created specifically for a given routine. Thus, once the user
finishes using the array of indices, the routine
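ISRestoreIndices(IS is, const PetscInt *indices[]);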
should be called to ensure that the system can free the space it may have used to generate the list of indices.
A blocked version of the index sets can be created with the command
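A sketch of the call (bs is the block size):
ISCreateBlock(MPI_Comm comm, PetscInt bs, PetscInt n, const PetscInt indices[], PetscCopyMode mode, IS *is);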
This version is used for defining operations in which each element of the index set refers to a block of
bs vector entries. Related routines analogous to those described above exist as well, including
ISBlockGetIndices(), ISBlockGetSize(), ISBlockGetLocalSize(), ISGetBlockSize(). See the man
pages for details.
Most PETSc applications use a particular DM object to manage the details of the communication needed for
their grids. In some rare cases, however, codes need to directly set up their required communication patterns.
This is done using PETSc’s VecScatter and PetscSF (for more general data than vectors). One can select
any subset of the components of a vector to insert or add to any subset of the components of another vector.
We refer to these operations as generalized scatters, though they are actually a combination of scatters and
gathers.
To copy selected components from one vector to another, one uses the following set of commands:
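A sketch of the sequence (see the manual pages for the exact signatures):
VecScatterCreate(Vec x, IS ix, Vec y, IS iy, VecScatter *ctx);
VecScatterBegin(VecScatter ctx, Vec x, Vec y, INSERT_VALUES, SCATTER_FORWARD);
VecScatterEnd(VecScatter ctx, Vec x, Vec y, INSERT_VALUES, SCATTER_FORWARD);
VecScatterDestroy(VecScatter *ctx);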
Here ix denotes the index set of the first vector, while iy indicates the index set of the destination vector.
The vectors can be parallel or sequential. The only requirements are that the number of entries in the
index set of the first vector, ix, equals the number in the destination index set, iy, and that the vectors
be long enough to contain all the indices referred to in the index sets. If both x and y are parallel,
their communicator must have the same set of processes, but their process order can be different. The
argument INSERT_VALUES specifies that the vector elements will be inserted into the specified locations
of the destination vector, overwriting any existing values. To add the components, rather than insert them,
the user should select the option ADD_VALUES instead of INSERT_VALUES. One can also use MAX_VALUES
or MIN_VALUES to replace destination with the maximal or minimal of its current value and the scattered
values.
To perform a conventional gather operation, the user simply makes the destination index set, iy, be a stride
index set with a stride of one. Similarly, a conventional scatter can be done with an initial (sending) index
set consisting of a stride. The scatter routines are collective operations (i.e. all processes that own a parallel
vector must call the scatter routines). When scattering from a parallel vector to sequential vectors, each
process has its own sequential vector that receives values from locations as indicated in its own index set.
Similarly, in scattering from sequential vectors to a parallel vector, each process has its own sequential vector
that makes contributions to the parallel vector.
Caution: When INSERT_VALUES is used, if two different processes contribute different values to the same
component in a parallel vector, either value may end up being inserted. When ADD_VALUES is used, the
correct sum is added to the correct location.
In some cases one may wish to “undo” a scatter, that is perform the scatter backwards, switching the roles
of the sender and receiver. This is done by using
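A sketch:
VecScatterBegin(VecScatter ctx, Vec y, Vec x, INSERT_VALUES, SCATTER_REVERSE);
VecScatterEnd(VecScatter ctx, Vec y, Vec x, INSERT_VALUES, SCATTER_REVERSE);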
Note that the roles of the first two arguments to these routines must be swapped whenever the
SCATTER_REVERSE option is used.
Once a VecScatter object has been created it may be used with any vectors that have the appropriate
parallel data layout. That is, one can call VecScatterBegin() and VecScatterEnd() with different
vectors than used in the call to VecScatterCreate() as long as they have the same parallel layout (number
of elements on each process are the same). Usually, these “different” vectors would have been obtained via
calls to VecDuplicate() from the original vectors used in the call to VecScatterCreate().
VecGetValues() can only access local values from the vector. To get off-process values, the user should
create a new vector where the components are to be stored, and then perform the appropriate vector scatter.
For example, if one desires to obtain the values of the 100th and 200th entries of a parallel vector, p, one
could use a code such as that below. In this example, the values of the 100th and 200th components are
placed in the array values. In this example each process now has the 100th and 200th component, but
obviously each process could gather any elements it needed, or none by creating an index set with no entries.
VecScatter scatter;   /* scatter context */
IS from, to;          /* index sets that define the scatter */
Vec x;                /* sequential vector to hold the gathered values */
PetscScalar *values;
PetscInt idx_from[] = {100, 200}, idx_to[] = {0, 1};
VecCreateSeq(PETSC_COMM_SELF,2,&x);
ISCreateGeneral(PETSC_COMM_SELF,2,idx_from,PETSC_COPY_VALUES,&from);
ISCreateGeneral(PETSC_COMM_SELF,2,idx_to,PETSC_COPY_VALUES,&to);
VecScatterCreate(p,from,x,to,&scatter);
VecScatterBegin(scatter,p,x,INSERT_VALUES,SCATTER_FORWARD);
VecScatterEnd(scatter,p,x,INSERT_VALUES,SCATTER_FORWARD);
VecGetArray(x,&values);
/* work with the gathered values, then restore the array */
VecRestoreArray(x,&values);
ISDestroy(&from);
ISDestroy(&to);
VecScatterDestroy(&scatter);
The scatter comprises two stages, in order to allow overlap of communication and computation. The intro-
duction of the VecScatter context allows the communication patterns for the scatter to be computed once
and then reused repeatedly. Generally, even setting up the communication for a scatter requires communi-
cation; hence, it is best to reuse such information when possible.
Generalized scatters provide a very general method for managing the communication of required ghost
values for unstructured grid computations. One scatters the global vector into a local “ghosted” work
vector, performs the computation on the local work vectors, and then scatters back into the global solution
vector. In the simplest case this may be written as
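A sketch of this pattern, using the vector names referred to below (globalin, localin, localout, and globalout are illustrative):
VecScatterBegin(VecScatter scatter, Vec globalin, Vec localin, InsertMode INSERT_VALUES, ScatterMode SCATTER_FORWARD);
VecScatterEnd(VecScatter scatter, Vec globalin, Vec localin, InsertMode INSERT_VALUES, ScatterMode SCATTER_FORWARD);
/* perform the local computation, producing localout from localin */
VecScatterBegin(VecScatter scatter, Vec localout, Vec globalout, InsertMode ADD_VALUES, ScatterMode SCATTER_REVERSE);
VecScatterEnd(VecScatter scatter, Vec localout, Vec globalout, InsertMode ADD_VALUES, ScatterMode SCATTER_REVERSE);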
When working with a global representation of a vector (usually obtained with DMCreateGlobalVector()) and
a local representation of the same vector that includes the ghost points required for local computation
(obtained with DMCreateLocalVector()), PETSc provides routines to help map indices from a local numbering
scheme to the PETSc global numbering scheme. This is done via the following routines:
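A sketch of the creation call (see the manual pages for the exact signature and the related apply routines):
ISLocalToGlobalMappingCreate(MPI_Comm comm, PetscInt bs, PetscInt N, PetscInt *globalnum, PetscCopyMode mode, ISLocalToGlobalMapping *ctx);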
Here N denotes the number of local indices, globalnum contains the global number of each local num-
ber, and ISLocalToGlobalMapping is the resulting PETSc object that contains the information needed
to apply the mapping with either ISLocalToGlobalMappingApply() or ISLocalToGlobalMappingApplyIS().
Note that the ISLocalToGlobalMapping routines serve a different purpose than the AO routines. In the
former case they provide a mapping from a local numbering scheme (including ghost points) to a global
numbering scheme, while in the latter they provide a mapping between two global numbering schemes. In
fact, many applications may use both AO and ISLocalToGlobalMapping routines. The AO routines are
first used to map from an application global ordering (that has no relationship to parallel processing etc.) to
the PETSc ordering scheme (where each process has a contiguous set of indices in the numbering). Then in
order to perform function or Jacobian evaluations locally on each process, one works with a local numbering
scheme that includes ghost points. The mapping from this local numbering scheme back to the global PETSc
numbering can be handled with the ISLocalToGlobalMapping routines.
If one is given a list of block indices in a global numbering, the routine
ISGlobalToLocalMappingApplyBlock(ISLocalToGlobalMapping ctx, ISGlobalToLocalMappingMode type,
                                 PetscInt nin, PetscInt idxin[], PetscInt *nout, PetscInt idxout[]);
will provide a new list of indices in the local numbering. Again, negative values in idxin are left unmapped.
But, in addition, if type is set to IS_GTOLM_MASK , then nout is set to nin and all global values in
idxin that are not represented in the local to global mapping are replaced by -1. When type is set to
IS_GTOLM_DROP, the values in idxin that are not represented locally in the mapping are not included
in idxout, so that potentially nout is smaller than nin. One must pass in an array long enough to
hold all the indices. One can call ISGlobalToLocalMappingApplyBlock() with idxout equal to
NULL to determine the required length (returned in nout) and then allocate the required space and call
ISGlobalToLocalMappingApplyBlock() a second time to set the values.
Often it is convenient to set elements into a vector using the local node numbering rather than the global
node numbering (e.g., each process may maintain its own sublist of vertices and elements and number them
locally). To set values into a vector with the local numbering, one must first call
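A sketch of the mapping call and the subsequent local insertion call:
VecSetLocalToGlobalMapping(Vec v, ISLocalToGlobalMapping ctx);
VecSetValuesLocal(Vec x, PetscInt n, const PetscInt indices[], const PetscScalar values[], INSERT_VALUES);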
Now the indices use the local numbering, rather than the global, meaning the entries lie in [0, n) where
n is the local size of the vector. Global vectors obtained from DMs already have the global-to-local
mapping provided by the DM.
To assemble global stiffness matrices, one can use these global indices with MatSetValues() or
MatSetValuesStencil(). Alternately, the global node number of each local node, including the ghost nodes,
can be obtained by calling
DMGetLocalToGlobalMapping(DM da,ISLocalToGlobalMapping *map);
followed by
VecSetLocalToGlobalMapping(Vec v,ISLocalToGlobalMapping map);
MatSetLocalToGlobalMapping(Mat A,ISLocalToGlobalMapping rmapping,ISLocalToGlobalMapping cmapping);
Now entries may be added to the vector and matrix using the local numbering and VecSetValuesLocal()
and MatSetValuesLocal().
The example SNES Tutorial ex5 illustrates the use of a distributed array in the solution of a nonlinear
problem. The analogous Fortran program is SNES Tutorial ex5f90; see SNES: Nonlinear Solvers for a
discussion of the nonlinear solvers.
There are two minor drawbacks to the basic approach described above for unstructured grids:
• the extra memory requirement for the local work vector, localin, which duplicates the memory in
globalin, and
• the extra time required to copy the local values from localin to globalin.
An alternative approach is to allocate global vectors with space preallocated for the ghost values; with the
low-level vector interface this may be done with either
VecCreateGhost(MPI_Comm comm,PetscInt n,PetscInt N,PetscInt nghost,PetscInt *ghosts,Vec *vv)
or
VecCreateGhostWithArray(MPI_Comm comm,PetscInt n,PetscInt N,PetscInt nghost,PetscInt *ghosts,PetscScalar *array,Vec *vv)
Here n is the number of local vector entries, N is the number of global entries (or NULL) and nghost is
the number of ghost entries. The array ghosts is of size nghost and contains the global vector location
for each local ghost location. Using VecDuplicate() or VecDuplicateVecs() on a ghosted vector will
generate additional ghosted vectors.
In many ways, a ghosted vector behaves just like any other MPI vector created by VecCreateMPI(). The
difference is that the ghosted vector has an additional “local” representation that allows one to access the
ghost locations. This is done through the call to
VecGhostGetLocalForm(Vec g,Vec *l);
The vector l is a sequential representation of the parallel vector g that shares the same array space (and
hence numerical values), but allows one to access the “ghost” values past “the end of the” array. Note that
one accesses the entries in l using the local numbering of elements and ghosts, while they are accessed in g
using the global numbering.
A common usage of a ghosted vector is given by
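a pattern such as the following sketch, where x is a ghosted vector created as above and xlocal is an illustrative name for its local form:
/* ... set the owned entries of x ... */
VecGhostUpdateBegin(x, INSERT_VALUES, SCATTER_FORWARD); /* communicate owned values to ghost slots */
VecGhostUpdateEnd(x, INSERT_VALUES, SCATTER_FORWARD);
VecGhostGetLocalForm(x, &xlocal);                       /* local form: owned entries followed by ghosts */
/* ... compute with xlocal using the local numbering ... */
VecGhostRestoreLocalForm(x, &xlocal);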
When writing parallel PDE codes, there is extra complexity caused by having multiple ways of indexing
(numbering) and ordering objects such as vertices and degrees of freedom. For example, a grid generator or
partitioner may renumber the nodes, requiring adjustment of the other data structures that refer to these
objects; see Figure Natural Ordering and PETSc Ordering for a 2D Distributed Array (Four Processes).
PETSc provides a variety of tools to help to manage the mapping amongst the various numbering systems.
The most basic are the AO (application ordering), which enables mapping between different global (cross-
process) numbering schemes.
In many applications it is desirable to work with one or more “orderings” (or numberings) of degrees of
freedom, cells, nodes, etc. Doing so in a parallel environment is complicated by the fact that each process
cannot keep complete lists of the mappings between different orderings. In addition, the orderings used in
the PETSc linear algebra routines (often contiguous ranges) may not correspond to the “natural” orderings
for the application.
PETSc provides certain utility routines that allow one to deal cleanly and efficiently with the various order-
ings. To define a new application ordering (called an AO in PETSc), one can call the routine
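A sketch of the call (the argument names match the description below):
AOCreateBasic(MPI_Comm comm, PetscInt n, const PetscInt apordering[], const PetscInt petscordering[], AO *ao);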
The arrays apordering and petscordering, respectively, contain a list of integers in the application
ordering and their corresponding mapped values in the PETSc ordering. Each process can provide whatever
subset of the ordering it chooses, but multiple processes should never contribute duplicate values. The
argument n indicates the number of local contributed values.
For example, consider a vector of length 5, where node 0 in the application ordering corresponds to node 3
in the PETSc ordering. In addition, nodes 1, 2, 3, and 4 of the application ordering correspond, respectively,
to nodes 2, 1, 4, and 0 of the PETSc ordering.
The user can create the PETSc AO mappings in a number of ways. For example, if using two processes, the first process could call
AOCreateBasic(PETSC_COMM_WORLD,2,{0,3},{3,4},&ao);
while the second process could call
AOCreateBasic(PETSC_COMM_WORLD,3,{1,2,4},{2,1,0},&ao);
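Once the AO is created, the orderings can be mapped with the following routines (a sketch; the array indices is overwritten in place):
AOApplicationToPetsc(AO ao, PetscInt n, PetscInt *indices);
AOPetscToApplication(AO ao, PetscInt n, PetscInt *indices);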
Upon input, the n-dimensional array indices specifies the indices to be mapped, while upon output,
indices contains the mapped values. Since we, in general, employ a parallel database for the AO mappings,
it is crucial that all processes that called AOCreateBasic() also call these routines; these routines cannot be
called by just a subset of processes in the MPI communicator that was used in the call to AOCreateBasic().
An alternative routine to create the application ordering, AO, is
AOCreateBasicIS(IS apordering,IS petscordering,AO *ao);
where index sets are used in place of integer arrays. The routines AOPetscToApplicationIS() and
AOApplicationToPetscIS() will map index sets (IS objects) between orderings. Both the AOXxxToYyy() and
AOXxxToYyyIS() routines can be used regardless of whether the AO was created with AOCreateBasic() or
AOCreateBasicIS().
The AO context should be destroyed with AODestroy(AO *ao) and viewed with AOView(AO ao,
PetscViewer viewer).
Although we refer to the two orderings as “PETSc” and “application” orderings, the user is free to use them
both for application orderings and to maintain relationships among a variety of orderings by employing
several AO contexts.
The AOxxToxx() routines allow negative entries in the input integer array. These entries are not mapped;
they simply remain unchanged. This functionality enables, for example, mapping neighbor lists that use
negative numbers to indicate nonexistent neighbors due to boundary conditions, etc.
Since the global ordering that PETSc uses to manage its parallel vectors (and matrices) does not usually
correspond to the “natural” ordering of a two- or three-dimensional array, the DMDA structure provides an
application ordering AO (see Application Orderings) that maps between the natural ordering on a rectangular
grid and the ordering PETSc uses to parallelize. This ordering context can be obtained with the command
DMDAGetAO(DM da,AO *ao);
In Figure Natural Ordering and PETSc Ordering for a 2D Distributed Array (Four Processes) we indicate
the orderings for a two-dimensional distributed array, divided among four processes.
[Figure content omitted: the left grid shows the natural (global row-major) ordering of the 30 grid points, while the right grid shows the corresponding PETSc ordering, in which each process's entries are numbered contiguously.]
Fig. 2.2: Natural Ordering and PETSc Ordering for a 2D Distributed Array (Four Processes)
2.2 Matrices
PETSc provides a variety of matrix implementations because no single matrix format is appropriate for all
problems. Currently, we support dense storage and compressed sparse row storage (both sequential and
parallel versions) for CPU and GPU based matrices, as well as several specialized formats. Additional
specialized formats can be easily added.
This chapter describes the basics of using PETSc matrices in general (regardless of the particular format
chosen) and discusses tips for efficient use of the several simple uniprocess and parallel matrix types. The
use of PETSc matrices involves the following actions: create a particular type of matrix, insert values into it,
process the matrix, use the matrix for various computations, and finally destroy the matrix. The application
code does not need to know or care about the particular storage formats of the matrices.
As with vectors, PETSc has APIs that allow the user to specify the exact details of the matrix creation
process but also DM based creation routines that handle most of the details automatically for specific families
of applications. This is done with
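A sketch of the DM-based creation call:
DMCreateMatrix(DM dm, Mat *A);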
or with
DMSetFromOptions(DM dm)
and the options database option -dm_mat_type <aij or baij or aijcusparse etc>. Matrices can
be created for CPU usage, for GPU usage, and for usage on both the CPUs and GPUs.
The creation of DM objects is discussed in DMDA - Creating vectors for structured grids, DMPLEX - Creating
vectors for unstructured grids, DMNETWORK - Creating vectors for networks.
When using a DM is not practical for a particular application one can create matrices directly using
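A sketch of the basic creation sequence (standard routines; see their manual pages for details):
MatCreate(MPI_Comm comm, Mat *A);
MatSetSizes(Mat A, PetscInt m, PetscInt n, PetscInt M, PetscInt N);
MatSetFromOptions(Mat A);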
This routine generates a sequential matrix when running one process and a parallel matrix for two or more
processes; the particular matrix format is set by the user via options database commands. The user specifies
either the global matrix dimensions, given by M and N or the local dimensions, given by m and n while PETSc
completely controls memory allocation. This routine facilitates switching among various matrix types, for
example, to determine the format that is most efficient for a certain application. By default, MatCreate()
employs the sparse AIJ format, which is discussed in detail in Sparse Matrices. See the manual pages for
further information about available matrix formats.
To insert or add entries to a matrix on CPUs, one can call a variant of MatSetValues() with either
INSERT_VALUES or ADD_VALUES as the final argument, as sketched below.
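The two variants differ only in the InsertMode argument:
MatSetValues(Mat A, PetscInt m, const PetscInt idxm[], PetscInt n, const PetscInt idxn[], const PetscScalar values[], INSERT_VALUES);
MatSetValues(Mat A, PetscInt m, const PetscInt idxm[], PetscInt n, const PetscInt idxn[], const PetscScalar values[], ADD_VALUES);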
This routine inserts or adds a logically dense subblock of dimension m*n into the matrix. The integer indices
idxm and idxn, respectively, indicate the global row and column numbers to be inserted. MatSetValues()
uses the standard C convention, where the row and column matrix indices begin with zero regardless of the
programming language employed. The array values is logically two-dimensional, containing the values
that are to be inserted. By default the values are given in row major order, which is the opposite of the
Fortran convention, meaning that the value to be put in row idxm[i] and column idxn[j] is located in
values[i*n+j]. To allow the insertion of values in column major order, one can call the command
MatSetOption(Mat A,MAT_ROW_ORIENTED,PETSC_FALSE);
Warning: Several of the sparse implementations do not currently support the column-oriented option.
This notation should not be a mystery to anyone. For example, to insert one matrix into another when using
MATLAB, one uses the command A(im,in) = B; where im and in contain the indices for the rows and
columns. This action is identical to the calls above to MatSetValues().
When using the block compressed sparse row matrix format (MATSEQBAIJ or MATMPIBAIJ), one can
insert elements more efficiently using the block variant, MatSetValuesBlocked() or
MatSetValuesBlockedLocal().
The function MatSetOption() accepts several other inputs; see the manual page for details.
After the matrix elements have been inserted or added into the matrix, they must be processed (also called
“assembled”) before they can be used. The routines for matrix processing are
MatAssemblyBegin(Mat A,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(Mat A,MAT_FINAL_ASSEMBLY);
By placing other code between these two calls, the user can perform computations while messages are in
transit. Calls to MatSetValues() with the INSERT_VALUES and ADD_VALUES options cannot be mixed
without intervening calls to the assembly routines. For such intermediate assembly calls the second routine
argument typically should be MAT_FLUSH_ASSEMBLY, which omits some of the work of the full assembly
process. MAT_FINAL_ASSEMBLY is required only in the last matrix assembly before a matrix is used.
Even though one may insert values into PETSc matrices without regard to which process eventually stores
them, for efficiency reasons we usually recommend generating most entries on the process where they are
destined to be stored. To help the application programmer with this task for matrices that are distributed
across the processes by ranges, the routine
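MatGetOwnershipRange(Mat A, PetscInt *first_row, PetscInt *last_row);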
informs the user that all rows from first_row to last_row-1 (since the value returned in last_row is
one more than the global index of the last local row) will be stored on the local process.
In the sparse matrix implementations, once the assembly routines have been called, the matrices are com-
pressed and can be used for matrix-vector multiplication, etc. Any space for preallocated nonzeros that
was not filled by a call to MatSetValues() or a related routine is compressed out by assembling with
MAT_FINAL_ASSEMBLY. If you intend to use that extra space later, be sure to insert explicit zeros before
assembling with MAT_FINAL_ASSEMBLY so the space will not be compressed out. Once the matrix has been
assembled, inserting new values will be expensive since it will require copies and possible memory allocation.
One may repeatedly assemble matrices that retain the same nonzero pattern (such as within a nonlinear or
time-dependent problem). Where possible, data structures and communication information will be reused
(instead of regenerated) during successive steps, thereby increasing efficiency. See KSP Tutorial ex5 for a
simple example of solving two linear systems that use the same matrix data structure.
For matrices associated with DMDA there is a higher-level interface for providing the numerical values based
on the concept of stencils. See the manual page of MatSetValuesStencil() for usage.
For GPUs the routines MatSetPreallocationCOO() and MatSetValuesCOO() should be used for
efficient matrix assembly instead of MatSetValues().
We now introduce the various families of PETSc matrices. DMCreateMatrix() manages the preallocation
process (introduced below) automatically so many users do not need to worry about the details of the
preallocation process.
The layout of PETSc matrices across MPI ranks is defined by two things
• the layout of the two compatible vectors in the computation of the matrix-vector product y = A * x
and
• the memory where various parts of the matrix are stored across the MPI ranks.
PETSc vectors always have a contiguous range of vector entries stored on each MPI rank. The first rank
has entries from 0 to rend1 - 1, the next rank has entries from rend1 to rend2 - 1, and so on. Thus the
ownership range on each rank is from rstart to rend; these values can be obtained with
VecGetOwnershipRange(Vec x, PetscInt *rstart, PetscInt *rend). Each PETSc Vec has a PetscLayout
object that contains this information.
All PETSc matrices have two PetscLayouts; they define the vector layouts for y and x in the product y = A * x.
Their ownership range information can be obtained with MatGetOwnershipRange() and related routines such
as MatGetOwnershipRangeColumn().
Sparse Matrices
The default matrix representation within PETSc is the general sparse AIJ format (also called the compressed
sparse row format, CSR). This section discusses tips for efficiently using this matrix format for large-scale
applications. Additional formats (such as block compressed row and block symmetric storage, which are
generally much more efficient for problems with multiple degrees of freedom per node) are discussed below.
Beginning users need not concern themselves initially with such details and may wish to proceed directly to
Basic Matrix Operations. However, when an application code progresses to the point of tuning for efficiency
and/or generating timing results, it is crucial to read this information.
In the PETSc AIJ matrix formats, we store the nonzero elements by rows, along with an array of corre-
sponding column numbers and an array of pointers to the beginning of each row. Note that the diagonal
matrix entries are stored with the rest of the nonzeros (not separately).
To create a sequential AIJ sparse matrix, A, with m rows and n columns, one uses the command
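A sketch of the call (the preallocation arguments nz and nnz are discussed below):
MatCreateSeqAIJ(PETSC_COMM_SELF, PetscInt m, PetscInt n, PetscInt nz, PetscInt *nnz, Mat *A);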
where nz or nnz can be used to preallocate matrix memory, as discussed below. The user can set nz=0 and
nnz=NULL for PETSc to control all matrix memory allocation.
The sequential and parallel AIJ matrix storage formats by default employ i-nodes (identical nodes) when
possible. We search for consecutive rows with the same nonzero structure, thereby reusing matrix information
for increased efficiency. Related options database keys are -mat_no_inode (do not use i-nodes) and
-mat_inode_limit <limit> (set i-node limit (max limit=5)). Note that problems with a single degree
of freedom per grid node will automatically not use i-nodes.
The internal data representation for the AIJ formats employs zero-based indexing.
The dynamic process of allocating new memory and copying from the old storage to the new is intrinsically
very expensive. Thus, to obtain good performance when assembling an AIJ matrix, it is crucial to preallocate
the memory needed for the sparse matrix. The user has two choices for preallocating matrix memory via
MatCreateSeqAIJ().
One can use the scalar nz to specify the expected number of nonzeros for each row. This is generally fine
if the number of nonzeros per row is roughly the same throughout the matrix (or as a quick and easy first
step for preallocation). If one underestimates the actual number of nonzeros in a given row, then during the
assembly process PETSc will automatically allocate additional needed space. However, this extra memory
allocation can slow the computation.
If different rows have very different numbers of nonzeros, one should attempt to indicate (nearly) the exact
number of elements intended for the various rows with the optional array, nnz of length m, where m is the
number of rows, for example
PetscInt nnz[m];
nnz[0] = <nonzeros in row 0>
nnz[1] = <nonzeros in row 1>
....
nnz[m-1] = <nonzeros in row m-1>
In this case, the assembly process will require no additional memory allocations if the nnz estimates are
correct. If, however, the nnz estimates are incorrect, PETSc will automatically obtain the additional needed
space, at a slight loss of efficiency.
Using the array nnz to preallocate memory is especially important for efficient matrix assembly if the number
of nonzeros varies considerably among the rows. One can generally set nnz either by knowing in advance the
problem structure (e.g., the stencil for finite difference problems on a structured grid) or by precomputing
the information by using a segment of code similar to that for the regular matrix assembly. The overhead
of determining the nnz array will be quite small compared with the overhead of the inherently expensive
mallocs and moves of data that are needed for dynamic allocation during matrix assembly. Always guess
high if an exact value is not known (extra space is cheaper than too little).
Thus, when assembling a sparse matrix with very different numbers of nonzeros in various rows, one could
proceed as follows for finite difference methods:
1. Allocate integer array nnz.
2. Loop over grid, counting the expected number of nonzeros for the row(s) associated with the various
grid points.
3. Create the sparse matrix via MatCreateSeqAIJ() or alternative.
4. Loop over the grid, generating matrix entries and inserting in matrix via MatSetValues().
For (vertex-based) finite element type calculations, an analogous procedure is as follows:
1. Allocate integer array nnz.
2. Loop over vertices, computing the number of neighbor vertices, which determines the number of nonze-
ros for the corresponding matrix row(s).
3. Create the sparse matrix via MatCreateSeqAIJ() or alternative.
4. Loop over elements, generating matrix entries and inserting in matrix via MatSetValues().
The -info option causes the routines MatAssemblyBegin() and MatAssemblyEnd() to print infor-
mation about the success of the preallocation. Consider the following example for the MATSEQAIJ matrix
format:
The first line indicates that the user preallocated 120 spaces but only 100 were used. The second line indicates
that the user preallocated enough space so that PETSc did not have to internally allocate additional space
(an expensive operation). In the next example the user did not preallocate sufficient space, as indicated by
the fact that the number of mallocs is very large (bad for efficiency):
Although at first glance such procedures for determining the matrix structure in advance may seem unusual,
they are actually very efficient because they alleviate the need for dynamic construction of the matrix data
structure, which can be very expensive.
Parallel sparse matrices with the AIJ format can be created with the command
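A sketch of the call (the preallocation arguments are discussed below):
MatCreateAIJ(MPI_Comm comm, PetscInt m, PetscInt n, PetscInt M, PetscInt N, PetscInt d_nz,
             PetscInt *d_nnz, PetscInt o_nz, PetscInt *o_nnz, Mat *A);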
A is the newly created matrix, while the arguments m, M, and N, indicate the number of local rows and the
number of global rows and columns, respectively. In the PETSc partitioning scheme, all the matrix columns
are local and n is the number of columns corresponding to the local part of a parallel vector. Either the
local or global parameters can be replaced with PETSC_DECIDE, so that PETSc will determine them. The
matrix is stored with a fixed number of rows on each process, given by m, or determined by PETSc if m is
PETSC_DECIDE.
If PETSC_DECIDE is not used for the arguments m and n, then the user must ensure that they are chosen
to be compatible with the vectors. To do this, one first considers the matrix-vector product y = Ax. The m
that is used in the matrix creation routine MatCreateAIJ() must match the local size used in the vector
creation routine VecCreateMPI() for y. Likewise, the n used must match that used as the local size in
VecCreateMPI() for x.
The user must set d_nz=0, o_nz=0, d_nnz=NULL, and o_nnz=NULL for PETSc to control dynamic
allocation of matrix memory space. Analogous to nz and nnz for the routine MatCreateSeqAIJ(), these
arguments optionally specify nonzero information for the diagonal (d_nz and d_nnz) and off-diagonal (o_nz
and o_nnz) parts of the matrix. For a square global matrix, we define each process’s diagonal portion to
be its local rows and the corresponding columns (a square submatrix); each process’s off-diagonal portion
encompasses the remainder of the local matrix (a rectangular submatrix). The rank in the MPI communicator
determines the absolute ordering of the blocks. That is, the process with rank 0 in the communicator given
to MatCreateAIJ() contains the top rows of the matrix; the ith process in that communicator contains
the ith block of the matrix.
As discussed above, preallocation of memory is critical for achieving good performance during matrix assem-
bly, as this reduces the number of allocations and copies required. We present an example for three processes
to indicate how this may be done for the MATMPIAIJ matrix format. Consider the 8 by 8 matrix, which is
partitioned by default with three rows on the first process, three on the second and two on the third.
1 2 0 | 0 3 0 | 0 4
0 5 6 | 7 0 0 | 8 0
9 0 10 | 11 0 0 | 12 0
13 0 14 | 15 16 17 | 0 0
0 18 0 | 19 20 21 | 0 0
0 0 0 | 22 23 0 | 24 0
25 26 27 | 0 0 28 | 29 0
30 0 0 | 31 32 33 | 0 34
For the first process one could set d_nz to 2 (since each row has 2 nonzeros) or, alternatively, set d_nnz to
{2, 2, 2}. The o_nz could be set to 2 since each row of the o matrix has 2 nonzeros, or o_nnz could be set
to {2, 2, 2}.
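For illustration, a sketch of the corresponding creation call on the first process follows (each process passes its own local sizes and arrays to the collective call; the values here simply mirror the 8 by 8 example above):

PetscInt d_nnz[3] = {2,2,2},o_nnz[3] = {2,2,2};
Mat      A;
/* rank 0 owns 3 rows and 3 "diagonal" columns of the 8 x 8 matrix */
MatCreateAIJ(PETSC_COMM_WORLD,3,3,8,8,0,d_nnz,0,o_nnz,&A);
/* when d_nnz and o_nnz are provided, the d_nz and o_nz arguments are ignored */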
For the second process the d submatrix is given by
15 16 17
19 20 21 .
22 23 0
Thus, one could set d_nz to 3, since the maximum number of nonzeros in each row is 3, or alternatively one
could set d_nnz to {3, 3, 2}, thereby indicating that the first two rows will have 3 nonzeros while the third
has 2. The corresponding o submatrix for the second process is
13 0 14 0 0
0 18 0 0 0
0 0 0 24 0
When the matrix is assembled with the -info option, PETSc reports how many values were stashed during MatAssemblyBegin() for communication to other processes. A stash containing zero (or very few) entries and mallocs indicates that very few values have been generated on processes other than the owner. On the other hand, the statements
MatAssemblyBegin_MPIAIJ:Stash has 100000 entries, uses 100 mallocs
MatAssemblyBegin_MPIAIJ:Stash has 77777 entries, uses 70 mallocs
indicate that many values have been generated on the “wrong” processes. This situation can be very
inefficient, since the transfer of values to the “correct” process is generally expensive. By using the command
MatGetOwnershipRange() in application codes, the user should be able to generate most entries on the
owning process.
Note: It is fine to generate some entries on the “wrong” process. Often this can lead to cleaner, simpler, less
buggy codes. One should never make code overly complicated in order to generate all values locally. Rather,
one should organize the code in such a way that most values are generated locally.
The routine MatCreateAIJCUSPARSE() allows one to create GPU based matrices for NVIDIA systems.
MatCreateAIJKokkos() can create matrices for use with CPU, OpenMP, NVIDIA, AMD, or Intel based
GPU systems.
It is sometimes difficult to compute the required preallocation information efficiently, hence PETSc provides
a special MatType, MATPREALLOCATOR, that helps make computing this information more straightforward. One first creates a matrix of this type and then, using the same code that one would use to actually compute the matrix's numerical values, calls MatSetValues() for this matrix, without needing to provide any preallocation information (one need not provide the actual numerical values either). Once this is complete one uses
MatPreallocatorPreallocate() to provide the accumulated preallocation information to the actual
matrix one will use for the computations. We hope to simplify this process in the future, allowing the removal
of MATPREALLOCATOR, instead simply allowing the use of its efficient insertion process automatically during
the first assembly of any matrix type directly without requiring the detailed preallocation information.
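As an illustration, a minimal sketch of this workflow is shown below; the sizes m, n, M, N and the insertion loop are assumed to come from the application code.

Mat preall,A;
MatCreate(PETSC_COMM_WORLD,&preall);
MatSetSizes(preall,m,n,M,N);
MatSetType(preall,MATPREALLOCATOR);
MatSetUp(preall);
/* run the same insertion loop that will later fill A; the values themselves are ignored */
/* MatSetValues(preall,...); */
MatAssemblyBegin(preall,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(preall,MAT_FINAL_ASSEMBLY);
MatCreate(PETSC_COMM_WORLD,&A);
MatSetSizes(A,m,n,M,N);
MatSetType(A,MATAIJ);
MatPreallocatorPreallocate(preall,PETSC_TRUE,A);   /* transfer the accumulated preallocation to A */
MatDestroy(&preall);
/* now insert the real values into A and assemble it as usual */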
See doc_matrix for a table of the matrix types available in PETSc.
Limited-Memory Variable Metric (LMVM) Matrices

Variable metric methods, also known as quasi-Newton methods, are frequently used for root finding problems
and approximate Jacobian matrices or their inverses via sequential nonlinear updates based on the secant
condition. The limited-memory variants do not store the full explicit Jacobian, and instead compute forward
products and inverse applications based on a fixed number of stored update vectors.
PETSc implements seven different LMVM matrices listed in the table above. They can be created using the
MatCreate() and MatSetType() workflow, and share a number of common interface functions. We will
review the most important ones below:
• MatLMVMAllocate(Mat B, Vec X, Vec F) – Creates the internal data structures necessary to
store nonlinear updates and compute forward/inverse applications. The X vector defines the solution
space while the F defines the function space for the history of updates.
• MatLMVMUpdate(Mat B, Vec X, Vec F) – Applies a nonlinear update to the approximate Jacobian such that s_k = x_k − x_{k−1} and y_k = f(x_k) − f(x_{k−1}), where k is the index for the update.
• MatLMVMReset(Mat B, PetscBool destructive) – Flushes the accumulated nonlinear updates
and resets the matrix to the initial state. If destructive = PETSC_TRUE, the reset also destroys
the internal data structures and necessitates another allocation call before the matrix can be updated
and used for products and solves.
• MatLMVMSetJ0(Mat B, Mat J0) – Defines the initial Jacobian to apply the updates to. If no
initial Jacobian is provided, the updates are applied to an identity matrix.
LMVM matrices can be applied to vectors in forward mode via MatMult() or MatMultAdd(), and in in-
verse mode via MatSolve(). They also support MatCreateVecs(), MatDuplicate() and MatCopy()
operations.
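As a rough sketch of typical usage inside a quasi-Newton iteration (the vectors x, f, and dx are placeholders assumed to be created by the application):

Mat B;
MatCreate(PETSC_COMM_WORLD,&B);
MatSetType(B,MATLMVMBFGS);   /* any of the LMVM implementations can be chosen here */
MatLMVMAllocate(B,x,f);      /* size the internal storage from the solution and function vectors */
/* inside the nonlinear iteration, after computing the new x and f = F(x): */
MatLMVMUpdate(B,x,f);        /* successive calls accumulate the s_k and y_k update pairs */
MatSolve(B,f,dx);            /* apply the approximate inverse Jacobian to obtain a step */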
Restricted Broyden Family, DFP and BFGS methods additionally implement special Jacobian initialization
and scaling options available via -mat_lmvm_scale_type <none,scalar,diagonal>. We describe
these choices below:
• none – Sets the initial Jacobian to be equal to the identity matrix. No extra computations are required
when obtaining the search direction or updating the approximation. However, the number of function
evaluations required to converge the Newton solution is typically much larger than what is required
when using other initializations.
• scalar – Defines the initial Jacobian as a scalar multiple of the identity matrix. The scalar value σ
is chosen by solving the one dimensional optimization problem
where S and Y are the matrices whose columns contain a subset of the update vectors s_k and y_k, and
α ∈ [0, 1] is defined by the user via -mat_lmvm_alpha and has a different default value for each
LMVM implementation (e.g.: the default α = 1 for BFGS produces the well-known y_k^T s_k / y_k^T y_k scalar
initialization). The number of updates to be used in the S and Y matrices is 1 by default (i.e.: the
latest update only) and can be changed via -mat_lmvm_sigma_hist. This technique is inspired by
Gilbert and Lemarechal [ref-GL89].
• diagonal – Uses a full-memory restricted Broyden update formula to construct a diagonal matrix for
the Jacobian initialization. Although the full-memory formula is utilized, the actual memory footprint
is restricted to only the vector representing the diagonal and some additional work vectors used in its
construction. The diagonal terms are also re-scaled with every update as suggested in [ref-GL89]. This
initialization requires the most computational effort of the available choices but typically results in a
significant reduction in the number of function evaluations taken to compute a solution.
Note that the user-provided initial Jacobian via MatLMVMSetJ0() overrides and disables all built-in ini-
tialization methods.
Dense Matrices
PETSc provides both sequential and parallel dense matrix formats, where each process stores its entries in a
column-major array in the usual Fortran style. To create a sequential, dense PETSc matrix, A of dimensions
m by n, the user should call
MatCreateSeqDense(PETSC_COMM_SELF,PetscInt m,PetscInt n,PetscScalar *data,Mat *A);
The variable data enables the user to optionally provide the location of the data for matrix storage (intended
for Fortran users who wish to allocate their own storage space). Most users should merely set data to NULL
for PETSc to control matrix memory allocation. To create a parallel, dense matrix, A, the user should call
MatCreateDense(MPI_Comm comm,PetscInt m,PetscInt n,PetscInt M,PetscInt N,PetscScalar *data,Mat *A);
The arguments m, n, M, and N, indicate the number of local rows and columns and the number of global rows
and columns, respectively. Either the local or global parameters can be replaced with PETSC_DECIDE, so
that PETSc will determine them. The matrix is stored with a fixed number of rows on each process, given
by m, or determined by PETSc if m is PETSC_DECIDE.
PETSc does not provide parallel dense direct solvers, instead interfacing to external packages that provide
these solvers. Our focus is on sparse iterative solvers.
Block Matrices
Block matrices arise when coupling variables with different meaning, especially when solving problems with
constraints (e.g. incompressible flow) and “multi-physics” problems. Usually the number of blocks is small
and each block is partitioned in parallel. We illustrate for a 3 × 3 system with components labeled a, b, c.
With some numbering of unknowns, the matrix could be written as
Aaa Aab Aac
Aba Abb Abc .
Aca Acb Acc
There are two fundamentally different ways that this matrix could be stored, as a single assembled sparse
matrix where entries from all blocks are merged together (“monolithic”), or as separate assembled matrices for
each block (“nested”). These formats have different performance characteristics depending on the operation
being performed. In particular, many preconditioners require a monolithic format, but some that are very
effective for solving block systems (see Solving Block Matrices) are more efficient when a nested format is
used. In order to stay flexible, we would like to be able to use the same code to assemble block matrices
in both monolithic and nested formats. Additionally, for software maintainability and testing, especially in
a multi-physics context where different groups might be responsible for assembling each of the blocks, it is
desirable to be able to use exactly the same code to assemble a single block independently as to assemble it
as part of a larger system. To do this, we introduce the four spaces shown in Fig. 2.3.
• The monolithic global space is the space in which the Krylov and Newton solvers operate, with collective
semantics across the entire block system.
• The split global space splits the blocks apart, but each split still has collective semantics.
• The split local space adds ghost points and separates the blocks. Operations in this space can be
performed with no parallel communication. This is often the most natural, and certainly the most
powerful, space for matrix assembly code.
• The monolithic local space can be thought of as adding ghost points to the monolithic global space,
but it is often more natural to use it simply as a concatenation of split local spaces on each process. It
is not common to explicitly manipulate vectors or matrices in this space (at least not during assembly),
but it is a useful for declaring which part of a matrix is being assembled.
Fig. 2.3: The relationship between spaces used for coupled assembly.
Matrix assembly into a block, independent of the storage format, is performed through

MatGetLocalSubMatrix(Mat A,IS isrow,IS iscol,Mat *submat);

which provides a “view” submat into a matrix A that operates in the monolithic global space. The submat
transforms from the split local space defined by iscol to the split local space defined by isrow. The
index sets specify the parts of the monolithic local space that submat should operate in. If a nested matrix
format is used, then MatGetLocalSubMatrix() finds the nested block and returns it without making
any copies. In this case, submat is fully functional and has a parallel communicator. If a monolithic
matrix format is used, then MatGetLocalSubMatrix() returns a proxy matrix on PETSC_COMM_SELF
that does not provide values or implement MatMult(), but does implement MatSetValuesLocal()
and, if isrow,iscol have a constant block size, MatSetValuesBlockedLocal(). Note that although
submat may not be a fully functional matrix and the caller does not even know a priori which communicator
it will reside on, it always implements the local assembly functions (which are not collective). The index
sets isrow,iscol can be obtained using DMCompositeGetLocalISs() if DMCOMPOSITE is being used.
DMCOMPOSITE can also be used to create matrices, in which case the MATNEST format can be specified
using -prefix_dm_mat_type nest and MATAIJ can be specified using -prefix_dm_mat_type aij.
See SNES Tutorial ex28 for a simple example using this interface.
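As an illustration, a sketch of assembling into one block through such a view is given below; the index sets isrow and iscol and the local indices are assumed to come from the application (for example, from DMCompositeGetLocalISs()).

Mat         sub;
PetscInt    row = 0,col = 0;   /* local (split local space) indices; illustrative only */
PetscScalar value = 1.0;
MatGetLocalSubMatrix(A,isrow,iscol,&sub);
MatSetValuesLocal(sub,1,&row,1,&col,&value,ADD_VALUES);
MatRestoreLocalSubMatrix(A,isrow,iscol,&sub);
MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);
MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);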
Table 2.2 summarizes basic PETSc matrix operations. We briefly discuss a few of these routines in more
detail below.
The parallel matrix can multiply a vector with n local entries, returning a vector with m local entries. That is, to form the product

MatMult(Mat A,Vec x,Vec y);

the vectors x and y should be created with
VecCreateMPI(MPI_Comm comm,n,N,&x);
VecCreateMPI(MPI_Comm comm,m,M,&y);
By default, if the user lets PETSc decide the number of components to be stored locally (by passing in
PETSC_DECIDE as the second argument to VecCreateMPI() or using VecCreate()), vectors and ma-
trices of the same dimension are automatically compatible for parallel matrix-vector operations.
Along with the matrix-vector multiplication routine, there is a version for the transpose of the matrix,

MatMultTranspose(Mat A,Vec x,Vec y);

There are also versions that add the result to another vector:

MatMultAdd(Mat A,Vec x,Vec y,Vec w);
MatMultTransposeAdd(Mat A,Vec x,Vec y,Vec w);

These routines, respectively, produce w = A*x + y and w = A^T*x + y. In C it is legal for the vectors y and w to be identical. In Fortran, this situation is forbidden by the language standard, but we allow it anyway.
One can print a matrix (sequential or parallel) to the screen with the command
MatView(Mat mat,PETSC_VIEWER_STDOUT_WORLD);
Other viewers can be used as well. For instance, one can draw the nonzero structure of the matrix into the
default X-window with the command
MatView(Mat mat,PETSC_VIEWER_DRAW_WORLD);

Also one can use

MatView(Mat mat,PetscViewer viewer);

where viewer was obtained with PetscViewerDrawOpen(). Additional viewers and options are given
in the MatView() man page and Viewers: Looking at PETSc Objects.
Some people like to use matrix-free methods, which do not require explicit storage of the matrix, for the
numerical solution of partial differential equations. To support matrix-free methods in PETSc, one can use
the following command to create a Mat structure without ever actually generating the matrix:

MatCreateShell(MPI_Comm comm,PetscInt m,PetscInt n,PetscInt M,PetscInt N,void *ctx,Mat *mat);
Here M and N are the global matrix dimensions (rows and columns), m and n are the local matrix dimensions,
and ctx is a pointer to data needed by any user-defined shell matrix operations; the manual page has
additional details about these parameters. Most matrix-free algorithms require only the application of the
linear operator to a vector. To provide this action, the user must write a routine with the calling sequence

PetscErrorCode UserMult(Mat mat,Vec x,Vec y);

and then associate it with the matrix, mat, by using the command

MatShellSetOperation(Mat mat,MATOP_MULT,(void (*)(void))UserMult);
Here MATOP_MULT is the name of the operation for matrix-vector multiplication. Within each user-defined
routine (such as UserMult()), the user should call MatShellGetContext() to obtain the user-defined
context, ctx, that was set by MatCreateShell(). This shell matrix can be used with the iterative linear
equation solvers discussed in the following chapters.
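For concreteness, a minimal sketch of such a user routine and its registration follows; the context structure MyCtx and the scaling operator are purely illustrative.

typedef struct { PetscReal scale; } MyCtx;   /* illustrative user-defined context */

PetscErrorCode UserMult(Mat mat,Vec x,Vec y)
{
  MyCtx *ctx;
  PetscFunctionBeginUser;
  PetscCall(MatShellGetContext(mat,&ctx));   /* recover the context passed to MatCreateShell() */
  PetscCall(VecCopy(x,y));
  PetscCall(VecScale(y,ctx->scale));         /* here the "operator" is just a scaling of the identity */
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* inside the application code: */
MyCtx ctx = {2.0};
Mat   S;
MatCreateShell(PETSC_COMM_WORLD,m,n,M,N,&ctx,&S);
MatShellSetOperation(S,MATOP_MULT,(void (*)(void))UserMult);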
The routine MatShellSetOperation() can be used to set any other matrix operations as well. The
file $PETSC_DIR/include/petscmat.h (source) provides a complete list of matrix operations, which
have the form MATOP_<OPERATION>, where <OPERATION> is the name (in all capital letters) of the user
interface routine (for example, MatMult() → MATOP_MULT). All user-provided functions have the same
calling sequence as the usual matrix interface routines, since the user-defined functions are intended to be
accessed through the same interface, e.g., MatMult(Mat,Vec,Vec) → UserMult(Mat,Vec,Vec). The
final argument for MatShellSetOperation() needs to be cast to a void *, since the final argument
could (depending on the MatOperation) be a variety of different functions.
Note that MatShellSetOperation() can also be used as a “backdoor” means of introducing user-defined
changes in matrix operations for other storage formats (for example, to override the default LU factorization
routine supplied within PETSc for the MATSEQAIJ format). However, we urge anyone who introduces such
changes to use caution, since it would be very easy to accidentally create a bug in the new routine that could
affect other routines as well.
See also Matrix-Free Methods for details on one set of helpful utilities for using the matrix-free approach for
nonlinear solvers.
The routine

MatTranspose(Mat A,MatReuse reuse,Mat *B);

explicitly forms the transpose of a matrix, either in-place or out-of-place. After it has been called with MAT_INITIAL_MATRIX it may be called again with MAT_REUSE_MATRIX, and it will recompute the transpose if the A matrix has changed. Internally it keeps track of whether the nonzero pattern of A has changed, so it reuses the symbolic transpose when possible for efficiency.
The routine

MatTransposeSymbolic(Mat A,Mat *B);

only does the symbolic transpose of the matrix. After it is called, MatTranspose() may be called with MAT_REUSE_MATRIX to compute the numerical transpose.
Occasionally one may already have a matrix B with the needed sparsity pattern to store the transpose and may want to reuse that space instead of creating a new matrix with MatTranspose(A,MAT_INITIAL_MATRIX,&B). In that case one cannot simply call MatTranspose(A,MAT_REUSE_MATRIX,&B); instead one first calls MatTransposeSetPrecursor(A,B) and then calls MatTranspose(A,MAT_REUSE_MATRIX,&B). This routine just provides to B the meta-data it needs to compute the numerical transpose efficiently.
The routine MatCreateTranspose(A,&B) provides a surrogate matrix B that behaves like the transpose of A without forming the transpose explicitly. For example, MatMult(B,x,y) will compute the matrix-vector product of the transpose of A with x.
In many iterative calculations (for instance, in a nonlinear equations solver), it is important for efficiency
purposes to reuse the nonzero structure of a matrix, rather than determining it anew every time the matrix
is generated. To retain a given matrix but reinitialize its contents, one can employ
MatZeroEntries(Mat A);
This routine will zero the matrix entries in the data structure but keep all the data that indicates where
the nonzeros are located. In this way a new matrix assembly will be much less expensive, since no memory
allocations or copies will be needed. Of course, one can also explicitly set selected matrix elements to zero
by calling MatSetValues().
By default, if new entries are made in locations where no nonzeros previously existed, space will be allocated
for the new entries. To prevent the allocation of additional memory and simply discard those new entries,
one can use the option
MatSetOption(Mat A,MAT_NEW_NONZERO_LOCATIONS,PETSC_FALSE);
Once the matrix has been assembled, one can factor it numerically without repeating the ordering or the
symbolic factorization. This option can save some computational time, although it does require that the
factorization is not done in-place.
In the numerical solution of elliptic partial differential equations, it can be cumbersome to deal with Dirichlet
boundary conditions. In particular, one would like to assemble the matrix without regard to boundary
conditions and then at the end apply the Dirichlet boundary conditions. In numerical analysis classes this
process is usually presented as moving the known boundary conditions to the right-hand side and then solving
a smaller linear system for the interior unknowns. Unfortunately, implementing this requires extracting a
large submatrix from the original matrix and creating its corresponding data structures. This process can
be expensive in terms of both time and memory.
One simple way to deal with this difficulty is to replace those rows in the matrix associated with known boundary conditions by rows of the identity matrix (or some scaling of it). This action can be done with the command MatZeroRows(), or equivalently with MatZeroRowsIS(), which takes an IS of rows rather than an explicit list.
For sparse matrices this removes the data structures for those rows of the matrix. If the pointer diag_value is NULL, it even removes the diagonal entry. If the pointer is not NULL, the value it points to is placed in the diagonal entry of the eliminated rows.
One nice feature of this approach is that when solving a nonlinear problem such that at each iteration the
Dirichlet boundary conditions are in the same positions and the matrix retains the same nonzero structure,
the user can call MatZeroRows() in the first iteration. Then, before generating the matrix in the second
iteration the user should call
MatSetOption(Mat A,MAT_NEW_NONZERO_LOCATIONS,PETSC_FALSE);
From that point, no new values will be inserted into those (boundary) rows of the matrix.
The functions MatZeroRowsLocal() and MatZeroRowsLocalIS() can also be used if for each process
one provides the Dirichlet locations in the local numbering of the matrix. A drawback of MatZeroRows()
is that it destroys the symmetry of a matrix. Thus, to retain symmetry, one can instead use MatZeroRowsColumns(), or equivalently MatZeroRowsColumnsIS(), which zero the listed rows and the corresponding columns.
Note that, for a given assembled matrix, each of these routines can be called only once to update the x and b vectors. They cannot be used if one wishes to solve multiple right-hand-side problems for the same matrix, since the matrix entries needed for updating the b vector are removed in the first call.
After the rows are zeroed, the matrix may have many rows containing only a diagonal entry, which can affect the parallel load balance. The PCREDISTRIBUTE preconditioner removes all the zeroed rows (and associated columns), adjusts the right-hand side based on the removed columns, and then rebalances the remaining rows of the smaller matrix across the processes. Thus one can use MatZeroRows() to set the Dirichlet points and then solve with the preconditioner PCREDISTRIBUTE. Note that if the original matrix was symmetric, the smaller solved matrix will also be symmetric.
Another matrix routine of interest is

MatConvert(Mat mat,MatType newtype,MatReuse reuse,Mat *M);

which converts the matrix mat to a new matrix, M, that has either the same or a different format. Set newtype
to MATSAME to copy the matrix, keeping the same matrix format. See $PETSC_DIR/include/petscmat.
h (source) for other available matrix types; standard ones are MATSEQDENSE, MATSEQAIJ, MATMPIAIJ,
MATSEQBAIJ and MATMPIBAIJ.
In certain applications it may be necessary for application codes to directly access elements of a matrix.
This may be done by using the command (for local rows only)

MatGetRow(Mat A,PetscInt row,PetscInt *ncols,const PetscInt *cols[],const PetscScalar *vals[]);
The argument ncols returns the number of nonzeros in that row, while cols and vals returns the column
indices (with indices starting at zero) and values in the row. If only the column indices are needed (and not
the corresponding matrix elements), one can use NULL for the vals argument. Similarly, one can use NULL
for the cols argument. The user can only examine the values extracted with MatGetRow(); the values
cannot be altered. To change the matrix entries, one must use MatSetValues().
Once the user has finished using a row, he or she must call

MatRestoreRow(Mat A,PetscInt row,PetscInt *ncols,const PetscInt *cols[],const PetscScalar *vals[]);

to free any space that was allocated during the call to MatGetRow().
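For example, a small sketch that loops over the locally owned rows of an assembled matrix A and sums the absolute values in each row:

PetscInt           rstart,rend,row,j,ncols;
const PetscInt    *cols;
const PetscScalar *vals;
MatGetOwnershipRange(A,&rstart,&rend);
for (row = rstart; row < rend; row++) {
  PetscReal rowsum = 0.0;
  MatGetRow(A,row,&ncols,&cols,&vals);
  for (j = 0; j < ncols; j++) rowsum += PetscAbsScalar(vals[j]);
  MatRestoreRow(A,row,&ncols,&cols,&vals);
  /* use rowsum ... */
}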
Many sparse matrix operations can be optimized by dividing the computation into two stages: a sym-
bolic stage that creates any required data structures and does all the computations that do not require
the matrices’ numerical values followed by one or more uses of a numerical stage that use the symbolically
computed information. Examples of such operations include MatTranspose(), MatCreateSubMatrices(), MatCholeskyFactorSymbolic(), and MatCholeskyFactorNumeric(). PETSc uses two different APIs to take advantage of these optimizations.
The first approach explicitly divides the computation in the API. This approach is used, for example, with
MatCholeskyFactorSymbolic() and MatCholeskyFactorNumeric(). The caller can take advantage
of their knowledge of changes in the nonzero structure of the sparse matrices to call the appropriate routines
as needed. In fact, they can use MatGetNonzeroState() to determine if a new symbolic computation is
needed. The drawback of this approach is that the caller of these routines has to manage the creation of
new matrices when the nonzero structure changes.
The second approach, as exemplified by MatTranspose(), does not expose the two stages explicitly in the API; instead a flag, MatReuse, is passed through the API to indicate whether a symbolic data structure is already
available or needs to be computed. Thus MatTranspose(A,MAT_INITIAL_MATRIX,&B) is called first,
then MatTranspose(A,MAT_REUSE_MATRIX,&B) can be called repeatedly with new numerical values in
the A matrix. In theory, if the nonzero structure of A changes, the symbolic computations for B could be
redone automatically inside the same B matrix when there is a change in the nonzero state of the A matrix. In
practice, in PETSc, the MAT_REUSE_MATRIX for most PETSc routines only works if the nonzero structure
does not change and the code may crash otherwise. The advantage of this approach (when the nonzero
structure changes are handled correctly) is that the calling code does not need to keep track of the nonzero
state of the matrices; everything “just works”. However, the caller must still know when it is the first call to
the routine so that the flag MAT_INITIAL_MATRIX is used. If the underlying implementation language supported detecting a not-yet-initialized variable at run time, the MatReuse flag would not be needed.
PETSc uses two approaches because the same programming problem was solved in two different ways during PETSc's early development. A better model would combine both approaches: an explicit separation of the stages and a unified operation that internally utilizes the two stages appropriately and also handles changes to the nonzero structure. Code could be simplified in many places with this approach; in most places the use of the unified API would replace the use of the separate stages.
See Extracting Submatrices and Matrix-Matrix Products.
PETSc has four families of graph operations that treat sparse Mat as representing graphs.
2.2.10 Partitioning
For almost all unstructured grid computation, the distribution of portions of the grid across the processes' workload and memory can have a very large impact on performance. In most PDE calculations the grid
partitioning and distribution across the processes can (and should) be done in a “pre-processing” step
before the numerical computations. However, this does not mean it need be done in a separate, sequential
program; rather, it should be done before one sets up the parallel grid data structures in the actual program.
PETSc provides an interface to ParMETIS (developed by George Karypis; see the PETSc installation instructions for directions on installing PETSc to use ParMETIS) to allow the partitioning to be done in parallel. PETSc does not currently provide direct support for dynamic repartitioning, load balancing by
migrating matrix entries between processes, etc. For problems that require mesh refinement, PETSc uses the
“rebuild the data structure” approach, as opposed to the “maintain dynamic data structures that support
the insertion/deletion of additional vector and matrix rows and columns entries” approach.
Partitioning in PETSc is organized around the MatPartitioning object. One first creates a parallel
matrix that contains the connectivity information about the grid (or other graph-type object) that is to be
partitioned. This is done with the command

MatCreateMPIAdj(MPI_Comm comm,PetscInt mlocal,PetscInt n,const PetscInt ia[],const PetscInt ja[],PetscInt *weights,Mat *Adj);
The argument mlocal indicates the number of rows of the graph being provided by the given process, n is
the total number of columns; equal to the sum of all the mlocal. The arguments ia and ja are the row
pointers and column pointers for the given rows; these are the usual format for parallel compressed sparse
row storage, using indices starting at 0, not 1.
Fig. 2.4: Numbering on Simple Unstructured Grid
This, of course, assumes that one has already distributed the grid (graph) information among the processes.
The details of this initial distribution is not important; it could be simply determined by assigning to the
first process the first n0 nodes from a file, the second process the next n1 nodes, etc.
For example, we demonstrate the form of the ia and ja for a triangular grid where we
(1) partition by element (triangle)
• Process 0: mlocal = 2, n = 4, ja ={2,3, 3}, ia = {0,2,3}
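The partitioning itself is then computed with the MatPartitioning object; a minimal sketch, assuming the adjacency matrix Adj was created with MatCreateMPIAdj() as above, is:

MatPartitioning part;
IS              is,isg;
MatPartitioningCreate(PETSC_COMM_WORLD,&part);
MatPartitioningSetAdjacency(part,Adj);   /* the connectivity (adjacency) matrix */
MatPartitioningSetFromOptions(part);     /* e.g. -mat_partitioning_type parmetis */
MatPartitioningApply(part,&is);          /* new process assignment of each local node */
ISPartitioningToNumbering(is,&isg);      /* new global number of each local node */
MatPartitioningDestroy(&part);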
The resulting isg contains for each local node the new global number of that node. The resulting is
contains the new process number that each local node has been assigned to.
Now that a new numbering of the nodes has been determined, one must renumber all the nodes and migrate
the grid information to the correct process. The command
AOCreateBasicIS(isg,NULL,&ao);
generates, see Application Orderings, an AO object that can be used in conjunction with the is and isg to
move the relevant grid information to the correct process and renumber the nodes etc. In this context, the
new ordering is the “application” ordering so AOPetscToApplication() converts old global indices to
new global indices and AOApplicationToPetsc() converts new global indices back to old global indices.
PETSc does not currently provide tools that completely manage the migration and node renumbering, since
it will be dependent on the particular data structure you use to store the grid information and the type of
grid information that you need for your application. We do plan to include more support for this in the
future, but designing the appropriate general user interface and providing a scalable implementation that
can be used for a wide variety of different grids requires a great deal of time.
See Finite Difference Jacobian Approximations and Matrix Factorization for discussions on performing graph
coloring and computing graph reorderings to reduce fill in sparse matrix factorizations.
2.3 KSP: Linear System Solvers

The KSP object is the heart of PETSc, because it provides uniform and efficient access to all of the package's
linear system solvers, including parallel and sequential, direct and iterative. KSP is intended for solving
systems of the form
Ax = b, (2.1)
where A denotes the matrix representation of a linear operator, b is the right-hand-side vector, and x is the
solution vector. KSP uses the same calling sequence for both direct and iterative solution of a linear system.
In addition, particular solution techniques and their associated options can be selected at runtime.
The combination of a Krylov subspace method and a preconditioner is at the center of most modern numerical
codes for the iterative solution of linear systems. Many textbooks (e.g. [FGN92] [vdV03], or [Saa03]) provide
an overview of the theory of such methods. The KSP package, discussed in Krylov Methods, provides many
popular Krylov subspace iterative methods; the PC module, described in Preconditioners, includes a variety
of preconditioners.
To solve a linear system with KSP, one must first create a solver context with the command

KSPCreate(MPI_Comm comm,KSP *ksp);

Here comm is the MPI communicator and ksp is the newly formed solver context. Before actually solving a linear system with KSP, the user must call the following routine to set the matrices associated with the linear system:

KSPSetOperators(KSP ksp,Mat Amat,Mat Pmat);
The argument Amat, representing the matrix that defines the linear system, is a symbolic placeholder
for any kind of matrix or operator. In particular, KSP does support matrix-free methods. The routine
MatCreateShell() in Matrix-Free Matrices provides further information regarding matrix-free methods.
Typically, the matrix from which the preconditioner is to be constructed, Pmat, is the same as the matrix
that defines the linear system, Amat; however, occasionally these matrices differ (for instance, when a
preconditioning matrix is obtained from a lower order method than that employed to form the linear system
matrix).
Much of the power of KSP can be accessed through the single routine
KSPSetFromOptions(KSP ksp);
This routine accepts the option -help as well as any of the KSP and PC options discussed below. To solve
a linear system, one sets the right-hand-side and solution vectors using the command

KSPSolve(KSP ksp,Vec b,Vec x);

where b and x respectively denote the right-hand-side and solution vectors. On return, the iteration number at which the iterative process stopped can be obtained using

KSPGetIterationNumber(KSP ksp,PetscInt *its);
Note that this does not state that the method converged at this iteration: it can also have reached the
maximum number of iterations, or have diverged.
Convergence Tests gives more details regarding convergence testing. Note that multiple linear solves can be
performed by the same KSP context. Once the KSP context is no longer needed, it should be destroyed with
the command
KSPDestroy(KSP *ksp);
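Putting these calls together, a minimal sketch of the basic solution sequence (assuming the matrix A and the vectors b and x have already been created and assembled) is:

KSP      ksp;
PetscInt its;
KSPCreate(PETSC_COMM_WORLD,&ksp);
KSPSetOperators(ksp,A,A);   /* use A both as the operator and to build the preconditioner */
KSPSetFromOptions(ksp);     /* honor -ksp_type, -pc_type, -ksp_rtol, ... */
KSPSolve(ksp,b,x);
KSPGetIterationNumber(ksp,&its);
KSPDestroy(&ksp);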
The above procedure is sufficient for general use of the KSP package. One additional step is required for
users who wish to customize certain preconditioners (e.g., see Block Jacobi and Overlapping Additive Schwarz
Preconditioners) or to log certain performance data using the PETSc profiling facilities (as discussed in
Profiling). In this case, the user can optionally explicitly call
KSPSetUp(KSP ksp);
before calling KSPSolve() to perform any setup required for the linear solvers. The explicit call of this
routine enables the separate monitoring of any computations performed during the set up phase, such as
incomplete factorization for the ILU preconditioner.
The default solver within KSP is restarted GMRES, preconditioned for the uniprocess case with ILU(0), and
for the multiprocess case with the block Jacobi method (with one block per process, each of which is solved
with ILU(0)). A variety of other solvers and options are also available. To allow application programmers to
set any of the preconditioner or Krylov subspace options directly within the code, we provide routines that
extract the PC and KSP contexts,

KSPGetPC(KSP ksp,PC *pc);
The application programmer can then directly call any of the PC or KSP routines to modify the corresponding
default options.
To solve a linear system with a direct solver (currently supported by PETSc for sequential matrices, and
by several external solvers through PETSc interfaces, see Using External Linear Solvers) one may use the
options -ksp_type preonly (or the equivalent -ksp_type none) -pc_type lu (see below).
By default, if a direct solver is used, the factorization is not done in-place. This approach prevents the user
from the unexpected surprise of having a corrupted matrix after a linear solve. The routine PCFactorSe-
tUseInPlace(), discussed below, causes factorization to be done in-place.
When solving multiple linear systems of the same size with the same method, several options are available.
To solve successive linear systems having the same preconditioner matrix (i.e., the same data structure
with exactly the same matrix elements) but different right-hand-side vectors, the user should simply call
KSPSolve() multiple times. The preconditioner setup operations (e.g., factorization for ILU) will be done
during the first call to KSPSolve() only; such operations will not be repeated for successive solves.
To solve successive linear systems that have different preconditioner matrices (i.e., the matrix elements
and/or the matrix data structure change), the user must call KSPSetOperators() and KSPSolve() for
each solve.
The Krylov subspace methods accept a number of options, many of which are discussed below. First, to set
the Krylov subspace method that is to be used, one calls the command

KSPSetType(KSP ksp,KSPType method);
The type can be one of KSPRICHARDSON, KSPCHEBYSHEV, KSPCG, KSPGMRES, KSPTCQMR, KSPBCGS,
KSPCGS, KSPTFQMR, KSPCR, KSPLSQR, KSPBICG, KSPPREONLY (or the equivalent KSPNONE), or others;
see KSP Objects or the KSPType man page for more. The KSP method can also be set with the options
database command -ksp_type, followed by one of the options richardson, chebyshev, cg, gmres,
tcqmr, bcgs, cgs, tfqmr, cr, lsqr, bicg, preonly (or the equivalent none), or others (see KSP
Objects or the KSPType man page). There are method-specific options. For instance, for the Richardson, Chebyshev, and GMRES methods:

KSPRichardsonSetScale(KSP ksp,PetscReal scale);
KSPChebyshevSetEigenvalues(KSP ksp,PetscReal emax,PetscReal emin);
KSPGMRESSetRestart(KSP ksp,PetscInt max_steps);
The default parameter values are scale=1.0, emax=0.01, emin=100.0, and max_steps=30. The
GMRES restart and Richardson damping factor can also be set with the options -ksp_gmres_restart
<n> and -ksp_richardson_scale <factor>.
The default technique for orthogonalization of the Krylov vectors in GMRES is the unmodified (classical)
Gram-Schmidt method, which can be set with
KSPGMRESSetOrthogonalization(KSP ksp,KSPGMRESClassicalGramSchmidtOrthogonalization);
or the options database command -ksp_gmres_classicalgramschmidt. By default this will not use
iterative refinement to improve the stability of the orthogonalization. This can be changed with the routine
KSPGMRESSetCGSRefinementType(KSP ksp,KSPGMRESCGSRefinementType type)
Since the rate of convergence of Krylov projection methods for a particular linear system is strongly dependent
on its spectrum, preconditioning is typically used to alter the spectrum and hence accelerate the convergence
rate of iterative techniques. Preconditioning can be applied to the system (2.1) by

ML^{-1} A MR^{-1} (MR x) = ML^{-1} b,    (2.2)

where ML and MR indicate preconditioning matrices (or, matrices from which the preconditioner is to be constructed). If ML = I in (2.2), right preconditioning results, and the residual of (2.1),

r ≡ b − Ax = b − A MR^{-1} MR x,

is preserved. In contrast, the residual is altered for left (MR = I) and symmetric preconditioning, as given by

rL ≡ ML^{-1} b − ML^{-1} Ax = ML^{-1} r.
By default, most KSP implementations use left preconditioning. Some more naturally use other options,
though. For instance, KSPQCG defaults to use symmetric preconditioning and KSPFGMRES uses right pre-
conditioning by default. Right preconditioning can be activated for some methods by using the options
database command -ksp_pc_side right or calling the routine
KSPSetPCSide(ksp,PC_RIGHT);
Attempting to use right preconditioning for a method that does not currently support it results in an error message.
We summarize the defaults for the residuals used in KSP convergence monitoring within KSP Objects. Details
regarding specific convergence tests and monitoring routines are presented in the following sections. The
preconditioned residual is used by default for convergence testing of all left-preconditioned KSP methods.
For the conjugate gradient, Richardson, and Chebyshev methods the true residual can be used by the options
database command -ksp_norm_type unpreconditioned or by calling the routine
KSPSetNormType(ksp,KSP_NORM_UNPRECONDITIONED);
Note: the bi-conjugate gradient method requires application of both the matrix and its transpose plus the
preconditioner and its transpose. Currently not all matrices and preconditioners provide this support and
thus the KSPBICG cannot always be used.
Note: PETSc implements the FETI-DP (Finite Element Tearing and Interconnecting Dual-Primal) method
as an implementation of KSP since it recasts the original problem into a constrained minimization one with
Lagrange multipliers. The only matrix type supported is MATIS. Support for saddle point problems is
provided. See the man page for KSPFETIDP for further details.
Convergence Tests
The default convergence test, KSPConvergedDefault(), is based on the l2 -norm of the residual. Conver-
gence (or divergence) is decided by three quantities: the decrease of the residual norm relative to the norm
of the right hand side, rtol, the absolute size of the residual norm, atol, and the relative increase in the
residual, dtol. Convergence is detected at iteration k if

‖r_k‖_2 < max(rtol · ‖b‖_2, atol),

while divergence is detected if ‖r_k‖_2 > dtol · ‖b‖_2, where r_k denotes the residual at iteration k. These parameters, as well as the maximum number of allowable iterations, can be set with the routine

KSPSetTolerances(KSP ksp,PetscReal rtol,PetscReal atol,PetscReal dtol,PetscInt maxits);
The user can retain the default value of any of these parameters by specifying PETSC_DEFAULT as the
corresponding tolerance; the defaults are rtol=1e-5, atol=1e-50, dtol=1e5, and maxits=1e4. These
parameters can also be set from the options database with the commands -ksp_rtol <rtol>, -ksp_atol
<atol>, -ksp_divtol <dtol>, and -ksp_max_it <its>.
In addition to providing an interface to a simple convergence test, KSP allows the application programmer
the flexibility to provide customized convergence-testing routines. The user can specify a customized routine
with the command

KSPSetConvergenceTest(KSP ksp,PetscErrorCode (*test)(KSP ksp,PetscInt it,PetscReal rnorm,KSPConvergedReason *reason,void *ctx),void *ctx,PetscErrorCode (*destroy)(void *ctx));
The final routine argument, ctx, is an optional context for private data for the user-defined convergence
routine, test. Other test routine arguments are the iteration number, it, and the residual’s l2 norm,
rnorm. The routine for detecting convergence, test, should set reason to positive for convergence,
0 for no convergence, and negative for failure to converge. A full list of possible values is given in the
KSPConvergedReason manual page. You can use KSPGetConvergedReason() after KSPSolve() to
see why convergence/divergence was detected.
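For example, a sketch of a test that declares convergence once the residual norm falls below a tolerance stored in a user context (the names MyTestCtx and my_atol are illustrative) is:

typedef struct { PetscReal my_atol; } MyTestCtx;   /* illustrative context */

PetscErrorCode MyConvergenceTest(KSP ksp,PetscInt it,PetscReal rnorm,KSPConvergedReason *reason,void *ctx)
{
  MyTestCtx *tctx = (MyTestCtx *)ctx;
  PetscFunctionBeginUser;
  if (rnorm < tctx->my_atol) *reason = KSP_CONVERGED_ATOL;
  else if (it > 500)         *reason = KSP_DIVERGED_ITS;
  else                       *reason = KSP_CONVERGED_ITERATING;   /* zero: keep iterating */
  PetscFunctionReturn(PETSC_SUCCESS);
}
/* registered with KSPSetConvergenceTest(ksp,MyConvergenceTest,&tctx,NULL); */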
Convergence Monitoring
By default, the Krylov solvers run silently without displaying information about the iterations. The user can
indicate that the norms of the residuals should be displayed by using -ksp_monitor within the options
database. To display the residual norms in a graphical window (running under X Windows), one should
use -ksp_monitor draw::draw_lg. Application programmers can also provide their own routines to
perform the monitoring by using the command

KSPMonitorSet(KSP ksp,PetscErrorCode (*mon)(KSP ksp,PetscInt it,PetscReal rnorm,void *ctx),void *ctx,PetscErrorCode (*monitordestroy)(void **));
The final routine argument, ctx, is an optional context for private data for the user-defined monitoring rou-
tine, mon. Other mon routine arguments are the iteration number (it) and the residual’s l2 norm (rnorm). A
helpful routine within user-defined monitors is PetscObjectGetComm((PetscObject)ksp,MPI_Comm
*comm), which returns in comm the MPI communicator for the KSP context. See Writing PETSc Programs
for more discussion of the use of MPI communicators within PETSc.
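A sketch of such a monitor that simply prints the iteration number and residual norm on the communicator of the KSP context is:

PetscErrorCode MyMonitor(KSP ksp,PetscInt it,PetscReal rnorm,void *ctx)
{
  MPI_Comm comm;
  PetscFunctionBeginUser;
  PetscCall(PetscObjectGetComm((PetscObject)ksp,&comm));
  PetscCall(PetscPrintf(comm,"  iteration %" PetscInt_FMT ": residual norm %g\n",it,(double)rnorm));
  PetscFunctionReturn(PETSC_SUCCESS);
}
/* registered with KSPMonitorSet(ksp,MyMonitor,NULL,NULL); */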
Several monitoring routines are supplied with PETSc, including KSPMonitorResidual(), KSPMonitorSingularValue(), and KSPMonitorTrueResidual().
The default monitor simply prints an estimate of the l2 -norm of the residual at each iteration. The routine
KSPMonitorSingularValue() is appropriate only for use with the conjugate gradient method or GM-
RES, since it prints estimates of the extreme singular values of the preconditioned operator at each iteration.
Since KSPMonitorTrueResidual() prints the true residual at each iteration by actually computing the
residual using the formula r = b − Ax, the routine is slow and should be used only for testing or convergence
studies, not for timing. These monitors may be accessed with the command line options -ksp_monitor,
-ksp_monitor_singular_value, and -ksp_monitor_true_residual.
To employ the default graphical monitor, one should use the command -ksp_monitor draw::draw_lg.
One can cancel hardwired monitoring routines for KSP at runtime with -ksp_monitor_cancel.
Unless the Krylov method converges so that the residual norm is small, say 10−10 , many of the final digits
printed with the -ksp_monitor option are meaningless. Worse, they are different on different machines, due to different round-off rules used by, say, the IBM RS6000 and the Sun SPARC. This makes testing
between different machines difficult. The option -ksp_monitor_short causes PETSc to print fewer of
the digits of the residual norm as it gets smaller; thus on most of the machines it will always print the same
numbers making cross system testing easier.
Since the convergence of Krylov subspace methods depends strongly on the spectrum (eigenvalues) of the
preconditioned operator, PETSc has specific routines for eigenvalue approximation via the Arnoldi or Lanczos
iteration. First, before the linear solve one must call
KSPSetComputeEigenvalues(ksp,PETSC_TRUE);

Then after the linear solve one calls

KSPComputeEigenvalues(KSP ksp,PetscInt n,PetscReal *realpart,PetscReal *imaginarypart,PetscInt *neig);

Here, n is the size of the two arrays and the eigenvalues are inserted into those two arrays. neig is the
number of eigenvalues computed; this number depends on the size of the Krylov space generated during the
linear system solution; for GMRES it is never larger than the restart parameter. There is an additional routine

KSPComputeEigenvaluesExplicitly(KSP ksp,PetscInt n,PetscReal *realpart,PetscReal *imaginarypart);
that is useful only for very small problems. It explicitly computes the full representation of the preconditioned
operator and calls LAPACK to compute its eigenvalues. It should be only used for matrices of size up to a
couple hundred. The PetscDrawSP*() routines are very useful for drawing scatter plots of the eigenvalues.
The eigenvalues may also be computed and displayed graphically with the options data base
commands -ksp_view_eigenvalues draw and -ksp_view_eigenvalues_explicit draw.
Or they can be dumped to the screen in ASCII text via -ksp_view_eigenvalues and
-ksp_view_eigenvalues_explicit.
Standard Krylov methods require that the preconditioner be a linear operator, thus, for example, a standard
KSP method cannot use a KSP in its preconditioner, as is common in the block Jacobi method PCBJACOBI. Flexible Krylov methods are a subset of methods that allow (with modest additional require-
ments on memory) the preconditioner to be nonlinear. For example, they can be used with the PCKSP
preconditioner. The flexible KSP methods have the label “Flexible” in KSP Objects.
One can use KSPMonitorDynamicTolerance() to control the tolerances used by inner KSP solvers in
PCKSP, PCBJACOBI, and PCDEFLATION.
In addition to supporting PCKSP, the flexible methods support KSP*SetModifyPC(), for example, KSPFGMRESSetModifyPC(). These functions allow the user to provide a callback function that changes the preconditioner at each Krylov iteration; see the corresponding manual pages for the callback's calling sequence.
Standard Krylov methods have one or more global reductions resulting from the computations of inner
products or norms in each iteration. These reductions need to block until all MPI ranks have received the
results. For a large number of MPI ranks (this number is machine dependent but can be above 10,000 ranks)
this synchronization is very time consuming and can significantly slow the computation. Pipelined Krylov
methods overlap the reduction operations with local computations (generally the application of the matrix-
vector products and preconditioners), thus effectively “hiding” the time of the reductions. In addition, they
may reduce the number of global synchronizations by rearranging the computations in a way that some of
them can be collapsed, e.g., two or more calls to MPI_Allreduce() may be combined into one call. The
pipeline KSP methods have the label “Pipeline” in KSP Objects.
Special configuration of MPI may be necessary for reductions to make asynchronous progress, which is
important for performance of pipelined methods. See doc_faq_pipelined for details.
To obtain the solution vector and right hand side from a KSP context, one uses

KSPGetSolution(KSP ksp,Vec *x);
KSPGetRhs(KSP ksp,Vec *rhs);
During the iterative process the solution may not yet have been calculated or it may be stored in a different
location. To access the approximate solution during the iterative process, one uses the command

KSPBuildSolution(KSP ksp,Vec w,Vec *v);
where the solution is returned in v. The user can optionally provide a vector in w as the location to store
the vector; however, if w is NULL, space allocated by PETSc in the KSP context is used. One should not
destroy this vector. For certain KSP methods (e.g., GMRES), the construction of the solution is expensive,
while for many others it doesn’t even require a vector copy.
Access to the residual is done in a similar way with the command

KSPBuildResidual(KSP ksp,Vec t,Vec w,Vec *v);
Again, for GMRES and certain other methods this is an expensive operation.
2.3.4 Preconditioners
As discussed in Preconditioning within KSP, Krylov subspace methods are typically used in conjunction
with a preconditioner. To employ a particular preconditioning method, the user can either select it from the
options database using input of the form -pc_type <methodname> or set the method with the command

PCSetType(PC pc,PCType method);
In PETSc Preconditioners (partial list) we summarize the basic preconditioning methods supported in
PETSc. See the PCType manual page for a complete list. The PCSHELL preconditioner uses a specific,
application-provided preconditioner. The direct preconditioner, PCLU , is, in fact, a direct solver for the lin-
ear system that uses LU factorization. PCLU is included as a preconditioner so that PETSc has a consistent
interface among direct and iterative linear solvers.
Each preconditioner may have associated with it a set of options, which can be set with routines and
options database commands provided for this purpose. Such routine names and commands are all of the
form PC<TYPE><Option> and -pc_<type>_<option> [value]. A complete list can be found by
consulting the PCType manual page; we discuss just a few in the sections below.
When repeatedly solving linear systems with the same KSP context, one can reuse some information computed
during the first linear solve. In particular, PCFactorSetReuseOrdering() causes the ordering (for
example, set with -pc_factor_mat_ordering_type order) computed in the first factorization to be
reused for later factorizations. PCFactorSetUseInPlace() is often used with PCASM or PCBJACOBI
when zero fill is used, since, by reusing the matrix space to store the incomplete factorization, it saves memory and copying time. Note that in-place factorization is not appropriate with any ordering besides natural and
cannot be used with the drop tolerance factorization. These options may be set in the database with
• -pc_factor_levels <levels>
• -pc_factor_reuse_ordering
• -pc_factor_reuse_fill
• -pc_factor_in_place
• -pc_factor_nonzeros_along_diagonal
• -pc_factor_diagonal_fill
See Memory Allocation for Sparse Matrix Factorization for information on preallocation of memory for
anticipated fill during factorization. By alleviating the considerable overhead for dynamic memory allocation,
such tuning can significantly enhance performance.
PETSc supports incomplete factorization preconditioners for several sequential matrix types (for example MATSEQAIJ, MATSEQBAIJ, and MATSEQSBAIJ).
PETSc provides only a sequential SOR preconditioner; it can only be used with sequential matrices or as
the subblock preconditioner when using block Jacobi or ASM preconditioning (see below).
The options for SOR preconditioning with PCSOR are

PCSORSetOmega(PC pc,PetscReal omega);
PCSORSetIterations(PC pc,PetscInt its,PetscInt lits);
PCSORSetSymmetric(PC pc,MatSORType type);
The first of these commands sets the relaxation factor for successive over (under) relaxation. The
second command sets the number of inner iterations its and local iterations lits (the number
of smoothing sweeps on a process before doing a ghost point update from the other processes) to
use between steps of the Krylov space method. The total number of SOR sweeps is given by
its*lits. The third command sets the kind of SOR sweep, where the argument type can be
one of SOR_FORWARD_SWEEP, SOR_BACKWARD_SWEEP or SOR_SYMMETRIC_SWEEP, the default being
SOR_FORWARD_SWEEP. Setting the type to be SOR_SYMMETRIC_SWEEP produces the SSOR method. In
addition, each process can locally and independently perform the specified variant of SOR with the types
SOR_LOCAL_FORWARD_SWEEP, SOR_LOCAL_BACKWARD_SWEEP, and SOR_LOCAL_SYMMETRIC_SWEEP.
These variants can also be set with the options -pc_sor_omega <omega>, -pc_sor_its <its>,
-pc_sor_lits <lits>, -pc_sor_backward, -pc_sor_symmetric, -pc_sor_local_forward,
-pc_sor_local_backward, and -pc_sor_local_symmetric.
The Eisenstat trick [Eis81] for SSOR preconditioning can be employed with the method PCEISEN-
STAT (-pc_type eisenstat). By using both left and right preconditioning of the linear sys-
tem, this variant of SSOR requires about half of the floating-point operations for conventional SSOR.
The option -pc_eisenstat_no_diagonal_scaling (or the routine PCEisenstatSetNoDiago-
nalScaling()) turns off diagonal scaling in conjunction with Eisenstat SSOR method, while the op-
tion -pc_eisenstat_omega <omega> (or the routine PCEisenstatSetOmega(PC pc,PetscReal
omega)) sets the SSOR relaxation coefficient, omega, as discussed above.
LU Factorization
The LU preconditioner provides several options. The first, given by the command

PCFactorSetUseInPlace(PC pc,PetscBool flg);

causes the factorization to be performed in-place and hence destroys the original matrix. The options
database variant of this command is -pc_factor_in_place. Another direct preconditioner option is
selecting the ordering of equations with the command -pc_factor_mat_ordering_type <ordering>.
The possible orderings are
• MATORDERINGNATURAL - Natural
• MATORDERINGND - Nested Dissection
• MATORDERING1WD - One-way Dissection
• MATORDERINGRCM - Reverse Cuthill-McKee
The block Jacobi and overlapping additive Schwarz methods in PETSc are supported in parallel; however,
only the uniprocess version of the block Gauss-Seidel method is currently in place. By default, the PETSc
implementations of these methods employ ILU(0) factorization on each individual block (that is, the default
solver on each subblock is PCType=PCILU, KSPType=KSPPREONLY, or equivalently KSPType=KSPNONE);
the user can set alternative linear solvers via the options -sub_ksp_type and -sub_pc_type. In fact, all
of the KSP and PC options can be applied to the subproblems by inserting the prefix -sub_ at the beginning
of the option name. These options database commands set the particular options for all of the blocks within
the global problem. In addition, the routines

PCBJacobiGetSubKSP(PC pc,PetscInt *n_local,PetscInt *first_local,KSP **subksp);
PCASMGetSubKSP(PC pc,PetscInt *n_local,PetscInt *first_local,KSP **subksp);
extract the KSP context for each local block. The argument n_local is the number of blocks on the calling
process, and first_local indicates the global number of the first block on the process. The blocks are
numbered successively by processes from zero through bg − 1, where bg is the number of global blocks. The
array of KSP contexts for the local blocks is given by subksp. This mechanism enables the user to set
different solvers for the various blocks. To set the appropriate data structures, the user must explicitly call
KSPSetUp() before calling PCBJacobiGetSubKSP() or PCASMGetSubKSP(). For further details, see
KSP Tutorial ex7 or KSP Tutorial ex8.
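For example, a sketch that selects a direct LU solve on every local block (the ksp is assumed to already have its operators set) is:

PC       pc;
KSP     *subksp;
PetscInt nlocal,first,i;
KSPGetPC(ksp,&pc);
PCSetType(pc,PCBJACOBI);
KSPSetUp(ksp);   /* must be called before the sub-KSP contexts can be extracted */
PCBJacobiGetSubKSP(pc,&nlocal,&first,&subksp);
for (i = 0; i < nlocal; i++) {
  PC subpc;
  KSPGetPC(subksp[i],&subpc);
  KSPSetType(subksp[i],KSPPREONLY);
  PCSetType(subpc,PCLU);
}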
The block Jacobi, block Gauss-Seidel, and additive Schwarz preconditioners allow the user to set the num-
ber of blocks into which the problem is divided. The options database commands to set this value are
-pc_bjacobi_blocks n and -pc_bgs_blocks n, and, within a program, the corresponding routines
are

PCBJacobiSetTotalBlocks(PC pc,PetscInt blocks,const PetscInt size[]);
PCASMSetTotalSubdomains(PC pc,PetscInt n,IS is[],IS islocal[]);
The optional argument size is an array indicating the size of each block. Currently, for certain parallel
matrix formats, only a single block per process is supported. However, the MATMPIAIJ and MATMPIBAIJ
formats support the use of general blocks as long as no blocks are shared among processes. The is argument
contains the index sets that define the subdomains.
The object PCASMType is one of PC_ASM_BASIC, PC_ASM_INTERPOLATE, PC_ASM_RESTRICT, or
PC_ASM_NONE and may also be set with the options database -pc_asm_type [basic, interpo-
late, restrict, none]. The type PC_ASM_BASIC (or -pc_asm_type basic) corresponds to the
standard additive Schwarz method that uses the full restriction and interpolation operators. The type
PC_ASM_RESTRICT (or -pc_asm_type restrict) uses a full restriction operator, but during the inter-
polation process ignores the off-process values. Similarly, PC_ASM_INTERPOLATE (or -pc_asm_type in-
terpolate) uses a limited restriction process in conjunction with a full interpolation, while PC_ASM_NONE
(or -pc_asm_type none) ignores off-process values for both restriction and interpolation. The ASM
types with limited restriction or interpolation were suggested by Xiao-Chuan Cai and Marcus Sarkis [CS97].
PC_ASM_RESTRICT is the PETSc default, as it saves substantial communication and for many problems has
the added benefit of requiring fewer iterations for convergence than the standard additive Schwarz method.
The user can also set the number of blocks and sizes on a per-process basis with the commands

PCBJacobiSetLocalBlocks(PC pc,PetscInt blocks,const PetscInt size[]);
PCASMSetLocalSubdomains(PC pc,PetscInt n,IS is[],IS islocal[]);
For the ASM preconditioner one can use the following command to set the overlap to compute in constructing the subdomains.

PCASMSetOverlap(PC pc,PetscInt overlap);
The overlap defaults to 1, so if one desires that no additional overlap be computed beyond what may
have been set with a call to PCASMSetTotalSubdomains() or PCASMSetLocalSubdomains(), then
overlap must be set to be 0. In particular, if one does not explicitly set the subdomains in an application
code, then all overlap would be computed internally by PETSc, and using an overlap of 0 would result in
an ASM variant that is equivalent to the block Jacobi preconditioner. Note that one can define initial index
sets is with any overlap via PCASMSetTotalSubdomains() or PCASMSetLocalSubdomains(); the
routine PCASMSetOverlap() merely allows PETSc to extend that overlap further if desired.
PCGASM is an experimental generalization of PCASM that allows the user to specify subdomains that span
multiple MPI ranks. This can be useful for problems where small subdomains result in poor convergence.
To be effective, the multirank subproblems must be solved using a sufficiently strong subsolver, such as LU,
for which SuperLU_DIST or a similar parallel direct solver could be used; other choices may include a
multigrid solver on the subdomains.
The interface for PCGASM is similar to that of PCASM. In particular, PCGASMType is one of PC_GASM_BASIC,
PC_GASM_INTERPOLATE, PC_GASM_RESTRICT, PC_GASM_NONE. These options have the same meaning
as with PCASM and may also be set with the options database -pc_gasm_type [basic, interpolate,
restrict, none].
Unlike PCASM, however, PCGASM allows the user to define subdomains that span multiple MPI ranks. The
simplest way to do this is using a call to PCGASMSetTotalSubdomains(PC pc,PetscInt N) with
the total number of subdomains N that is smaller than the MPI communicator size. In this case PCGASM
will coalesce size/N consecutive single-rank subdomains into a single multi-rank subdomain. The single-
rank subdomains contain the degrees of freedom corresponding to the locally-owned rows of the PCGASM
preconditioning matrix – these are the subdomains PCASM and PCGASM use by default.
Each of the multirank subdomain subproblems is defined on the subcommunicator that contains the coalesced
PCGASM ranks. In general this might not result in a very good subproblem if the single-rank problems
corresponding to the coalesced ranks are not very strongly connected. In the future this will be addressed
with a hierarchical partitioner that generates well-connected coarse subdomains first before subpartitioning
them into the single-rank subdomains.
In the meantime the user can provide his or her own multi-rank subdomains by calling
PCGASMSetSubdomains(PC,IS[],IS[]) where each of the IS objects on the list defines the inner
(without the overlap) or the outer (including the overlap) subdomain on the subcommunicator of the IS
object. A helper subroutine PCGASMCreateSubdomains2D() is similar to PCASM’s but is capable of con-
structing multi-rank subdomains that can be then used with PCGASMSetSubdomains(). An alternative
way of creating multi-rank subdomains is by using the underlying DM object, if it is capable of generating
such decompositions via DMCreateDomainDecomposition(). Ordinarily, the decomposition specified by the DM is then used.
PETSc has a native algebraic multigrid preconditioner PCGAMG – gamg – and interfaces to three external AMG packages: hypre, ML, and AMGx (CUDA platforms only). These can be downloaded during the configuration phase (e.g., --download-hypre) and used by specifying the corresponding command line parameter (e.g., -pc_type hypre). Hypre is relatively monolithic in that a PETSc matrix is converted into a hypre matrix and then hypre is called to do the entire solve. ML is more modular in that PETSc only has ML generate the coarse grid spaces (columns of the prolongation operator), which is the core of an AMG method, and then a PCMG is constructed with Galerkin coarse grid operator construction. GAMG is designed from the beginning to be modular, to allow new components to be added easily; it also populates a multigrid preconditioner PCMG, so generic multigrid parameters are used. PETSc provides a fully supported (smoothed) aggregation AMG (-pc_type gamg -pc_gamg_type agg or PCSetType(pc,PCGAMG) and PCGAMGSetType(pc,PCGAMGAGG)), as well as reference implementations of a classical AMG method (-pc_gamg_type classical), a hybrid geometric AMG method (-pc_gamg_type geo), and a 2.5D AMG method, DofColumns [ISG15]. GAMG does require the use of (MPI)AIJ matrices; for instance, BAIJ matrices are not supported. One can use AIJ instead of BAIJ without changing any code other than the constructor (or the -mat_type from the command line). For instance, MatSetValuesBlocked() works with AIJ matrices.
GAMG provides unsmoothed aggregation (-pc_gamg_agg_nsmooths 0) and smoothed aggregation
(-pc_gamg_agg_nsmooths 1 or PCGAMGSetNSmooths(pc,1)). Smoothed aggregation (SA) is rec-
ommended for symmetric positive definite systems. Unsmoothed aggregation can be useful for asymmetric problems and for problems where the highest eigenvalue estimates are problematic. If poor convergence rates are observed using the smoothed version, one can test unsmoothed aggregation.
Eigenvalue estimates: The parameters for the KSP eigen estimator, used for SA, can be set with
-pc_gamg_esteig_ksp_max_it and -pc_gamg_esteig_ksp_type. For example, CG generally converges to the highest eigenvalue faster than GMRES (the default for KSP) if your problem is symmetric positive definite. One can specify CG with -pc_gamg_esteig_ksp_type cg. The default for
-pc_gamg_esteig_ksp_max_it is 10, which we have found is pretty safe with a (default) safety fac-
tor of 1.1. One can specify the range of real eigenvalues, in the same way that one can for Chebyshev KSP
solvers (smoothers), with -pc_gamg_eigenvalues <emin,emax>. GAMG sets the MG smoother type
to chebyshev by default. By default, GAMG uses its eigen estimate, if it has one, for Chebyshev smoothers
if the smoother uses Jacobi preconditioning. This can be overridden with -pc_gamg_use_sa_esteig
<true,false>.
AMG methods require knowledge of the number of degrees of freedom per vertex; the default is one (a
scalar problem). Vector problems like elasticity should set the block size of the matrix appropriately with
-mat_block_size bs or MatSetBlockSize(mat,bs). Equations must be ordered in “vertex-major”
ordering (e.g., x1 , y1 , z1 , x2 , y2 , ...).
Near null space: Smoothed aggregation requires an explicit representation of the (near) null space of
the operator for optimal performance. One can provide an orthonormal set of null space vectors with
MatSetNearNullSpace(). The vector of all ones is the default, for each variable given by the block size
(e.g., the translational rigid body modes). For elasticity, where rotational rigid body modes are required to complete the near null space, you can use MatNullSpaceCreateRigidBody() to create the null space
vectors and then MatSetNearNullSpace().
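For example, a sketch of providing the rigid body modes for a 3D elasticity problem (A and coords are the user's already-created operator and nodal-coordinate vector, ordered vertex-major):
Mat          A;        /* assembled elasticity operator */
Vec          coords;   /* nodal coordinates, blocked like the solution */
MatNullSpace nearnull;

MatSetBlockSize(A, 3);                          /* 3 dof per vertex: x, y, z displacements */
MatNullSpaceCreateRigidBody(coords, &nearnull); /* translations plus rotations */
MatSetNearNullSpace(A, nearnull);
MatNullSpaceDestroy(&nearnull);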
Coarse grid data model: The GAMG framework provides for reducing the number of active processes
on coarse grids to reduce communication costs when there is not enough parallelism to keep relative com-
munication costs down. Most AMG solvers reduce to just one active process on the coarsest grid (the PETSc MG framework also supports redundantly solving the coarse grid on all processes to potentially reduce communication costs), although this forcing to one process can be overridden if one wishes to use a parallel coarse grid solver. GAMG generalizes this by reducing the active number of processes on other coarse grids as well. GAMG will select the number of active processes by fitting the desired number of equations per process (set with -pc_gamg_process_eq_limit <50>) at each level, given the size of each level. If Pi < P processes are desired on a level i, then the first Pi ranks are populated with the grid and the remaining are empty on that grid. One can, and probably should, repartition the coarse grids with -pc_gamg_repartition <true>; otherwise an integer process reduction factor (q) is selected and the equations on the first q processes are moved to process 0, and so on. As mentioned, multigrid generally coarsens the problem until it is small enough to be solved with an exact solver (e.g., LU or SVD) in a relatively small time. GAMG will stop coarsening when the number of equations on a grid falls below a threshold given by -pc_gamg_coarse_eq_limit <50>.
Coarse grid parameters: There are several options to provide parameters to the coarsening algorithm and
parallel data layout. Run a code that uses PCGAMG with -help to get a full listing of GAMG parameters with short parameter descriptions. The rate of coarsening is critical in AMG performance – too slow coarsening will result in an overly expensive solver per iteration and too fast coarsening will result in a decrease in the convergence rate. -pc_gamg_threshold <-1> and -pc_gamg_aggressive_coarsening <N> are the primary parameters that control coarsening rates, which is very important for AMG performance. A greedy maximal independent set (MIS) algorithm is used in coarsening. Squaring the graph implements so-called MIS-2, in which the root vertex in an aggregate is more than two edges away from another root vertex, instead of more than one as in MIS. The threshold parameter sets a normalized threshold for which edges are removed from the MIS graph, thereby coarsening more slowly. Zero will keep all non-zero edges, a negative number will
keep zero edges, a positive number will drop small edges. Typical finite threshold values are in the range of
0.01 − 0.05. There are additional parameters for changing the weights on coarse grids.
The parallel MIS algorithms require a symmetric matrix (symmetric weights). Thus PCGAMG will automatically make the graph symmetric if it is not. Since this has additional cost, users should indicate the symmetry of the matrices they provide by calling MatSetOption(mat, MAT_SYMMETRIC, PETSC_TRUE) (or PETSC_FALSE) or MatSetOption(mat, MAT_STRUCTURALLY_SYMMETRIC, PETSC_TRUE) (or PETSC_FALSE). If they know that the matrix will always have symmetry, despite future changes to the matrix (with, for example, MatSetValues()), then they should also call MatSetOption(mat, MAT_SYMMETRY_ETERNAL, PETSC_TRUE) (or PETSC_FALSE) or MatSetOption(mat, MAT_STRUCTURAL_SYMMETRY_ETERNAL, PETSC_TRUE) (or PETSC_FALSE). Using this information allows the algorithm to skip unnecessary computations.
Troubleshooting algebraic multigrid methods: If GAMG, ML, AMGx, or hypre does not perform well,
the first thing to try is one of the other methods. Often the default parameters or just the strengths of different
algorithms can fix performance problems or provide useful information to guide further debugging. There are
several sources of poor performance of AMG solvers and often special purpose methods must be developed to
achieve the full potential of multigrid. To name just a few sources of performance degradation that may not
be fixed with parameters in PETSc currently: non-elliptic operators, curl/curl operators, highly stretched
grids or highly anisotropic problems, large jumps in material coefficients with complex geometry (AMG
is particularly well suited to jumps in coefficients but it is not a perfect solution), highly incompressible
elasticity, not to mention ill-posed problems, and many others. For Grad-Div and Curl-Curl operators,
you may want to try the Auxiliary-space Maxwell Solver (AMS, -pc_type hypre -pc_hypre_type
ams) or the Auxiliary-space Divergence Solver (ADS, -pc_type hypre -pc_hypre_type ads) solvers.
These solvers need some additional information on the underlying mesh; specifically, AMS needs the discrete
gradient operator, which can be specified via PCHYPRESetDiscreteGradient(). In addition to the
discrete gradient, ADS also needs the specification of the discrete curl operator, which can be set using
PCHYPRESetDiscreteCurl().
I am converging slowly, what do I do? AMG methods are sensitive to coarsening rates and methods;
for GAMG use -pc_gamg_threshold <x> or PCGAMGSetThreshold() to regulate coarsening rates,
higher values decrease coarsening rate. Squaring the graph is the second mechanism for increasing coarsening
rate. Use -pc_gamg_aggressive_coarsening <N>, or PCGAMGSetAggressiveLevels(pc,N), to
aggressively coarsen (MIS-2) the graph on the finest N levels. A high threshold (e.g., x = 0.08) will result
in an expensive but potentially powerful preconditioner, and a low threshold (e.g., x = 0.0) will result in
faster coarsening, fewer levels, cheaper solves, and generally worse convergence rates.
One can run with -info and grep for “GAMG” to get some statistics on each level, which can be used to
see if you are coarsening at an appropriate rate. With smoothed aggregation you generally want to coarsen at about a rate of 3:1 in each dimension. Coarsening too slowly will result in large numbers of non-zeros per row on coarse grids (this is reported). The number of non-zeros can go up very high, say about 300
(times the degrees-of-freedom per vertex) on a 3D hex mesh. One can also look at the grid complexity,
which is also reported (the ratio of the total number of matrix entries for all levels to the number of matrix
entries on the fine level). Grid complexity should be well under 2.0 and preferably around 1.3 or lower. If
convergence is poor and the time for the Galerkin coarse grid construction is much smaller than the time for each solve,
then one can safely decrease the coarsening rate. -pc_gamg_threshold −1.0 is the simplest and most
robust option, and is recommended if poor convergence rates are observed, at least until the source of the
problem is discovered. In conclusion, if convergence is slow then decreasing the coarsening rate (increasing
the threshold) should be tried.
A note on Chebyshev smoothers. Chebyshev solvers are attractive as multigrid smoothers because they
can target a specific interval of the spectrum which is the purpose of a smoother. The spectral bounds for
Chebyshev solvers are simple to compute because they rely on the highest eigenvalue of your (diagonally
preconditioned) operator, which is conceptually simple to compute. However, if this highest eigenvalue
estimate is not accurate (too low), then the solvers can fail with an indefinite preconditioner message.
One can run with -info and grep for “GAMG” to get these estimates or use -ksp_view. These highest
eigenvalues are generally between 1.5-3.0. For symmetric positive definite systems CG is a better eigenvalue
estimator: -mg_levels_esteig_ksp_type cg. Indefinite matrix messages are often caused by bad eigenvalue
estimates. Explicitly damped Jacobi or Krylov smoothers can provide an alternative to Chebyshev and hypre
has alternative smoothers.
Now am I solving alright, can I expect better? If you find that you are getting nearly one digit in
reduction of the residual per iteration and are using a modest number of point smoothing steps (e.g., 1-4
iterations of SOR), then you may be fairly close to textbook multigrid efficiency, although you also need to check the setup costs. This can be determined by running with -log_view and checking that the time
for the Galerkin coarse grid construction (MatPtAP) is not (much) more than the time spent in each solve
(KSPSolve). If the MatPtAP time is too large then one can increase the coarsening rate by decreasing
the threshold and using aggressive coarsening (-pc_gamg_aggressive_coarsening <N>, squares the
graph on the finest N levels). Likewise if your MatPtAP time is small and your convergence rate is not ideal
then you could decrease the coarsening rate.
PETSc’s AMG solver is constructed as a framework for developers to easily add AMG capabilities, like a new AMG method or an AMG component such as a matrix triple product. Contact us directly if you are interested
in contributing.
Adaptive Interpolation
Interpolation transfers a function from the coarse space to the fine space. We would like this process to
be accurate for the functions resolved by the coarse grid, in particular the approximate solution computed
there. By default, we create these matrices using local interpolation of the fine grid dual basis functions in the
coarse basis. However, an adaptive procedure can optimize the coefficients of the interpolator to reproduce
pairs of coarse/fine functions which should approximate the lowest modes of the generalized eigenproblem
Ax = λM x
where A is the system matrix and M is the smoother. Note that for defect-correction MG, the interpolated
solution from the coarse space need not be as accurate as the fine solution, for the same reason that updates
in iterative refinement can be less accurate. However, in FAS or in the final interpolation step for each level
of Full Multigrid, we must have interpolation as accurate as the fine solution since we are moving the entire
solution itself.
Injection should accurately transfer the fine solution to the coarse grid. Accuracy here means that the
action of a coarse dual function on either should produce approximately the same result. In the structured
grid case, this means that we just use the same values on coarse points. This can result in aliasing.
Restriction is intended to transfer the fine residual to the coarse space. Here we use averaging (often the
transpose of the interpolation operation) to damp out the fine space contributions. Thus, it is less accurate
than injection, but avoids aliasing of the high modes.
For a multigrid cycle, the interpolator P is intended to accurately reproduce “smooth” functions from the
coarse space in the fine space, keeping the energy of the interpolant about the same. For the Laplacian on
a structured mesh, it is easy to determine what these low-frequency functions are. They are the Fourier
modes. However an arbitrary operator A will have different coarse modes that we want to resolve accurately
on the fine grid, so that our coarse solve produces a good guess for the fine problem. How do we make sure
that our interpolator P can do this?
We first must decide what we mean by accurate interpolation of some functions. Suppose we know the
continuum function f that we care about, and we are only interested in a finite element description of
discrete functions. Then the coarse function representing f is given by
f^C = \sum_i f_i^C \varphi_i^C,
Now we would like the interpolant of the coarse representer to the fine grid to be as close as possible to the
fine representer in a least squares sense, meaning we want to solve the minimization problem
\min_P \| f^F - P f^C \|_2 .

Applying a fine-space dual basis function \varphi_i^F to the residual of this fit gives, for row i of P,

\varphi_i^F f^F - \varphi_i^F P f^C = f_i^F - \sum_j P_{ij} f_j^C ,
and we will treat each row of the interpolator as a separate optimization problem. We could allow an arbitrary
sparsity pattern, or try to determine adaptively, as is done in sparse approximate inverse preconditioning.
However, we know the supports of the basis functions in finite elements, and thus the naive sparsity pattern
from local interpolation can be used.
We note here that the BAMG framework of Brannick et al. [BBKL11] does not use fine and coarse function
spaces, but rather a fine point/coarse point division which we will not employ here. Our general PETSc
routine should work for both since the input would be the checking set (fine basis coefficients or fine space
points) and the approximation set (coarse basis coefficients in the support or coarse points in the sparsity
pattern).
We can easily solve the above problem using QR factorization. However, there are many smooth functions
from the coarse space that we want interpolated accurately, and a single f would not constrain the values
Pij well. Therefore, we will use several functions {fk } in our minimization,
\min_{P_{ij}} \sum_k w_k \Big\| f_i^{F,k} - \sum_j P_{ij} f_j^{C,k} \Big\|_2
= \min_{P_{ij}} \sum_k \Big\| \sqrt{w_k}\, f_i^{F,k} - \sqrt{w_k} \sum_j P_{ij} f_j^{C,k} \Big\|_2
= \min_{P_{ij}} \big\| W^{1/2} \mathbf{f}_i^F - W^{1/2} f^C p_i \big\|_2
where

W = \begin{bmatrix} w_0 & & \\ & \ddots & \\ & & w_K \end{bmatrix}, \quad
\mathbf{f}_i^F = \begin{bmatrix} f_i^{F,0} \\ \vdots \\ f_i^{F,K} \end{bmatrix}, \quad
f^C = \begin{bmatrix} f_0^{C,0} & \cdots & f_n^{C,0} \\ \vdots & \ddots & \vdots \\ f_0^{C,K} & \cdots & f_n^{C,K} \end{bmatrix}, \quad
p_i = \begin{bmatrix} P_{i0} \\ \vdots \\ P_{in} \end{bmatrix},

or alternatively

[W]_{kk} = w_k, \quad [\mathbf{f}_i^F]_k = f_i^{F,k}, \quad [f^C]_{kj} = f_j^{C,k}, \quad [p_i]_j = P_{ij}.
We thus have a standard least-squares problem
\min_{p_i} \| b - A x \|_2

where

A = W^{1/2} f^C, \quad b = W^{1/2} \mathbf{f}_i^F, \quad x = p_i.
\frac{|\lambda_l - \lambda_{l+1}|}{|\lambda_l|} .

A large value of this relative change in the eigenvalue estimates indicates that the generalized eigenvector associated with that eigenvalue was not adequately represented by P^l_{l+1}, and the interpolator should be recomputed.
PETSc provides the Balancing Domain Decomposition by Constraints (BDDC) method for precondition-
ing parallel finite element problems stored in unassembled format (see MATIS). BDDC is a 2-level non-
overlapping domain decomposition method which can be easily adapted to different problems and discretiza-
tions by means of a few user customizations. The application of the preconditioner to a vector consists of
the static condensation of the residual at the interior of the subdomains by means of local Dirichlet solves,
followed by an additive combination of Neumann local corrections and the solution of a global coupled coarse
problem. Command line options for the underlying KSP objects are prefixed by -pc_bddc_dirichlet,
-pc_bddc_neumann, and -pc_bddc_coarse respectively.
The current implementation supports any kind of linear system, and assumes a one-to-one mapping between
subdomains and MPI processes. Complex numbers are supported as well. For non-symmetric problems, use
the runtime option -pc_bddc_symmetric 0.
Unlike conventional non-overlapping methods, which iterate just on the degrees of freedom at the interface between subdomains, PCBDDC iterates on the whole set of degrees of freedom, allowing the
use of approximate subdomain solvers. When using approximate solvers, the command line switches
-pc_bddc_dirichlet_approximate and/or -pc_bddc_neumann_approximate should be used to
inform PCBDDC. If any of the local problems is singular, the nullspace of the local operator should be attached
to the local matrix via MatSetNullSpace().
At the basis of the method is the analysis of the connected components of the interface for the detection of equivalence classes of vertices, edges, and faces. Additional information on the degrees of freedom can be
supplied to PCBDDC by using the following functions:
• PCBDDCSetDofsSplitting()
• PCBDDCSetLocalAdjacencyGraph()
• PCBDDCSetPrimalVerticesLocalIS()
• PCBDDCSetNeumannBoundaries()
• PCBDDCSetDirichletBoundaries()
• PCBDDCSetNeumannBoundariesLocal()
• PCBDDCSetDirichletBoundariesLocal()
Crucial for the convergence of the iterative process is the specification of the primal constraints to be im-
posed at the interface between subdomains. PCBDDC uses by default vertex continuities and edge arithmetic
averages, which are enough for the three-dimensional Poisson problem with constant coefficients. The user
can switch on and off the usage of vertices, edges or face constraints by using the command line switches
-pc_bddc_use_vertices, -pc_bddc_use_edges, -pc_bddc_use_faces. A customization of the
constraints is available by attaching a MatNullSpace object to the preconditioning matrix via MatSetNearNullSpace(). The vectors of the MatNullSpace object should represent the constraints in the form
of quadrature rules; quadrature rules for different classes of the interface can be listed in the same vector.
The number of vectors of the MatNullSpace object corresponds to the maximum number of constraints
that can be imposed for each class. Once all the quadrature rules for a given interface class have been
extracted, an SVD operation is performed to retain the non-singular modes. As an example, the rigid body
modes represent an effective choice for elasticity, even in the almost incompressible case. For particular
problems, e.g., edge-based discretizations with Nédélec elements, a user-defined change of basis of the degrees
of freedom can be beneficial for PCBDDC; use PCBDDCSetChangeOfBasisMat() to customize the change
of basis.
The BDDC method is usually robust with respect to jumps in the material parameters aligned with the
interface; for PDEs with more than one material parameter you may also consider using the so-called deluxe
scaling, available via the command line switch -pc_bddc_use_deluxe_scaling. Other scalings are
available, see PCISSetSubdomainScalingFactor(), PCISSetSubdomainDiagonalScaling() or
PCISSetUseStiffnessScaling(). However, the convergence properties of the BDDC method degrade in the presence of large jumps in the material coefficients not aligned with the interface; for such cases, PETSc
has the capability of adaptively computing the primal constraints. Adaptive selection of constraints can be requested by specifying a threshold value at the command line using -pc_bddc_adaptive_threshold x. Valid values for the threshold x range from 1 to infinity, with smaller values corresponding to more robust
preconditioners. For SPD problems in 2D, or in 3D with only face degrees of freedom (like in the case of
Raviart-Thomas or Brezzi-Douglas-Marini elements), such a threshold is a very accurate estimator of the
condition number of the resulting preconditioned operator. Since the adaptive selection of constraints for
BDDC methods is still an active topic of research, its implementation is currently limited to SPD problems;
moreover, because the technique requires the explicit knowledge of the local Schur complements, it needs
the external package MUMPS.
When solving problems decomposed in thousands of subdomains or more, the solution of the BDDC coarse
problem could become a bottleneck; in order to overcome this issue, the user could either consider solving the parallel coarse problem on a subset of the communicator associated with PCBDDC by using the command line switch -pc_bddc_coarse_redistribute, or instead use a multilevel approach. The latter can be requested by specifying the number of requested levels at the command line (-pc_bddc_levels) or by
using PCBDDCSetLevels(). An additional parameter (see PCBDDCSetCoarseningRatio()) controls
the number of subdomains that will be generated at the next level; the larger the coarsening ratio, the lower
the number of coarser subdomains.
For further details, see the example KSP Tutorial ex59 and the online documentation for PCBDDC.
Shell Preconditioners
The shell preconditioner simply uses an application-provided routine to implement the preconditioner. To
set this routine, one uses the command
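sketched here (apply is the name of the user's own routine):
PCSetType(pc, PCSHELL);
PCShellSetApply(pc, apply);   /* PetscErrorCode apply(PC pc, Vec xin, Vec xout) */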
Often a preconditioner needs access to an application-provided data structure. For this, one should use PCShellSetContext() to set the data structure and PCShellGetContext() to retrieve it inside apply(). The three routine arguments of apply() are the PC, the input vector, and the output vector, respectively.
For a preconditioner that requires some sort of “setup” before being used, one can provide a routine that is called every time the operator is changed (usually via KSPSetOperators()); it is set with PCShellSetSetUp(). The argument to the setup routine is the same PC object, which can be used to obtain the operators with PCGetOperators() and the application-provided data structure that was set with PCShellSetContext().
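A minimal sketch of a complete shell preconditioner is given below; the names MyCtx, MyApply, and MySetUp are the user's own (not part of PETSc), and this particular shell simply applies the inverse diagonal of the preconditioning matrix:
typedef struct {
  Vec idiag; /* created by the user; holds the inverse of the diagonal of Pmat */
} MyCtx;

PetscErrorCode MySetUp(PC pc)
{
  MyCtx *ctx;
  Mat    Amat, Pmat;
  PetscFunctionBeginUser;
  PetscCall(PCShellGetContext(pc, &ctx));
  PetscCall(PCGetOperators(pc, &Amat, &Pmat));
  PetscCall(MatGetDiagonal(Pmat, ctx->idiag));
  PetscCall(VecReciprocal(ctx->idiag));
  PetscFunctionReturn(PETSC_SUCCESS);
}

PetscErrorCode MyApply(PC pc, Vec x, Vec y)
{
  MyCtx *ctx;
  PetscFunctionBeginUser;
  PetscCall(PCShellGetContext(pc, &ctx));
  PetscCall(VecPointwiseMult(y, ctx->idiag, x)); /* y = D^{-1} x */
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* in the application code, after KSPGetPC(ksp,&pc) and creating ctx.idiag: */
PCSetType(pc, PCSHELL);
PCShellSetContext(pc, &ctx);
PCShellSetSetUp(pc, MySetUp);
PCShellSetApply(pc, MyApply);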
Combining Preconditioners
The PC type PCCOMPOSITE allows one to form new preconditioners by combining already-defined precondi-
tioners and solvers. Combining preconditioners usually requires some experimentation to find a combination
of preconditioners that works better than any single method. It is a tricky business and is not recommended
until your application code is complete and running and you are trying to improve performance. In many
cases using a single preconditioner is better than a combination; an exception is the multigrid/multilevel
preconditioners (solvers) that are always combinations of some sort, see Multigrid Preconditioners.
Let B1 and B2 represent the application of two preconditioners of type type1 and type2. The precondi-
tioner B = B1 + B2 can be obtained with
PCSetType(pc,PCCOMPOSITE);
PCCompositeAddPCType(pc,type1);
PCCompositeAddPCType(pc,type2);
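The two preconditioners are combined additively by default; the multiplicative variant described next can be selected with PCCompositeSetType() (or -pc_composite_type multiplicative), for example:
PCCompositeSetType(pc, PC_COMPOSITE_MULTIPLICATIVE);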
In this form the new residual is updated after the application of each preconditioner, and the next preconditioner is applied to this updated residual. For example, with two composed preconditioners, B1 and B2, y = Bx
is obtained from
y = B1 x
w1 = x − Ay
y = y + B2 w1
Loosely, this corresponds to a Gauss-Seidel iteration, while additive corresponds to a Jacobi iteration.
Under most circumstances, the multiplicative form requires one-half the number of iterations as the additive
form; however, the multiplicative form does require the application of A inside the preconditioner.
In the multiplicative version, the calculation of the residual inside the preconditioner can be done in two
ways: using the original linear system matrix or using the matrix used to build the preconditioners B1 , B2 ,
etc. By default it uses the “preconditioner matrix”, to use the Amat matrix use the option
PCSetUseAmat(PC pc);
The individual preconditioners can be accessed (in order to set options) via
PCCompositeGetPC(PC pc,PetscInt count,PC *subpc);
One can also change the operator that is used to construct a particular PC in the composite PC by calling PCSetOperators() on the obtained PC.
These various options can also be set via the options database. For example, -pc_type composite
-pc_composite_pcs jacobi,ilu causes the composite preconditioner to be used with two precondition-
ers: Jacobi and ILU. The option -pc_composite_type multiplicative initiates the multiplicative ver-
sion of the algorithm, while -pc_composite_type additive the additive version. Using the Amat matrix
is obtained with the option -pc_use_amat. One sets options for the sub-preconditioners with the extra pre-
fix -sub_N_ where N is the number of the sub-preconditioner. For example, -sub_0_pc_ifactor_fill
0.
PETSc also allows a preconditioner to be a complete linear solver. This is achieved with the PCKSP type.
PCSetType(pc,PCKSP);
PCKSPGetKSP(pc,&ksp);
/* set any KSP/PC options */
From the command line one can use 5 iterations of BiCGStab with ILU(0) preconditioning as the precondi-
tioner with -pc_type ksp -ksp_pc_type ilu -ksp_ksp_max_it 5 -ksp_ksp_type bcgs.
By default the inner KSP solver uses the outer preconditioner matrix, Pmat, as the matrix to be solved in
the linear system; to use the matrix that defines the linear system, Amat use the option
PCSetUseAmat(PC pc);
Multigrid Preconditioners
A large suite of routines is available for using geometric multigrid as a preconditioner. In the PC framework,
the user is required to provide the coarse grid solver, smoothers, restriction and interpolation operators, and
code to calculate residuals. The PC package allows these components to be encapsulated within a PETSc-
compliant preconditioner. We fully support both matrix-free and matrix-based multigrid solvers.
A multigrid preconditioner is created with the four commands
KSPCreate(MPI_Comm comm,KSP *ksp);
KSPGetPC(KSP ksp,PC *pc);
PCSetType(PC pc,PCMG);
PCMGSetLevels(pc,PetscInt levels,MPI_Comm *comms);
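The form of the multigrid and the cycle type can then be chosen; a sketch using the PCMGSetType() and PCMGSetCycleType() interfaces (consult the manual pages for exact prototypes):
PCMGSetType(PC pc, PCMGType form);            /* PC_MG_MULTIPLICATIVE, PC_MG_ADDITIVE, PC_MG_FULL, or PC_MG_KASKADE */
PCMGSetCycleType(PC pc, PCMGCycleType ctype); /* PC_MG_CYCLE_V (default) or PC_MG_CYCLE_W */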
with a value of PC_MG_CYCLE_W for ctype selecting a W-cycle. The commands above can also be set from the options database. The option names are -pc_mg_type [multiplicative, additive, full, kaskade] and -pc_mg_cycle_type <ctype>.
The user can control the amount of smoothing by configuring the solvers on the levels. By default, the up
and down smoothers are identical. If separate configuration of up and down smooths is required, it can be
requested with the option -pc_mg_distinct_smoothup or the routine
PCMGSetDistinctSmoothUp(PC pc);
The multigrid routines, which determine the solvers and interpolation/restriction operators that are used, are mandatory. To set the coarse grid solver, one must call PCMGGetCoarseSolve(pc,&ksp) and set the appropriate options in ksp. Similarly, the smoothers are controlled by first calling PCMGGetSmoother(pc,level,&ksp) and then setting the various options in ksp. For example,
PCMGGetSmoother(pc,1,&ksp);
KSPSetOperators(ksp,A1,A1);
sets the matrix that defines the smoother on level 1 of the multigrid. While
PCMGGetSmoother(pc,1,&ksp);
KSPGetPC(ksp,&pc);
PCSetType(pc,PCSOR);
sets SOR as the smoother on level 1. Use PCMGSetInterpolation() and PCMGSetRestriction() to define the intergrid transfer operations. If only one of these is set, its transpose will be used for the other.
It is possible for these interpolation operations to be matrix-free (see Matrix-Free Matrices); one should
then make sure that these operations are defined for the (matrix-free) matrices passed in. Note that this
system is arranged so that if the interpolation is the transpose of the restriction, you can pass the same mat
argument to both PCMGSetRestriction() and PCMGSetInterpolation().
On each level except the coarsest, one must also set the routine to compute the residual. The following
command suffices:
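A sketch of the call, assuming the standard PCMGSetResidual() prototype:
PCMGSetResidual(PC pc, PetscInt level, PetscErrorCode (*residual)(Mat, Vec, Vec, Vec), Mat mat);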
The residual() function normally does not need to be set if one’s operator is stored in Mat format. In
certain circumstances, where it is much cheaper to calculate the residual directly, rather than through the
usual formula b − Ax, the user may wish to provide an alternative.
Finally, the user may provide three work vectors for each level (except on the finest, where only the residual
work vector is required). The work vectors are set with the commands
PCMGSetRhs(PC pc,PetscInt level,Vec b);
PCMGSetX(PC pc,PetscInt level,Vec x);
PCMGSetR(PC pc,PetscInt level,Vec r);
The PC references these vectors, so you should call VecDestroy() when you are finished with them. If
any of these vectors are not provided, the preconditioner will allocate them.
One can control the KSP and PC options used on the various levels (as well as the coarse grid) using the
prefix mg_levels_ (mg_coarse_ for the coarse grid). For example, -mg_levels_ksp_type cg will
cause the CG method to be used as the Krylov method for each level. Or -mg_levels_pc_type ilu
-mg_levels_pc_factor_levels 2 will cause the ILU preconditioner to be used on each level with two
levels of fill in the incomplete factorization.
Block matrices represent an important class of problems in numerical linear algebra and offer the possibility
of far more efficient iterative solvers than just treating the entire matrix as a black box. In this section we use the common linear algebra definition of block matrices, where matrices are divided into a small, problem-size
independent (two, three or so) number of very large blocks. These blocks arise naturally from the underlying
physics or discretization of the problem, for example, the velocity and pressure. Under a certain numbering
of unknowns the matrix can be written as
\begin{bmatrix} A_{00} & A_{01} & A_{02} & A_{03} \\ A_{10} & A_{11} & A_{12} & A_{13} \\ A_{20} & A_{21} & A_{22} & A_{23} \\ A_{30} & A_{31} & A_{32} & A_{33} \end{bmatrix},
where each Aij is an entire block. On a parallel computer the matrices are not explicitly stored this way.
Instead, each process will own some of the rows of A0∗ , A1∗ etc. On a process, the blocks may be stored one
block followed by another
\begin{bmatrix}
A_{0000} & A_{0001} & A_{0002} & \cdots & A_{0100} & A_{0102} & \cdots \\
A_{0010} & A_{0011} & A_{0012} & \cdots & A_{0110} & A_{0112} & \cdots \\
A_{0020} & A_{0021} & A_{0022} & \cdots & A_{0120} & A_{0122} & \cdots \\
\vdots & & & & & & \\
A_{1000} & A_{1001} & A_{1002} & \cdots & A_{1100} & A_{1102} & \cdots \\
A_{1010} & A_{1011} & A_{1012} & \cdots & A_{1110} & A_{1112} & \cdots \\
\vdots & & & & & &
\end{bmatrix}
or interlaced, for example with two blocks
\begin{bmatrix}
A_{0000} & A_{0100} & A_{0001} & A_{0101} & \cdots \\
A_{1000} & A_{1100} & A_{1001} & A_{1101} & \cdots \\
\vdots & & & & \\
A_{0010} & A_{0110} & A_{0011} & A_{0111} & \cdots \\
A_{1010} & A_{1110} & A_{1011} & A_{1111} & \cdots \\
\vdots & & & &
\end{bmatrix}.
Note that for interlaced storage the number of rows/columns of each block must be the same size. Matrices
obtained with DMCreateMatrix() where the DM is a DMDA are always stored interlaced. Block matrices
can also be stored using the MATNEST format which holds separate assembled blocks. Each of these nested
matrices is itself distributed in parallel. It is more efficient to use MATNEST with the methods described in
this section because there are fewer copies and better formats (e.g. BAIJ or SBAIJ) can be used for the
components, but it is not possible to use many other methods with MATNEST. See Block Matrices for more
on assembling block matrices without depending on a specific matrix format.
The PETSc PCFIELDSPLIT preconditioner is used to implement the “block” solvers in PETSc. There
are three ways to provide the information that defines the blocks. If the matrices are stored as interlaced
then PCFieldSplitSetFields() can be called repeatedly to indicate which fields belong to each block.
More generally PCFieldSplitSetIS() can be used to indicate exactly which rows/columns of the matrix
belong to a particular block. You can provide names for each block with these routines, if you do not provide
names they are numbered from 0. With these two approaches the blocks may overlap (though generally they
will not). If only one block is defined then the complement of the matrices is used to define the other block.
Finally the option -pc_fieldsplit_detect_saddle_point causes two diagonal blocks to be found,
one associated with all rows/columns that have zeros on the diagonals and the rest.
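For example, a sketch of defining two named blocks from index sets (the split names and the IS objects is_velocity and is_pressure are the user's own):
PC pc;
IS is_velocity, is_pressure; /* index sets listing the rows/columns of each block */

KSPGetPC(ksp, &pc);
PCSetType(pc, PCFIELDSPLIT);
PCFieldSplitSetIS(pc, "velocity", is_velocity);
PCFieldSplitSetIS(pc, "pressure", is_pressure);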
For simplicity in the rest of the section we restrict our matrices to two by two blocks. So the matrix is
\begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix}.
On occasion the user may provide another matrix that is used to construct parts of the preconditioner
\begin{bmatrix} Ap_{00} & Ap_{01} \\ Ap_{10} & Ap_{11} \end{bmatrix}.
For notational simplicity define ksp(A, Ap) to mean approximately solving a linear system using KSP with
operator A and preconditioner built from matrix Ap.
For matrices defined with any number of blocks there are three “block” algorithms available: block Jacobi,
\begin{bmatrix} \mathrm{ksp}(A_{00}, Ap_{00}) & 0 \\ 0 & \mathrm{ksp}(A_{11}, Ap_{11}) \end{bmatrix}

block Gauss-Seidel,

\begin{bmatrix} I & 0 \\ 0 & A_{11}^{-1} \end{bmatrix}
\begin{bmatrix} I & 0 \\ -A_{10} & I \end{bmatrix}
\begin{bmatrix} A_{00}^{-1} & 0 \\ 0 & I \end{bmatrix}

which is implemented as

\begin{bmatrix} I & 0 \\ 0 & \mathrm{ksp}(A_{11}, Ap_{11}) \end{bmatrix}
\left[
\begin{bmatrix} 0 & 0 \\ 0 & I \end{bmatrix}
+
\begin{bmatrix} I & 0 \\ -A_{10} & -A_{11} \end{bmatrix}
\begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix}
\right]
\begin{bmatrix} \mathrm{ksp}(A_{00}, Ap_{00}) & 0 \\ 0 & I \end{bmatrix}
By default these block algorithms use the blocks of Pmat; however, it is possible to use Amat instead of Pmat by calling PCSetUseAmat(pc) or using -pc_use_amat
on the command line. Alternatively, you can have PCFieldSplit extract the diagonal blocks A00 , A11 etc.
out of Amat by calling PCFieldSplitSetDiagUseAmat(pc,PETSC_TRUE) or supplying command-line
argument -pc_fieldsplit_diag_use_amat. Similarly, PCFieldSplitSetOffDiagUseAmat(pc,
{PETSC_TRUE) or -pc_fieldsplit_off_diag_use_amat will cause the off-diagonal blocks A01 , A10
etc. to be extracted out of Amat.
For two by two blocks only, there is another family of solvers, based on Schur complements. The inverse of
the Schur complement factorization is
\left(
\begin{bmatrix} I & 0 \\ A_{10} A_{00}^{-1} & I \end{bmatrix}
\begin{bmatrix} A_{00} & 0 \\ 0 & S \end{bmatrix}
\begin{bmatrix} I & A_{00}^{-1} A_{01} \\ 0 & I \end{bmatrix}
\right)^{-1}
=
\begin{bmatrix} I & A_{00}^{-1} A_{01} \\ 0 & I \end{bmatrix}^{-1}
\begin{bmatrix} A_{00}^{-1} & 0 \\ 0 & S^{-1} \end{bmatrix}
\begin{bmatrix} I & 0 \\ A_{10} A_{00}^{-1} & I \end{bmatrix}^{-1}
=
\begin{bmatrix} I & -A_{00}^{-1} A_{01} \\ 0 & I \end{bmatrix}
\begin{bmatrix} A_{00}^{-1} & 0 \\ 0 & S^{-1} \end{bmatrix}
\begin{bmatrix} I & 0 \\ -A_{10} A_{00}^{-1} & I \end{bmatrix}
=
\begin{bmatrix} A_{00}^{-1} & 0 \\ 0 & I \end{bmatrix}
\begin{bmatrix} I & -A_{01} \\ 0 & I \end{bmatrix}
\begin{bmatrix} A_{00} & 0 \\ 0 & S^{-1} \end{bmatrix}
\begin{bmatrix} I & 0 \\ -A_{10} & I \end{bmatrix}
\begin{bmatrix} A_{00}^{-1} & 0 \\ 0 & I \end{bmatrix}.
The preconditioner is accessed with -pc_fieldsplit_type schur and is implemented as
\begin{bmatrix} \mathrm{ksp}(A_{00}, Ap_{00}) & 0 \\ 0 & I \end{bmatrix}
\begin{bmatrix} I & -A_{01} \\ 0 & I \end{bmatrix}
\begin{bmatrix} I & 0 \\ 0 & \mathrm{ksp}(\hat{S}, \hat{S}p) \end{bmatrix}
\begin{bmatrix} I & 0 \\ -A_{10}\,\mathrm{ksp}(A_{00}, Ap_{00}) & I \end{bmatrix}.
where Ŝ = A11 − A10 ksp(A00, Ap00) A01 is the approximate Schur complement.
There are several variants of the Schur complement preconditioner obtained by dropping some of the terms; these can be obtained with -pc_fieldsplit_schur_fact_type <diag,lower,upper,full> or the
function PCFieldSplitSetSchurFactType(). Note that the diag form uses the preconditioner
\begin{bmatrix} \mathrm{ksp}(A_{00}, Ap_{00}) & 0 \\ 0 & -\mathrm{ksp}(\hat{S}, \hat{S}p) \end{bmatrix}.
This is done to ensure the preconditioner is positive definite for a common class of problems, saddle points
with a positive definite A00 : for these the Schur complement is negative definite.
The effectiveness of the Schur complement preconditioner depends on the availability of a good precon-
ditioner Ŝp for the Schur complement matrix. In general, you are responsible for supplying Ŝp via
PCFieldSplitSchurPrecondition(pc,PC_FIELDSPLIT_SCHUR_PRE_USER,Sp). In the absence
of a good problem-specific Ŝp, you can use some of the built-in options.
Using -pc_fieldsplit_schur_precondition user on the command line activates the matrix sup-
plied programmatically as explained above.
With -pc_fieldsplit_schur_precondition a11 (default) Ŝp = A11 is used to build a preconditioner
for Ŝ.
Otherwise, -pc_fieldsplit_schur_precondition self will set Ŝp = Ŝ and use the Schur comple-
ment matrix itself to build the preconditioner.
The problem with the last approach is that Ŝ is used in unassembled, matrix-free form, and many precondi-
tioners (e.g., ILU) cannot be built out of such matrices. Instead, you can assemble an approximation to Ŝ by
inverting A00 , but only approximately, so as to ensure the sparsity of Ŝp as much as possible. Specifically,
using -pc_fieldsplit_schur_precondition selfp will assemble Ŝp = A11 − A10 inv(A00 )A01 .
By default inv(A00 ) is the inverse of the diagonal of A00 , but using
-fieldsplit_1_mat_schur_complement_ainv_type lump will lump A00 first. Using the PCLSC (least squares commutator) preconditioner for the Schur complement block applies

\mathrm{ksp}(A_{10} A_{01}, A_{10} A_{01})\, A_{10} A_{00} A_{01}\, \mathrm{ksp}(A_{10} A_{01}, A_{10} A_{01}),
which, of course, introduces two additional inner solves for each application of the Schur complement. The
options prefix for this inner KSP is -fieldsplit_1_lsc_. Instead of constructing the matrix A10 A01 the
user can provide their own matrix. This is done by attaching the matrix/matrices to the Sp matrix they
provide with
PetscObjectCompose((PetscObject)Sp,"LSC_L",(PetscObject)L);
PetscObjectCompose((PetscObject)Sp,"LSC_Lp",(PetscObject)Lp);
Sometimes one is required to solve singular linear systems. In this case, the system matrix has a nontrivial
null space. For example, the discretization of the Laplacian operator with Neumann boundary conditions
has a null space of the constant functions. PETSc has tools to help solve these systems. This approach is
only guaranteed to work for left preconditioning (see KSPSetPCSide()); for example it may not work in
some situations with KSPFGMRES.
First, one must know what the null space is and store it using an orthonormal basis in an array of PETSc
Vecs. The constant functions can be handled separately, since they are such a common case. Create a
MatNullSpace object with the command
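sketched here (argument names follow the discussion below; consult the manual page for the exact prototype):
MatNullSpaceCreate(MPI_Comm comm, PetscBool hasconstants, PetscInt dim, const Vec basis[], MatNullSpace *nullsp);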
Here, dim is the number of vectors in basis and hasconstants indicates if the null space contains the
constant functions. If the null space contains the constant functions you do not need to include it in the
basis vectors you provide, nor in the count dim.
One then tells the KSP object you are using what the null space is with the call
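sketched here:
MatSetNullSpace(Mat Amat, MatNullSpace nullsp);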
The Amat should be the first matrix argument used with KSPSetOperators(), SNESSetJacobian(),
or TSSetIJacobian(). The PETSc solvers will now handle the null space during the solution process.
If the right hand side of the linear system is not in the range of Amat, that is, it is not orthogonal to the null space
of Amat transpose, then the residual norm of the Krylov iteration will not converge to zero; it will converge
to a non-zero value while the solution is converging to the least squares solution of the linear system. One
can, if one desires, apply MatNullSpaceRemove() with the null space of Amat transpose to the right
hand side before calling KSPSolve(). Then the residual norm will converge to zero.
If one chooses a direct solver (or an incomplete factorization) it may still detect a zero pivot. You can
run with the additional options -pc_factor_shift_type NONZERO -pc_factor_shift_amount
<dampingfactor> to prevent the zero pivot. A good choice for the dampingfactor is 1.e-10.
If the matrix is non-symmetric and you wish to solve the transposed linear system you must provide the null
space of the transposed matrix with MatSetTransposeNullSpace().
PETSc interfaces to several external linear solvers (also see acknowledgements). To use these solvers,
one may:
1. Run configure with the additional options --download-packagename e.g.
--download-superlu_dist --download-parmetis (SuperLU_DIST needs ParMetis) or
--download-mumps --download-scalapack (MUMPS requires ScaLAPACK).
2. Build the PETSc libraries.
3. Use the runtime option: -ksp_type preonly (or equivalently -ksp_type none) -pc_type
<pctype> -pc_factor_mat_solver_type <packagename>. For example: -ksp_type preonly
-pc_type lu -pc_factor_mat_solver_type superlu_dist.
[Table: Options for External Solvers — PETSc matrix types (aij, baij, sbaij, aijcrl, aijperm, seqdense, ...) versus the supported external solver packages and factorization types; the table structure was lost in extraction and is not reproduced here.]
The default and available input options for each external software can be found by specifying -help at
runtime.
As an alternative to using runtime flags to employ these external packages, procedural calls are provided for
some packages. For example, the following procedural calls are equivalent to runtime options -ksp_type
preonly (or equivalently -ksp_type none) -pc_type lu -pc_factor_mat_solver_type mumps
-mat_mumps_icntl_7 3:
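A sketch of the equivalent procedural calls (assuming a MUMPS-enabled build; ksp, pc, and F are the user's handles):
KSPSetType(ksp, KSPPREONLY);
KSPGetPC(ksp, &pc);
PCSetType(pc, PCLU);
PCFactorSetMatSolverType(pc, MATSOLVERMUMPS);
PCFactorSetUpMatSolverType(pc); /* create the MUMPS factor matrix so options can be set on it */
PCFactorGetMatrix(pc, &F);
MatMumpsSetIcntl(F, 7, 3);      /* equivalent of -mat_mumps_icntl_7 3 */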
One can also create matrices with the appropriate capabilities by calling MatCreate() followed by Mat-
SetType() specifying the desired matrix type from Options for External Solvers. These matrix types inherit
capabilities from their PETSc matrix parents: seqaij, mpiaij, etc. As a result, the preallocation rou-
tines MatSeqAIJSetPreallocation(), MatMPIAIJSetPreallocation(), etc. and any other type
specific routines of the base class are supported. One can also call MatConvert() in place to convert the
matrix to and from its base class without performing an expensive data copy. MatConvert() cannot be
called on matrices that have already been factored.
In Options for External Solvers, the base class aij refers to the fact that inheritance is based on MATSEQAIJ
when constructed with a single process communicator, and from MATMPIAIJ otherwise. The same holds for
baij and sbaij. For codes that are intended to be run as both a single process or with multiple processes,
depending on the mpiexec command, it is recommended that both sets of preallocation routines are called
for these communicator morphing types. The call for the incorrect type will simply be ignored without any
harm or message.
Using PETSc’s MPI linear solver server it is possible to use multiple MPI processes to solve a linear system when the application code, including the matrix generation, is run on a single MPI rank (with or without OpenMP). The application code must be built with MPI and must call PetscInitialize() at the very
beginning of the program and end with PetscFinalize(). The application code may utilize OpenMP.
The code may create multiple matrices and KSP objects and call KSPSolve(); similarly, the code may utilize
the SNES nonlinear solvers, the TS ODE integrators, and the TAO optimization algorithms which use KSP.
Amdahl’s law makes clear that parallelizing only a portion of a numerical code can only provide a limited
improvement in the computation time; thus it is crucial to understand what phases of a computation must
be parallelized (via MPI, OpenMP, or some other model) to ensure a useful increase in performance. One
of the crucial phases is likely the generation of the matrix entries; the use of MatSetPreallocationCOO() and
MatSetValuesCOO() in an OpenMP code allows parallelizing the generation of the matrix.
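A minimal sketch of COO-based assembly (the arrays coo_i, coo_j, and vals are illustrative; in an application they would be filled, possibly by an OpenMP-parallel loop):
/* three nonzeros of a tiny 2x2 matrix, given in coordinate (COO) form */
PetscInt    coo_i[] = {0, 0, 1};
PetscInt    coo_j[] = {0, 1, 1};
PetscScalar vals[]  = {2.0, -1.0, 2.0};
Mat         A;

MatCreate(PETSC_COMM_SELF, &A);
MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 2, 2);
MatSetFromOptions(A);
MatSetPreallocationCOO(A, 3, coo_i, coo_j); /* set the nonzero pattern once */
MatSetValuesCOO(A, vals, INSERT_VALUES);    /* values can be (re)filled repeatedly */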
The program must then be launched using the standard approaches for launching MPI programs with the option -mpi_linear_solver_server and options to utilize the PCMPI preconditioners; for example, -ksp_type preonly and -pc_type mpi. Any standard solver options may be passed to the parallel solvers using the options prefix -mpi_; for example, -mpi_ksp_type cg. The option -mpi_linear_solver_server_view will print a summary of all the systems solved by the MPI linear solver server.
Note: This chapter is being cleaned up by Jed Brown. Contributions are welcome.
The solution of large-scale nonlinear problems pervades many facets of computational science and demands
robust and flexible solution strategies. The SNES library of PETSc provides a powerful suite of data-
structure-neutral numerical routines for such problems. Built on top of the linear solvers and data structures
discussed in preceding chapters, SNES enables the user to easily customize the nonlinear solvers according to
the application at hand. Also, the SNES interface is identical for the uniprocess and parallel cases; the only
difference in the parallel version is that each process typically forms only its local contribution to various
matrices and vectors.
The SNES class includes methods for solving systems of nonlinear equations of the form
F(x) = 0, (2.3)
where F : ℝ^n → ℝ^n. Newton-like methods provide the core of the package, including both line search and
trust region techniques. A suite of nonlinear Krylov methods and methods based upon problem decompo-
sition are also included. The solvers are discussed further in The Nonlinear Solvers. Following the PETSc
design philosophy, the interfaces to the various solvers are all virtually identical. In addition, the SNES
software is completely flexible, so that the user can at runtime change any facet of the solution process.
PETSc’s default method for solving the nonlinear equation is Newton’s method. The general form of the n-dimensional Newton’s method for solving (2.3) is

x_{k+1} = x_k − J(x_k)^{-1} F(x_k),  k = 0, 1, . . . (2.4)

where x_0 is an initial approximation to the solution and J(x_k) = F′(x_k), the Jacobian, is nonsingular at each iteration. In practice, the Newton iteration (2.4) is implemented by the following two steps:
1. (Approximately) solve J(x_k) Δx_k = −F(x_k).
2. Update x_{k+1} = x_k + Δx_k.
Other defect-correction algorithms can be implemented by using different choices for J(xk ).
In the simplest usage of the nonlinear solvers, the user must merely provide a C, C++, or Fortran routine
to evaluate the nonlinear function (2.3). The corresponding Jacobian matrix can be approximated with
finite differences. For better efficiency and accuracy, however, the user can provide a routine to
compute the Jacobian; details regarding these application-provided routines are discussed below. To provide
an overview of the use of the nonlinear solvers, browse the concrete example in ex1.c or skip ahead to the
discussion.
Listing: src/snes/tutorials/ex1.c
/*
Include "petscsnes.h" so that we can use SNES solvers. Note that this
file automatically includes:
petscsys.h - base PETSc routines petscvec.h - vectors
petscmat.h - matrices
petscis.h - index sets petscksp.h - Krylov subspace methods
petscviewer.h - viewers petscpc.h - preconditioners
petscksp.h - linear solvers
*/
/*F
This example solves either
\begin{equation}
F\genfrac{(}{)}{0pt}{}{x_0}{x_1} = \genfrac{(}{)}{0pt}{}{x^2_0 + x_0 x_1 - 3}{x_0 x_1 + x^2_1 - 6}
\end{equation}
or if the {\tt -hard} options is given
\begin{equation}
F\genfrac{(}{)}{0pt}{}{x_0}{x_1} = \genfrac{(}{)}{0pt}{}{\sin(3 x_0) + x_0}{x_1}
\end{equation}
F*/
#include <petscsnes.h>
/*
User-defined routines
*/
extern PetscErrorCode FormJacobian1(SNES, Vec, Mat, Mat, void *);
extern PetscErrorCode FormFunction1(SNES, Vec, Vec, void *);
extern PetscErrorCode FormJacobian2(SNES, Vec, Mat, Mat, void *);
int main(int argc, char **argv)
{
  SNES        snes; /* nonlinear solver context */
  KSP         ksp;  /* linear solver context */
  PC          pc;   /* preconditioner context */
  Vec         x, r; /* solution, residual vectors */
  Mat         J;    /* Jacobian matrix */
  PetscMPIInt size;
  PetscScalar pfive = .5, *xx;
  PetscBool   flg;
  PetscFunctionBeginUser;
PetscCall(PetscInitialize(&argc, &argv, (char *)0, help));
PetscCallMPI(MPI_Comm_size(PETSC_COMM_WORLD, &size));
PetscCheck(size == 1, PETSC_COMM_WORLD, PETSC_ERR_WRONG_MPI_SIZE, "Example is only for sequential runs");
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create nonlinear solver context
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(SNESCreate(PETSC_COMM_WORLD, &snes));
PetscCall(SNESSetType(snes, SNESNEWTONLS));
PetscCall(SNESSetOptionsPrefix(snes, "mysolver_"));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create matrix and vector data structures; set corresponding routines
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create vectors for solution and nonlinear function
*/
PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
PetscCall(VecSetSizes(x, PETSC_DECIDE, 2));
PetscCall(VecSetFromOptions(x));
PetscCall(VecDuplicate(x, &r));
/*
Create Jacobian matrix data structure
*/
PetscCall(MatCreate(PETSC_COMM_WORLD, &J));
PetscCall(MatSetSizes(J, PETSC_DECIDE, PETSC_DECIDE, 2, 2));
PetscCall(MatSetFromOptions(J));
PetscCall(MatSetUp(J));
/*
Set Jacobian matrix data structure and Jacobian evaluation routine
*/
PetscCall(SNESSetJacobian(snes, J, J, FormJacobian1, NULL));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Customize nonlinear solver; set runtime options
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Set linear solver defaults for this problem. By extracting the
KSP and PC contexts from the SNES context, we can then
directly call any KSP and PC routines to set various options.
*/
PetscCall(SNESGetKSP(snes, &ksp));
PetscCall(KSPGetPC(ksp, &pc));
PetscCall(PCSetType(pc, PCNONE));
PetscCall(KSPSetTolerances(ksp, 1.e-4, PETSC_DEFAULT, PETSC_DEFAULT, 20));
/*
Set SNES/KSP/KSP/PC runtime options, e.g.,
-snes_view -snes_monitor -ksp_type <ksp> -pc_type <pc>
These options will override those specified above as long as
SNESSetFromOptions() is called _after_ any other customization
routines.
*/
PetscCall(SNESSetFromOptions(snes));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Evaluate initial guess; then solve nonlinear system
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
if (!flg) {
PetscCall(VecSet(x, pfive));
} else {
PetscCall(VecGetArray(x, &xx));
xx[0] = 2.0;
xx[1] = 3.0;
PetscCall(VecRestoreArray(x, &xx));
}
/*
Note: The user should initialize the vector, x, with the initial guess
for the nonlinear solver prior to calling SNESSolve(). In particular,
to employ an initial guess of zero, the user should explicitly set
this vector to zero by calling VecSet().
*/
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Free work space. All PETSc objects should be destroyed when they
are no longer needed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(VecDestroy(&x));
PetscCall(VecDestroy(&r));
PetscCall(MatDestroy(&J));
PetscCall(SNESDestroy(&snes));
PetscCall(PetscFinalize());
return 0;
}
/* ------------------------------------------------------------------- */
/*
FormFunction1 - Evaluates nonlinear function, F(x).
Input Parameters:
. snes - the SNES context
. x - input vector
. ctx - optional user-defined context
Output Parameter:
. f - function vector
*/
PetscErrorCode FormFunction1(SNES snes, Vec x, Vec f, void *ctx)
{
const PetscScalar *xx;
PetscScalar *ff;
PetscFunctionBeginUser;
/*
Get pointers to vector data.
- For default PETSc vectors, VecGetArray() returns a pointer to
the data array. Otherwise, the routine is implementation dependent.
- You MUST call VecRestoreArray() when you no longer need access to
the array.
*/
PetscCall(VecGetArrayRead(x, &xx));
PetscCall(VecGetArray(f, &ff));
/* Compute function */
ff[0] = xx[0] * xx[0] + xx[0] * xx[1] - 3.0;
ff[1] = xx[0] * xx[1] + xx[1] * xx[1] - 6.0;
/* Restore vectors */
PetscCall(VecRestoreArrayRead(x, &xx));
PetscCall(VecRestoreArray(f, &ff));
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
FormJacobian1 - Evaluates Jacobian matrix.
Input Parameters:
. snes - the SNES context
. x - input vector
. dummy - optional user-defined context (not used here)
Output Parameters:
. jac - Jacobian matrix
. B - optionally different matrix used to construct the preconditioner
*/
PetscErrorCode FormJacobian1(SNES snes, Vec x, Mat jac, Mat B, void *dummy)
{
  const PetscScalar *xx;
  PetscScalar        A[4];
  PetscInt           idx[2] = {0, 1};
  PetscFunctionBeginUser;
/*
Get pointer to vector data
*/
PetscCall(VecGetArrayRead(x, &xx));
/*
Compute Jacobian entries and insert into matrix.
- Since this is such a small problem, we set all entries for
the matrix at once.
*/
A[0] = 2.0 * xx[0] + xx[1];
A[1] = xx[0];
A[2] = xx[1];
A[3] = xx[0] + 2.0 * xx[1];
PetscCall(MatSetValues(B, 2, idx, 2, idx, A, INSERT_VALUES));
/*
Restore vector
*/
PetscCall(VecRestoreArrayRead(x, &xx));
/*
Assemble matrix
*/
PetscCall(MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY));
if (jac != B) {
PetscCall(MatAssemblyBegin(jac, MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(jac, MAT_FINAL_ASSEMBLY));
}
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
PetscErrorCode FormFunction2(SNES snes, Vec x, Vec f, void *dummy)
{
const PetscScalar *xx;
PetscScalar *ff;
PetscFunctionBeginUser;
/*
Get pointers to vector data.
- For default PETSc vectors, VecGetArray() returns a pointer to
the data array. Otherwise, the routine is implementation dependent.
- You MUST call VecRestoreArray() when you no longer need access to
the array.
*/
PetscCall(VecGetArrayRead(x, &xx));
PetscCall(VecGetArray(f, &ff));
/*
Compute function
*/
ff[0] = PetscSinScalar(3.0 * xx[0]) + xx[0];
ff[1] = xx[1];
/*
Restore vectors
*/
PetscCall(VecRestoreArrayRead(x, &xx));
PetscCall(VecRestoreArray(f, &ff));
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
PetscErrorCode FormJacobian2(SNES snes, Vec x, Mat jac, Mat B, void *dummy)
{
const PetscScalar *xx;
PetscScalar A[4];
PetscInt idx[2] = {0, 1};
PetscFunctionBeginUser;
/*
Get pointer to vector data
*/
PetscCall(VecGetArrayRead(x, &xx));
/*
Compute Jacobian entries and insert into matrix.
- Since this is such a small problem, we set all entries for
the matrix at once.
*/
A[0] = 3.0 * PetscCosScalar(3.0 * xx[0]) + 1.0;
A[1] = 0.0;
A[2] = 0.0;
A[3] = 1.0;
PetscCall(MatSetValues(B, 2, idx, 2, idx, A, INSERT_VALUES));
/*
Restore vector
*/
PetscCall(VecRestoreArrayRead(x, &xx));
/*
Assemble matrix
*/
PetscCall(MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY));
if (jac != B) {
PetscCall(MatAssemblyBegin(jac, MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(jac, MAT_FINAL_ASSEMBLY));
}
PetscFunctionReturn(PETSC_SUCCESS);
}
The user must then set routines for evaluating the residual function (2.3) and its associated Jacobian matrix,
as discussed in the following sections.
To choose a nonlinear solution method, the user can either call
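sketched here:
SNESSetType(SNES snes, SNESType method);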
or use the option -snes_type <method>, where details regarding the available methods are presented in
The Nonlinear Solvers. The application code can take complete control of the linear and nonlinear techniques
used in the Newton-like method by calling
SNESSetFromOptions(snes);
This routine provides an interface to the PETSc options database, so that at runtime the user can select
a particular nonlinear solver, set various parameters and customized routines (e.g., specialized line search
variants), prescribe the convergence tolerance, and set monitoring routines. With this routine the user can
also control all linear solver options in the KSP, and PC modules, as discussed in KSP: Linear System Solvers.
After having set these routines and options, the user solves the problem by calling
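sketched here (b may be NULL for a zero right-hand side):
SNESSolve(SNES snes, Vec b, Vec x);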
where x should be initialized to the initial guess before calling and contains the solution on return. In
particular, to employ an initial guess of zero, the user should explicitly set this vector to zero by calling
VecZeroEntries(x). Finally, after solving the nonlinear system (or several systems), the user should
destroy the SNES context with
SNESDestroy(SNES *snes);
When solving a system of nonlinear equations, the user must provide a residual function (2.3), which is
set using
The argument f is an optional vector for storing the solution; pass NULL to have the SNES allocate it for
you. The argument ctx is an optional user-defined context, which can store any private, application-specific
data required by the function evaluation routine; NULL should be used if such information is not needed. In
C and C++, a user-defined context is merely a structure in which various objects can be stashed; in Fortran
a user context can be an integer array that contains both parameters and pointers to PETSc objects. SNES
Tutorial ex5 and SNES Tutorial ex5f90 give examples of user-defined application contexts in C and Fortran,
respectively.
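For instance, a sketch of a C application context (all names here are illustrative, not part of PETSc) and a residual routine that uses it:

#include <petscsnes.h>

/* Hypothetical application context: stash any problem data the residual needs. */
typedef struct {
  PetscReal lambda; /* illustrative model parameter */
} AppCtx;

/* Residual routine with the calling sequence expected by SNESSetFunction();
   the final argument is the void* context registered by the user. */
PetscErrorCode FormFunctionCtx(SNES snes, Vec x, Vec f, void *ctx)
{
  AppCtx            *user = (AppCtx *)ctx;
  const PetscScalar *xx;
  PetscScalar       *ff;

  PetscFunctionBeginUser;
  PetscCall(VecGetArrayRead(x, &xx));
  PetscCall(VecGetArray(f, &ff));
  ff[0] = PetscExpScalar(xx[0]) - user->lambda; /* F(x) = exp(x) - lambda */
  PetscCall(VecRestoreArrayRead(x, &xx));
  PetscCall(VecRestoreArray(f, &ff));
  PetscFunctionReturn(PETSC_SUCCESS);
}

It would be registered with SNESSetFunction(snes, r, FormFunctionCtx, &user), where user is an AppCtx owned by the application.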
Jacobian Evaluation
The user must also specify a routine to form some approximation of the Jacobian matrix, A, at the current
iterate, x, as is typically done with
SNESSetJacobian(SNES snes,Mat Amat,Mat Pmat,PetscErrorCode (*FormJacobian)(SNES snes,Vec x,Mat A,Mat B,void *ctx),void *ctx);
The arguments of the routine FormJacobian() are the current iterate, x; the (approximate) Jacobian
matrix, Amat; the matrix from which the preconditioner is constructed, Pmat (which is usually the same
as Amat); and an optional user-defined Jacobian context, ctx, for application-specific data. Note that the
SNES solvers are all data-structure neutral, so the full range of PETSc matrix formats (including “matrix-
free” methods) can be used. Matrices discusses information regarding available matrix formats and options,
while Matrix-Free Methods focuses on matrix-free methods in SNES. We briefly touch on a few details of
matrix usage that are particularly important for efficient use of the nonlinear solvers.
A common usage paradigm is to assemble the problem Jacobian in the preconditioner storage B, rather than
A. In the case where they are identical, as in many simulations, this makes no difference. However, it allows
us to check the analytic Jacobian we construct in FormJacobian() by passing the -snes_mf_operator
flag. This causes PETSc to approximate the Jacobian using finite differencing of the function evaluation
(discussed in Finite Difference Jacobian Approximations), and the analytic Jacobian becomes merely the
preconditioner. Even if the analytic Jacobian is incorrect, it is likely that the finite difference approximation
will converge, and thus this is an excellent method to verify the analytic Jacobian. Moreover, if the analytic
Jacobian is incomplete (some terms are missing or approximate), -snes_mf_operator may be used to
obtain the exact solution, where the Jacobian approximation has been transferred to the preconditioner.
One such approximate Jacobian comes from “Picard linearization” which writes the nonlinear system as
F(x) := A(x)x − b = 0
where A(x) usually contains the lower-derivative parts of the equation. For example, the nonlinear diffusion
problem
−∇ · (κ(u)∇u) = 0
would be linearized as
A(u) v ≃ −∇ · (κ(u)∇v).
Usually this linearization is simpler to implement than Newton and the linear problems are somewhat easier
to solve. In addition to using -snes_mf_operator with this approximation to the Jacobian, the Picard
iterative procedure can be performed by defining J(x) to be A(x). Sometimes this iteration exhibits better
global convergence than Newton linearization.
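A rough sketch of this usage (the routine name FormPicardJacobian is illustrative, not a PETSc symbol): assemble the Picard matrix A(x) into the preconditioning slot and let -snes_mf_operator supply the true Jacobian action by differencing the residual.

#include <petscsnes.h>

/* Assemble the Picard matrix A(x) into Pmat only; with -snes_mf_operator the
   Jacobian action applied inside the Krylov method is obtained matrix-free,
   and this matrix is used solely to build the preconditioner. */
PetscErrorCode FormPicardJacobian(SNES snes, Vec x, Mat Amat, Mat Pmat, void *ctx)
{
  PetscFunctionBeginUser;
  PetscCall(MatZeroEntries(Pmat));
  /* ... insert the entries of A(x) with MatSetValues() ... */
  PetscCall(MatAssemblyBegin(Pmat, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(Pmat, MAT_FINAL_ASSEMBLY));
  if (Amat != Pmat) {
    PetscCall(MatAssemblyBegin(Amat, MAT_FINAL_ASSEMBLY));
    PetscCall(MatAssemblyEnd(Amat, MAT_FINAL_ASSEMBLY));
  }
  PetscFunctionReturn(PETSC_SUCCESS);
}

This routine would be registered with SNESSetJacobian(snes, J, J, FormPicardJacobian, &user); omitting -snes_mf_operator instead gives the Picard iteration J(x) := A(x) mentioned above.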
During successive calls to FormJacobian(), the user can either insert new matrix contexts or reuse old
ones, depending on the application requirements. For many sparse matrix formats, reusing the old space (and
merely changing the matrix elements) is more efficient; however, if the matrix structure completely changes,
creating an entirely new matrix context may be preferable. Upon subsequent calls to the FormJacobian()
routine, the user may wish to reinitialize the matrix entries to zero by calling MatZeroEntries(). See
Other Matrix Operations for details on the reuse of the matrix context.
The directory $PETSC_DIR/src/snes/tutorials provides a variety of examples.
Sometimes a nonlinear solver may produce a step that is not within the domain of a given function, for
example one with a negative pressure. When this occurs one can call SNESSetFunctionDomainError()
or SNESSetJacobianDomainError() to indicate to SNES the step is not valid. See Variational Inequal-
ities for how to provide SNES with bounds on the variables to solve (differential) variational inequalities
and how to control properties of the line step computed.
As summarized in Table PETSc Nonlinear Solvers, SNES includes several Newton-like nonlinear solvers based
on line search techniques and trust region methods. Also provided are several nonlinear Krylov methods, as
well as nonlinear methods involving decompositions of the problem.
Each solver may have associated with it a set of options, which can be set with routines and options database
commands provided for this purpose. A complete list can be found by consulting the manual pages or by
running a program with the -help option; we discuss just a few in the sections below.
The method SNESNEWTONLS (-snes_type newtonls) provides a line search Newton method for solv-
ing systems of nonlinear equations. By default, this technique employs cubic backtracking [DennisJrS83].
Alternative line search techniques are listed in Table PETSc Line Search Methods.
Every SNES has a line search context of type SNESLineSearch that may be retrieved using
SNESGetLineSearch(SNES snes, SNESLineSearch *linesearch);
There are several default options for the line searches. The order of polynomial approximation may be set
with -snes_linesearch_order or
SNESLineSearchSetOrder(SNESLineSearch linesearch, PetscInt order);
for instance, 2 for quadratic or 3 for cubic. Sometimes, it may not be necessary to monitor the progress of
the nonlinear iteration. In this case, -snes_linesearch_norms or
SNESLineSearchSetComputeNorms(SNESLineSearch linesearch, PetscBool flg);
may be used to turn off function, step, and solution norm computation at the end of the linesearch.
The default line search for the line search Newton method, SNESLINESEARCHBT, involves several parameters,
which are set to defaults that are reasonable for many applications. The user can override the defaults by
using the following options:
• -snes_linesearch_alpha <alpha>
• -snes_linesearch_maxstep <max>
• -snes_linesearch_minlambda <tol>
Besides the backtracking linesearch, there are SNESLINESEARCHL2, which uses a polynomial secant mini-
mization of ||F (x)||2 , and SNESLINESEARCHCP, which minimizes F (x) · Y where Y is the search direction.
These are both potentially iterative line searches, which may be used to find a better-fitted steplength
in the case where a single secant search is not sufficient. The number of iterations may be set with
-snes_linesearch_max_it. In addition, the convergence criteria of the iterative line searches may be
set using function tolerances -snes_linesearch_rtol and -snes_linesearch_atol, and steplength
tolerance -snes_linesearch_ltol.
Custom line search types may either be defined using SNESLineSearchShell, or by creating a custom
user line search type in the model of the preexisting ones and registering it using
SNESLineSearchRegister(const char sname[], PetscErrorCode (*function)(SNESLineSearch));
The trust region method in SNES for solving systems of nonlinear equations, SNESNEWTONTR (-snes_type
newtontr), is taken from the MINPACK project [MoreSGH84]. Several parameters can be set to control
the variation of the trust region size during the solution process. In particular, the user can control the
initial trust region radius, computed by
∆ = ∆0 ‖F0 ‖2 .
A number of nonlinear Krylov methods are provided, including Nonlinear Richardson, conjugate gradient,
GMRES, and Anderson Mixing. These methods are described individually below. They are all instrumental
to PETSc’s nonlinear preconditioning.
Nonlinear Richardson. The nonlinear Richardson iteration merely takes the form of a line search-damped
fixed-point iteration of the form
xk+1 = xk − λF(xk ), k = 0, 1, . . . ,
where the default linesearch is SNESLINESEARCHL2. This simple solver is mostly useful as a nonlinear
smoother, or to provide line search stabilization to an inner method.
Nonlinear Conjugate Gradients. Nonlinear CG is equivalent to linear CG, but with the steplength
determined by line search (SNESLINESEARCHCP by default). Five variants (Fletcher-Reeves, Hestenes-Stiefel,
Polak-Ribière-Polyak, Dai-Yuan, and Conjugate Descent) are implemented in PETSc and may be chosen
using
SNESNCGSetType(SNES snes, SNESNCGType btype);
Anderson Mixing and Nonlinear GMRES Methods. Nonlinear GMRES and Anderson Mixing meth-
ods combine the last m iterates, plus a new fixed-point iteration iterate, into a residual-minimizing new
iterate.
Quasi-Newton Methods
Quasi-Newton methods store iterative rank-one updates to the Jacobian instead of computing it directly.
Three limited-memory quasi-Newton methods are provided (L-BFGS, Broyden, and “Bad” Broyden), which are described
in Table PETSc quasi-Newton solvers. These are all encapsulated under -snes_type qn and may be changed with
-snes_qn_type. The default is L-BFGS, which provides symmetric updates to an approximate Jacobian.
This iteration is similar to the line search Newton methods.
One may also control the form of the initial Jacobian approximation with
SNESQNSetScaleType(SNES snes, SNESQNScaleType stype);
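As a small sketch (assuming an already-created SNES; the particular type and scale-type choices are illustrative, not recommendations):

#include <petscsnes.h>

static PetscErrorCode ConfigureQuasiNewton(SNES snes)
{
  PetscFunctionBeginUser;
  PetscCall(SNESSetType(snes, SNESQN));                        /* -snes_type qn */
  PetscCall(SNESQNSetType(snes, SNES_QN_LBFGS));               /* -snes_qn_type lbfgs */
  PetscCall(SNESQNSetScaleType(snes, SNES_QN_SCALE_JACOBIAN)); /* form of the initial Jacobian approximation */
  PetscFunctionReturn(PETSC_SUCCESS);
}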
The Full Approximation Scheme is a nonlinear multigrid correction. At each level, there is a recursive cycle
control SNES instance, and either one or two nonlinear solvers as smoothers (up and down). Problems set
up using the SNES DMDA interface are automatically coarsened. FAS differs slightly from PCMG, in that the
hierarchy is constructed recursively. However, much of the interface is a one-to-one map. We describe the
“get” operations here, and it can be assumed that each has a corresponding “set” operation. For instance,
the number of levels in the hierarchy may be retrieved using
SNESFASGetLevels(SNES snes, PetscInt *levels);
Much like the interface to PCMG described in Multigrid Preconditioners, there are interfaces to recover the
various levels’ cycles and smoothers. The level smoothers may be accessed with
SNESFASGetSmoother(SNES snes, PetscInt level, SNES *smooth);
SNESFASGetSmootherUp(SNES snes, PetscInt level, SNES *smooth);
SNESFASGetSmootherDown(SNES snes, PetscInt level, SNES *smooth);
Also akin to PCMG, the restriction and prolongation at a level may be acquired with
SNESFASGetInterpolation(SNES snes, PetscInt level, Mat *mat);
SNESFASGetRestriction(SNES snes, PetscInt level, Mat *mat);
In addition, FAS requires special restriction for solution-like variables, called injection. This may be set with
SNESFASGetInjection(SNES snes, PetscInt level, Mat *mat);
Nonlinear Additive Schwarz methods (NASM) take a number of local nonlinear subproblems, solve them
independently in parallel, and combine those solutions into a new approximate solution.
SNESNASMSetSubdomains(SNES snes,PetscInt n,SNES subsnes[],VecScatter iscatter[],VecScatter oscatter[],VecScatter gscatter[]);
allows the user to create these local subdomains. Problems set up using the SNES DMDA interface are
automatically decomposed. To begin, the types of subdomain update to the whole solution are limited to two
types borrowed from PCASM: PC_ASM_BASIC, in which the overlapping updates are added, and PC_ASM_RESTRICT,
which updates in a nonoverlapping fashion. This may be set with
SNESNASMSetType(SNES snes,PCASMType type);
SNESASPIN is a helper SNES type that sets up a nonlinearly preconditioned Newton’s method using NASM
as the preconditioner.
This section discusses options and routines that apply to all SNES solvers and problem classes. In particular,
we focus on convergence tests, monitoring routines, and tools for checking derivative computations.
Convergence Tests
Convergence of the nonlinear solvers can be detected in a variety of ways; the user can even specify a
customized test, as discussed below. Most of the nonlinear solvers use SNESConvergenceTestDefault(),
however, SNESNEWTONTR uses a method-specific additional convergence test as well. The convergence tests
involve several parameters, which are set by default to values that should be reasonable for a wide range
of problems. The user can customize the parameters to the problem at hand by using some of the following
routines and options.
One method of convergence testing is to declare convergence when the norm of the change in the solution
between successive iterations is less than some tolerance, stol. Convergence can also be determined based
on the norm of the function. Such a test can use either the absolute size of the norm, atol, or its relative
decrease, rtol, from an initial guess. The following routine sets these parameters, which are used in many
of the default SNES convergence tests:
SNESSetTolerances(SNES snes,PetscReal atol,PetscReal rtol,PetscReal stol,PetscInt its,PetscInt fcts);
This routine also sets the maximum numbers of allowable nonlinear iterations, its, and function evaluations,
fcts. The corresponding options database commands for setting these parameters are:
• -snes_atol <atol>
• -snes_rtol <rtol>
• -snes_stol <stol>
• -snes_max_it <its>
• -snes_max_funcs <fcts>
A related routine is SNESGetTolerances().
Convergence tests for trust regions methods often use an additional parameter that indicates the minimum
allowable trust region radius. The user can set this parameter with the option -snes_tr_tol <trtol>
or with the routine
SNESSetTrustRegionTolerance(SNES snes,PetscReal trtol);
Users can set their own customized convergence tests in SNES by using the command
SNESSetConvergenceTest(SNES snes,PetscErrorCode (*test)(SNES snes,PetscInt it,PetscReal xnorm,PetscReal gnorm,PetscReal f,SNESConvergedReason *reason,void *cctx),void *cctx,PetscErrorCode (*destroy)(void *cctx));
The final argument of the convergence test routine, cctx, denotes an optional user-defined context for
private data. When solving systems of nonlinear equations, the arguments xnorm, gnorm, and f are the
current iterate norm, current step norm, and function norm, respectively. SNESConvergedReason should
be set positive for convergence and negative for divergence. See include/petscsnes.h for a list of values
for SNESConvergedReason.
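For illustration, a minimal sketch of such a test (the hard-coded tolerances and the name MyConvergenceTest are assumptions for this example only):

#include <petscsnes.h>

/* Declare convergence when the residual norm drops below 1e-10 and divergence
   after 50 iterations; cctx is the optional private context. */
static PetscErrorCode MyConvergenceTest(SNES snes, PetscInt it, PetscReal xnorm, PetscReal gnorm, PetscReal fnorm, SNESConvergedReason *reason, void *cctx)
{
  PetscFunctionBeginUser;
  *reason = SNES_CONVERGED_ITERATING;
  if (fnorm < 1.e-10) *reason = SNES_CONVERGED_FNORM_ABS;
  else if (it > 50)   *reason = SNES_DIVERGED_MAX_IT;
  PetscFunctionReturn(PETSC_SUCCESS);
}

It would be registered with SNESSetConvergenceTest(snes, MyConvergenceTest, NULL, NULL).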
Convergence Monitoring
By default the SNES solvers run silently without displaying information about the iterations. The user can
initiate monitoring with the command
SNESMonitorSet(SNES snes,PetscErrorCode (*mon)(SNES snes,PetscInt its,PetscReal norm,void *mctx),void *mctx,PetscErrorCode (*monitordestroy)(void **));
The routine, mon, indicates a user-defined monitoring routine, where its and mctx respectively denote
the iteration number and an optional user-defined context for private data for the monitor routine. The
argument norm is the function norm.
The routine set by SNESMonitorSet() is called once after every successful step computation within
the nonlinear solver. Hence, the user can employ this routine for any application-specific computations
that should be done after the solution update. The option -snes_monitor activates the default SNES
monitor routine, SNESMonitorDefault(), while -snes_monitor_lg_residualnorm draws a simple
line graph of the residual norm’s convergence.
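A minimal user monitor might look as follows (the routine name and output format are illustrative):

#include <petscsnes.h>

static PetscErrorCode MyMonitor(SNES snes, PetscInt its, PetscReal fnorm, void *mctx)
{
  PetscFunctionBeginUser;
  PetscCall(PetscPrintf(PetscObjectComm((PetscObject)snes), "  iteration %" PetscInt_FMT ": ||F|| = %g\n", its, (double)fnorm));
  PetscFunctionReturn(PETSC_SUCCESS);
}

It would be registered with SNESMonitorSet(snes, MyMonitor, NULL, NULL).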
One can cancel hardwired monitoring routines for SNES at runtime with -snes_monitor_cancel.
As the Newton method converges so that the residual norm is small, say 10−10 , many of the final digits
printed with the -snes_monitor option are meaningless. Worse, they are different on different machines
due to different round-off rules used by, say, the IBM RS6000 and the Sun SPARC. This makes testing
between different machines difficult. The option -snes_monitor_short causes PETSc to print fewer of
the digits of the residual norm as it gets smaller; thus on most of the machines it will always print the same
numbers making cross-process testing easier.
The routines
SNESGetSolution(SNES snes,Vec *x);
SNESGetFunction(SNES snes,Vec *r,PetscErrorCode (**f)(SNES,Vec,Vec,void*),void **ctx);
return the solution vector and function vector from a SNES context. These routines are useful, for instance, if
the convergence test requires some property of the solution or function other than those passed with routine
arguments.
Since hand-coding routines for Jacobian matrix evaluation can be error prone, SNES provides easy-to-use
support for checking these matrices against finite difference versions. In the simplest form of comparison,
users can employ the option -snes_test_jacobian to compare the matrices at several points. Although
not exhaustive, this test will generally catch obvious problems. One can compare the elements of the two
matrices by using the option -snes_test_jacobian_view , which causes the two matrices to be printed
to the screen.
Another means for verifying the correctness of a code for Jacobian computation is running the problem with
either the finite difference or matrix-free variant, -snes_fd or -snes_mf; see Finite Difference Jacobian
Approximations or Matrix-Free Methods. If a problem converges well with these matrix approximations but
not with a user-provided routine, the problem probably lies with the hand-coded matrix. See the note in
Jacobian Evaluation about assembling your Jacobian in the “preconditioner” slot Pmat.
The correctness of user-provided MATSHELL Jacobians in general can be checked with MatShellTest-
MultTranspose() and MatShellTestMult().
The correctness of user-provided MATSHELL Jacobians set via TSSetRHSJacobian() can be checked with
TSRHSJacobianTestTranspose() and TSRHSJacobianTest(), which check the correctness of the
matrix-transpose-vector product and the matrix-vector product. From the command line, these can be checked
with
• -ts_rhs_jacobian_test_mult_transpose
• -mat_shell_test_mult_transpose_view
• -ts_rhs_jacobian_test_mult
• -mat_shell_test_mult_view
Since exact solution of the linear Newton systems within (2.4) at each iteration can be costly, modifications
are often introduced that significantly reduce these expenses and yet retain the rapid convergence of Newton’s
method. Inexact or truncated Newton techniques approximately solve the linear systems using an iterative
scheme. In comparison with using direct methods for solving the Newton systems, iterative methods have
the virtue of requiring little space for matrix storage and potentially saving significant computational work.
Within the class of inexact Newton methods, of particular interest are Newton-Krylov methods, where the
subsidiary iterative technique for solving the Newton system is chosen from the class of Krylov subspace
projection methods. Note that at runtime the user can set any of the linear solver options discussed in KSP:
Linear System Solvers, such as -ksp_type <ksp_method> and -pc_type <pc_method>, to set the
Krylov subspace and preconditioner methods.
Two levels of iterations occur for the inexact techniques, where during each global or outer Newton iteration
a sequence of subsidiary inner iterations of a linear solver is performed. Appropriate control of the accuracy
to which the subsidiary iterative method solves the Newton system at each global iteration is critical, since
these inner iterations determine the asymptotic convergence rate for inexact Newton techniques. While the
Newton systems must be solved well enough to retain fast local convergence of the Newton’s iterates, use of
excessive inner iterations, particularly when ‖xk − x∗ ‖ is large, is neither necessary nor economical. Thus,
the number of required inner iterations typically increases as the Newton process progresses, so that the
truncated iterates approach the true Newton iterates.
A sequence of nonnegative numbers {ηk } can be used to indicate the variable convergence criterion. In
this case, when solving a system of nonlinear equations, the update step of the Newton process remains
unchanged, and direct solution of the linear system is replaced by iteration on the system until the residuals
satisfy
‖r_k^(i)‖ / ‖F(x_k)‖ ≤ η_k ≤ η < 1.
Here x0 is an initial approximation of the solution, and ‖ · ‖ denotes an arbitrary norm in ℝ^n .
By default a constant relative convergence tolerance is used for solving the subsidiary linear systems within
the Newton-like methods of SNES. When solving a system of nonlinear equations, one can instead employ the
techniques of Eisenstat and Walker [EW96] to compute ηk at each step of the nonlinear solver by using the
option -snes_ksp_ew . In addition, by adding one’s own KSP convergence test (see Convergence Tests),
one can easily create one’s own, problem-dependent, inner convergence tests.
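The same strategy can be enabled from the code rather than the options database; a minimal sketch, assuming an existing SNES object:

#include <petscsnes.h>

/* Equivalent to the -snes_ksp_ew option; the Eisenstat-Walker parameters are left
   at their defaults here. */
static PetscErrorCode UseEisenstatWalker(SNES snes)
{
  PetscFunctionBeginUser;
  PetscCall(SNESKSPSetUseEW(snes, PETSC_TRUE));
  PetscFunctionReturn(PETSC_SUCCESS);
}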
The SNES class fully supports matrix-free methods. The matrices specified in the Jacobian evaluation
routine need not be conventional matrices; instead, they can point to the data required to implement a
particular matrix-free method. The matrix-free variant is allowed only when the linear systems are solved by
an iterative method in combination with no preconditioning (PCNONE or -pc_type none), a user-provided
preconditioner matrix, or a user-provided preconditioner shell (PCSHELL, discussed in Preconditioners); that
is, obviously matrix-free methods cannot be used with a direct solver, approximate factorization, or other
preconditioner which requires access to explicit matrix entries.
The user can create a matrix-free context for use within SNES with the routine
MatCreateSNESMF(SNES snes,Mat *mat);
This routine creates the data structures needed for the matrix-vector products that arise within Krylov
space iterative methods [BS90]. The default SNES matrix-free approximations can also be invoked with
the command -snes_mf. Or, one can retain the user-provided Jacobian preconditioner, but replace the
user-provided Jacobian matrix with the default matrix free variant with the option -snes_mf_operator.
MatCreateSNESMF() uses
MatCreateMFFD(Vec x, Mat *mat);
which can also be used directly for users who need a matrix-free matrix but are not using SNES.
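A rough usage sketch (FormPrecJacobian is a hypothetical user routine assumed to assemble only the preconditioning matrix Jpre):

#include <petscsnes.h>

extern PetscErrorCode FormPrecJacobian(SNES, Vec, Mat, Mat, void *); /* assumed user routine */

/* Apply the Jacobian matrix-free while an assembled matrix Jpre is used to
   build the preconditioner. */
static PetscErrorCode UseMatrixFreeOperator(SNES snes, Mat Jpre, void *user)
{
  Mat Jmf;

  PetscFunctionBeginUser;
  PetscCall(MatCreateSNESMF(snes, &Jmf));
  PetscCall(SNESSetJacobian(snes, Jmf, Jpre, FormPrecJacobian, user));
  PetscCall(MatDestroy(&Jmf)); /* the SNES keeps its own reference */
  PetscFunctionReturn(PETSC_SUCCESS);
}

This is the hand-coded analogue of simply passing -snes_mf_operator at run time.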
The user can set one parameter to control the Jacobian-vector product approximation with the command
MatMFFDSetFunctionError(Mat mat,PetscReal rerror);
The parameter rerror should be set to the square root of the relative error in the function evaluations,
erel ; the default is the square root of machine epsilon (about 10−8 in double precision), which assumes that
the functions are evaluated to full floating-point precision accuracy. This parameter can also be set from the
options database with -mat_mffd_err <err>
In addition, PETSc provides ways to register new routines to compute the differencing parameter (h); see
the manual page for MatMFFDSetType() and MatMFFDRegister(). We currently provide two default
routines accessible via -mat_mffd_type <ds or wp>. For the default approach there is one “tuning”
parameter, set with
MatMFFDDSSetUmin(Mat mat,PetscReal umin);
This parameter, umin (or umin ), is a bit involved; its default is 10−6 . Its command line form is
-mat_mffd_umin <umin>.
The Jacobian-vector product is approximated via the formula
F′(u)a ≈ (F(u + h·a) − F(u)) / h,
where h is computed via
h = erel · u^T a / ‖a‖2^2                     if |u^T a| > umin ‖a‖1
h = erel · umin · sign(u^T a) · ‖a‖1 / ‖a‖2^2 otherwise.
This approach is taken from Brown and Saad [BS90]. The second approach, taken from Walker and Pernice,
[PW98], computes h via
h = erel · √(1 + ‖u‖) / ‖a‖.
This has no tunable parameters, but note that inside the nonlinear solve, for the entire linear iterative
process, u does not change; hence √(1 + ‖u‖) need be computed only once. This information may be set with
the options
MatMFFDWPSetComputeNormU(Mat mat,PetscBool flg);
We include an explicit example of using matrix-free methods in ex3.c. Note that by using the option
-snes_mf one can easily convert any SNES code to use a matrix-free Newton-Krylov method without a
preconditioner. As shown in this example, SNESSetFromOptions() must be called after SNESSetJaco-
bian() to enable runtime switching between the user-specified Jacobian and the default SNES matrix-free
form.
Listing: src/snes/tutorials/ex3.c
static char help[] = "Newton methods to solve u'' + u^{2} = f in parallel.\n\
This example employs a user-defined monitoring routine and optionally a user-defined\n\
routine to check candidate iterates produced by line search routines.\n\n";
/*
Include "petscdm.h" so that we can use data management objects (DMs)
Include "petscdmda.h" so that we can use distributed arrays (DMDAs).
Include "petscsnes.h" so that we can use SNES solvers. Note that this
file automatically includes:
petscsys.h - base PETSc routines
petscvec.h - vectors
petscmat.h - matrices
petscis.h - index sets
petscksp.h - Krylov subspace methods
petscviewer.h - viewers
petscpc.h - preconditioners
petscksp.h - linear solvers
*/
#include <petscdm.h>
#include <petscdmda.h>
#include <petscsnes.h>
/*
User-defined routines.
*/
PetscErrorCode FormJacobian(SNES, Vec, Mat, Mat, void *);
PetscErrorCode FormFunction(SNES, Vec, Vec, void *);
PetscErrorCode FormInitialGuess(Vec);
PetscErrorCode Monitor(SNES, PetscInt, PetscReal, void *);
PetscErrorCode PreCheck(SNESLineSearch, Vec, Vec, PetscBool *, void *);
PetscErrorCode PostCheck(SNESLineSearch, Vec, Vec, Vec, PetscBool *, PetscBool *, void *);
PetscErrorCode PostSetSubKSP(SNESLineSearch, Vec, Vec, Vec, PetscBool *, PetscBool *, void *);
PetscErrorCode MatrixFreePreconditioner(PC, Vec, Vec);
/*
User-defined application context
*/
typedef struct {
DM da; /* distributed array */
Vec F; /* right-hand-side of PDE */
PetscMPIInt rank; /* rank of processor */
PetscMPIInt size; /* size of communicator */
PetscReal h; /* mesh spacing */
PetscBool sjerr; /* whether or not to test jacobian domain error */
} ApplicationCtx;
/*
User-defined context for monitoring
*/
typedef struct {
PetscViewer viewer;
} MonitorCtx;
/*
User-defined context for checking candidate iterates that are
determined by line search methods
*/
typedef struct {
Vec last_step; /* previous iterate */
PetscReal tolerance; /* tolerance for changes between successive iterates */
ApplicationCtx *user;
} StepCheckCtx;
typedef struct {
PetscInt its0; /* num of previous outer KSP iterations */
} SetSubKSPCtx;
int main(int argc, char **argv)
{
SNES snes; /* SNES context */
SNESLineSearch linesearch; /* SNESLineSearch context */
Mat J; /* Jacobian matrix */
ApplicationCtx ctx; /* user-defined context */
Vec x, r, U, F; /* vectors */
MonitorCtx monP; /* monitoring context */
StepCheckCtx checkP; /* step-checking context */
PetscBool post_check = PETSC_FALSE; /* activate the post-step checking routine? */
PetscScalar xp, *FF, *UU, none = -1.0;
PetscInt its, N = 5, i, maxit, maxf, xs, xm;
PetscReal abstol, rtol, stol, norm;
PetscBool flg, viewinitial = PETSC_FALSE;
PetscFunctionBeginUser;
PetscCall(PetscInitialize(&argc, &argv, (char *)0, help));
PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &ctx.rank));
PetscCallMPI(MPI_Comm_size(PETSC_COMM_WORLD, &ctx.size));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &N, NULL));
ctx.h = 1.0 / (N - 1);
ctx.sjerr = PETSC_FALSE;
PetscCall(PetscOptionsGetBool(NULL, NULL, "-test_jacobian_domain_error", &ctx.sjerr, NULL));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create nonlinear solver context
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(SNESCreate(PETSC_COMM_WORLD, &snes));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create vector data structures; set function evaluation routine
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create distributed array (DMDA) to manage parallel grid and vectors
*/
PetscCall(DMDACreate1d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, N, 1, 1, NULL, &ctx.da));
PetscCall(DMSetFromOptions(ctx.da));
PetscCall(DMSetUp(ctx.da));
/*
Extract global and local vectors from DMDA; then duplicate for remaining
vectors that are the same types
*/
PetscCall(DMCreateGlobalVector(ctx.da, &x));
PetscCall(VecDuplicate(x, &r));
PetscCall(VecDuplicate(x, &F));
ctx.F = F;
PetscCall(VecDuplicate(x, &U));
/*
Set function evaluation routine and vector. Whenever the nonlinear
solver needs to compute the nonlinear function, it will call this
routine.
- Note that the final routine argument is the user-defined
context that provides application-specific data for the
function evaluation routine.
*/
PetscCall(SNESSetFunction(snes, r, FormFunction, &ctx));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create matrix data structure; set Jacobian evaluation routine
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
PetscCall(MatCreate(PETSC_COMM_WORLD, &J));
PetscCall(MatSetSizes(J, PETSC_DECIDE, PETSC_DECIDE, N, N));
PetscCall(MatSetFromOptions(J));
PetscCall(MatSeqAIJSetPreallocation(J, 3, NULL));
PetscCall(MatMPIAIJSetPreallocation(J, 3, NULL, 3, NULL));
/*
Set Jacobian matrix data structure and default Jacobian evaluation
routine. Whenever the nonlinear solver needs to compute the
Jacobian matrix, it will call this routine.
- Note that the final routine argument is the user-defined
context that provides application-specific data for the
Jacobian evaluation routine.
*/
PetscCall(SNESSetJacobian(snes, J, J, FormJacobian, &ctx));
/*
Optionally allow user-provided preconditioner
*/
PetscCall(PetscOptionsHasName(NULL, NULL, "-user_precond", &flg));
if (flg) {
KSP ksp;
PC pc;
PetscCall(SNESGetKSP(snes, &ksp));
PetscCall(KSPGetPC(ksp, &pc));
PetscCall(PCSetType(pc, PCSHELL));
PetscCall(PCShellSetApply(pc, MatrixFreePreconditioner));
}
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Customize nonlinear solver; set runtime options
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Set an optional user-defined monitoring routine
*/
PetscCall(PetscViewerDrawOpen(PETSC_COMM_WORLD, 0, 0, 0, 0, 400, 400, &monP.viewer));
PetscCall(SNESMonitorSet(snes, Monitor, &monP, 0));
/*
Set names for some vectors to facilitate monitoring (optional)
*/
PetscCall(PetscObjectSetName((PetscObject)x, "Approximate Solution"));
PetscCall(PetscObjectSetName((PetscObject)U, "Exact Solution"));
/*
Set SNES/KSP/KSP/PC runtime options, e.g.,
-snes_view -snes_monitor -ksp_type <ksp> -pc_type <pc>
*/
PetscCall(SNESSetFromOptions(snes));
/*
Set an optional user-defined routine to check the validity of candidate
iterates that are determined by line search methods
*/
PetscCall(SNESGetLineSearch(snes, &linesearch));
PetscCall(PetscOptionsHasName(NULL, NULL, "-post_check_iterates", &post_check));
if (post_check) {
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Activating post step checking routine\n"));
PetscCall(SNESLineSearchSetPostCheck(linesearch, PostCheck, &checkP));
PetscCall(VecDuplicate(x, &checkP.last_step));
checkP.tolerance = 1.0;
checkP.user = &ctx;
}
/*
Print parameters used for convergence testing (optional) ... just
to demonstrate this routine; this information is also printed with
the option -snes_view
*/
PetscCall(SNESGetTolerances(snes, &abstol, &rtol, &stol, &maxit, &maxf));
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "atol=%g, rtol=%g, stol=%g, maxit=%"␣
,→PetscInt_FMT ", maxf=%" PetscInt_FMT "\n", (double)abstol, (double)rtol,␣
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Initialize application:
Store right-hand-side of PDE and exact solution
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Get local grid boundaries (for 1-dimensional DMDA):
xs, xm - starting grid index, width of local grid (no ghost points)
*/
PetscCall(DMDAGetCorners(ctx.da, &xs, NULL, NULL, &xm, NULL, NULL));
/*
Get pointers to vector data
*/
PetscCall(DMDAVecGetArray(ctx.da, F, &FF));
PetscCall(DMDAVecGetArray(ctx.da, U, &UU));
/*
Compute local vector entries
*/
xp = ctx.h * xs;
for (i = xs; i < xs + xm; i++) {
FF[i] = 6.0 * xp + PetscPowScalar(xp + 1.e-12, 6.0); /* f = u'' + u^2 for the exact solution u = x^3 */
UU[i] = xp * xp * xp;
xp += ctx.h;
}
/*
Restore vectors
*/
PetscCall(DMDAVecRestoreArray(ctx.da, F, &FF));
PetscCall(DMDAVecRestoreArray(ctx.da, U, &UU));
if (viewinitial) {
PetscCall(VecView(U, 0));
PetscCall(VecView(F, 0));
}
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Evaluate initial guess; then solve nonlinear system
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Note: The user should initialize the vector, x, with the initial guess
for the nonlinear solver prior to calling SNESSolve(). In particular,
to employ an initial guess of zero, the user should explicitly set
this vector to zero by calling VecSet().
*/
PetscCall(FormInitialGuess(x));
PetscCall(SNESSolve(snes, NULL, x));
PetscCall(SNESGetIterationNumber(snes, &its));
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Number of SNES iterations = %" PetscInt_
,→FMT "\n", its));
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Check solution and clean up
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Check the error
*/
PetscCall(VecAXPY(x, none, U));
PetscCall(VecNorm(x, NORM_2, &norm));
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Norm of error %g Iterations %" PetscInt_
,→FMT "\n", (double)norm, its));
if (ctx.sjerr) {
SNESType snestype;
PetscCall(SNESGetType(snes, &snestype));
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "SNES Type %s\n", snestype));
}
/*
Free work space. All PETSc objects should be destroyed when they
are no longer needed.
*/
PetscCall(PetscViewerDestroy(&monP.viewer));
if (post_check) PetscCall(VecDestroy(&checkP.last_step));
PetscCall(VecDestroy(&x));
PetscCall(VecDestroy(&r));
PetscCall(VecDestroy(&U));
PetscCall(VecDestroy(&F));
PetscCall(MatDestroy(&J));
PetscCall(SNESDestroy(&snes));
PetscCall(DMDestroy(&ctx.da));
PetscCall(PetscFinalize());
return 0;
}
/* ------------------------------------------------------------------- */
/*
FormInitialGuess - Computes initial guess.
Input/Output Parameter:
. x - the solution vector
*/
PetscErrorCode FormInitialGuess(Vec x)
{
PetscScalar pfive = .50;
PetscFunctionBeginUser;
PetscCall(VecSet(x, pfive));
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
FormFunction - Evaluates nonlinear function, F(x).
Input Parameters:
. snes - the SNES context
. x - input vector
. ctx - optional user-defined context, as set by SNESSetFunction()
Output Parameter:
. f - function vector
Note:
The user-defined context can contain any application-specific
data needed for the function evaluation.
*/
PetscErrorCode FormFunction(SNES snes, Vec x, Vec f, void *ctx)
{
ApplicationCtx *user = (ApplicationCtx *)ctx;
DM da = user->da;
PetscScalar *ff, d;
const PetscScalar *xx, *FF;
PetscInt i, M, xs, xm;
Vec xlocal;
PetscFunctionBeginUser;
PetscCall(DMGetLocalVector(da, &xlocal));
/*
Scatter ghost points to local vector, using the 2-step process
DMGlobalToLocalBegin(), DMGlobalToLocalEnd().
*/
PetscCall(DMGlobalToLocalBegin(da, x, INSERT_VALUES, xlocal));
PetscCall(DMGlobalToLocalEnd(da, x, INSERT_VALUES, xlocal));
/*
Get pointers to vector data.
- The vector xlocal includes ghost points; the vectors x and f do
NOT include ghost points.
- Using DMDAVecGetArray() allows accessing the values using global ordering
*/
PetscCall(DMDAVecGetArrayRead(da, xlocal, (void *)&xx));
PetscCall(DMDAVecGetArray(da, f, &ff));
PetscCall(DMDAVecGetArrayRead(da, user->F, (void *)&FF));
/*
Get local grid boundaries (for 1-dimensional DMDA):
xs, xm - starting grid index, width of local grid (no ghost points)
*/
PetscCall(DMDAGetCorners(da, &xs, NULL, NULL, &xm, NULL, NULL));
PetscCall(DMDAGetInfo(da, NULL, &M, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
/*
Set function values for boundary points; define local interior grid point range:
xsi - starting interior grid index
xei - ending interior grid index
*/
if (xs == 0) { /* left boundary */
ff[0] = xx[0];
xs++;
xm--;
}
if (xs + xm == M) { /* right boundary */
ff[xs + xm - 1] = xx[xs + xm - 1] - 1.0;
xm--;
}
/*
Compute function over locally owned part of the grid (interior points only)
*/
d = 1.0 / (user->h * user->h);
for (i = xs; i < xs + xm; i++) ff[i] = d * (xx[i - 1] - 2.0 * xx[i] + xx[i + 1]) + xx[i] * xx[i] - FF[i];
/*
Restore vectors
*/
PetscCall(DMDAVecRestoreArrayRead(da, xlocal, (void *)&xx));
PetscCall(DMDAVecRestoreArray(da, f, &ff));
PetscCall(DMDAVecRestoreArrayRead(da, user->F, (void *)&FF));
PetscCall(DMRestoreLocalVector(da, &xlocal));
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
FormJacobian - Evaluates Jacobian matrix.
Input Parameters:
. snes - the SNES context
. x - input vector
. ctx - optional user-defined context, as set by SNESSetJacobian()
Output Parameters:
. jac - Jacobian matrix
. B - optionally different preconditioning matrix
. flag - flag indicating matrix structure
*/
PetscErrorCode FormJacobian(SNES snes, Vec x, Mat jac, Mat B, void *ctx)
{
ApplicationCtx *user = (ApplicationCtx *)ctx;
PetscScalar *xx, d, A[3];
PetscInt i, j[3], M, xs, xm;
DM da = user->da;
PetscFunctionBeginUser;
if (user->sjerr) {
PetscCall(SNESSetJacobianDomainError(snes));
PetscFunctionReturn(PETSC_SUCCESS);
}
/*
Get pointer to vector data
*/
PetscCall(DMDAVecGetArrayRead(da, x, &xx));
PetscCall(DMDAGetCorners(da, &xs, NULL, NULL, &xm, NULL, NULL));
/*
Get range of locally owned matrix
*/
PetscCall(DMDAGetInfo(da, NULL, &M, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL));
/*
Determine starting and ending local indices for interior grid points.
Set Jacobian entries for boundary points.
*/
if (xs == 0) { /* left boundary */
i = 0;
A[0] = 1.0;
PetscCall(MatSetValues(jac, 1, &i, 1, &i, A, INSERT_VALUES));
xs++;
xm--;
}
if (xs + xm == M) { /* right boundary */
i = M - 1;
A[0] = 1.0;
PetscCall(MatSetValues(jac, 1, &i, 1, &i, A, INSERT_VALUES));
xm--;
}
/*
Interior grid points
- Note that in this case we set all elements for a particular
row at once.
*/
d = 1.0 / (user->h * user->h);
for (i = xs; i < xs + xm; i++) {
j[0] = i - 1;
j[1] = i;
j[2] = i + 1;
A[0] = A[2] = d;
A[1] = -2.0 * d + 2.0 * xx[i];
PetscCall(MatSetValues(jac, 1, &i, 3, j, A, INSERT_VALUES));
}
/*
Assemble matrix, using the 2-step process:
MatAssemblyBegin(), MatAssemblyEnd().
By placing code between these two statements, computations can be
done while messages are in transition.
Also, restore vector.
*/
PetscCall(MatAssemblyBegin(jac, MAT_FINAL_ASSEMBLY));
PetscCall(DMDAVecRestoreArrayRead(da, x, &xx));
PetscCall(MatAssemblyEnd(jac, MAT_FINAL_ASSEMBLY));
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
Monitor - Optional user-defined monitoring routine that views the
current iterate with an x-window plot. Set by SNESMonitorSet().
Input Parameters:
snes - the SNES context
its - iteration number
norm - 2-norm function value (may be estimated)
ctx - optional user-defined context for private data for the
monitor routine, as set by SNESMonitorSet()
Note:
See the manpage for PetscViewerDrawOpen() for useful runtime options,
such as -nox to deactivate all x-window output.
*/
PetscErrorCode Monitor(SNES snes, PetscInt its, PetscReal fnorm, void *ctx)
{
MonitorCtx *monP = (MonitorCtx *)ctx;
Vec x;
PetscFunctionBeginUser;
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "iter = %" PetscInt_FMT ",SNES Function␣
,→norm %g\n", its, (double)fnorm));
PetscCall(SNESGetSolution(snes, &x));
PetscCall(VecView(x, monP->viewer));
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
PreCheck - Optional user-defined routine that checks the validity of
candidate steps of a line search method. Set by SNESLineSearchSetPreCheck().
Input Parameters:
snes - the SNES context
xcurrent - current solution
y - search direction and length
Output Parameters:
y - proposed step (search direction and length) (possibly changed)
changed_y - tells if the step has changed or not
*/
PetscErrorCode PreCheck(SNESLineSearch linesearch, Vec xcurrent, Vec y, PetscBool *changed_y, void *ctx)
{
PetscFunctionBeginUser;
*changed_y = PETSC_FALSE;
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
PostCheck - Optional user-defined routine that checks the validity of
candidate steps of a line search method. Set by SNESLineSearchSetPostCheck().
Input Parameters:
snes - the SNES context
ctx - optional user-defined context for private data for the
monitor routine, as set by SNESLineSearchSetPostCheck()
xcurrent - current solution
y - search direction and length
x - the new candidate iterate
Output Parameters:
y - proposed step (search direction and length) (possibly changed)
x - current iterate (possibly modified)
*/
PetscErrorCode PostCheck(SNESLineSearch linesearch, Vec xcurrent, Vec y, Vec x, PetscBool *changed_y, PetscBool *changed_x, void *ctx)
{
PetscInt i, iter, xs, xm;
SNES snes;
ApplicationCtx *user;
StepCheckCtx *check;
PetscScalar *xa, *xa_last, tmp;
PetscReal rdiff;
DM da;
PetscFunctionBeginUser;
*changed_x = PETSC_FALSE;
*changed_y = PETSC_FALSE;
PetscCall(SNESLineSearchGetSNES(linesearch, &snes));
check = (StepCheckCtx *)ctx;
user = check->user;
PetscCall(SNESGetIterationNumber(snes, &iter));
if (iter > 0) {
da = user->da;
/*
Access local array data
*/
PetscCall(DMDAVecGetArray(da, check->last_step, &xa_last));
PetscCall(DMDAVecGetArray(da, x, &xa));
PetscCall(DMDAGetCorners(da, &xs, NULL, NULL, &xm, NULL, NULL));
/*
If we fail the user-defined check for validity of the candidate iterate,
then modify the iterate as we like. (Note that the particular modification
below is intended simply to demonstrate how to manipulate this data, not
as a meaningful or appropriate choice.)
*/
for (i = xs; i < xs + xm; i++) {
if (!PetscAbsScalar(xa[i])) rdiff = 2 * check->tolerance;
else rdiff = PetscAbsScalar((xa[i] - xa_last[i]) / xa[i]);
if (rdiff > check->tolerance) {
tmp = xa[i];
xa[i] = .5 * (xa[i] + xa_last[i]);
*changed_x = PETSC_TRUE;
PetscCall(PetscPrintf(PETSC_COMM_WORLD, " Altering entry %" PetscInt_FMT ":␣
,→x=%g, x_last=%g, diff=%g, x_new=%g\n", i, (double)PetscAbsScalar(tmp),␣
}
}
PetscCall(DMDAVecRestoreArray(da, check->last_step, &xa_last));
PetscCall(DMDAVecRestoreArray(da, x, &xa));
}
PetscCall(VecCopy(x, check->last_step));
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
PostSetSubKSP - Optional user-defined routine that resets SubKSP options when a hierarchical bjacobi PC is used,
e.g,
Input Parameters:
linesearch - the LineSearch context
xcurrent - current solution
y - search direction and length
x - the new candidate iterate
Output Parameters:
y - proposed step (search direction and length) (possibly changed)
x - current iterate (possibly modified)
*/
PetscErrorCode PostSetSubKSP(SNESLineSearch linesearch, Vec xcurrent, Vec y, Vec x, PetscBool *changed_y, PetscBool *changed_x, void *ctx)
{
SetSubKSPCtx *check;
PetscInt iter, its, sub_its, maxit;
KSP ksp, sub_ksp, *sub_ksps;
PC pc;
PetscReal ksp_ratio;
SNES snes;
PetscFunctionBeginUser;
PetscCall(SNESLineSearchGetSNES(linesearch, &snes));
check = (SetSubKSPCtx *)ctx;
PetscCall(SNESGetIterationNumber(snes, &iter));
PetscCall(SNESGetKSP(snes, &ksp));
PetscCall(KSPGetPC(ksp, &pc));
PetscCall(PCBJacobiGetSubKSP(pc, NULL, NULL, &sub_ksps));
sub_ksp = sub_ksps[0];
PetscCall(KSPGetIterationNumber(ksp, &its));         /* outer KSP iteration number */
PetscCall(KSPGetIterationNumber(sub_ksp, &sub_its)); /* inner KSP iteration number */
if (iter) {
PetscCall(PetscPrintf(PETSC_COMM_WORLD, " ...PostCheck snes iteration %" PetscInt_FMT ", ksp_it %" PetscInt_FMT " %" PetscInt_FMT ", subksp_it %" PetscInt_FMT "\n", iter, check->its0, its, sub_its));
}
check->its0 = its; /* save current outer KSP iteration number */
PetscFunctionReturn(PETSC_SUCCESS);
}
/* ------------------------------------------------------------------- */
/*
MatrixFreePreconditioner - User-provided preconditioner routine; this trivial version simply copies the input vector (i.e., applies the identity).
Input Parameters:
+ pc - preconditioner
- x - input vector
Output Parameter:
. y - preconditioned vector
*/
PetscErrorCode MatrixFreePreconditioner(PC pc, Vec x, Vec y)
{
PetscFunctionBeginUser;
PetscCall(VecCopy(x, y));
PetscFunctionReturn(PETSC_SUCCESS);
}
Table Jacobian Options summarizes the various matrix situations that SNES supports. In particular, different
linear system matrices and preconditioning matrices are allowed, as well as both matrix-free and application-
provided preconditioners. If ex3.c is run with the options -snes_mf and -user_precond then it uses a
matrix-free application of the Jacobian-vector product and a user-provided matrix-free preconditioner.
∗
Use either the generic MatCreate() or a format-specific variant such as MatCreateAIJ().
†
Set user-defined matrix formation routine with SNESSetJacobian() or with a DM variant such as
DMDASNESSetJacobianLocal()
SNES also provides some less well-integrated code to apply matrix-free finite differencing using an automat-
ically computed measurement of the noise of the functions. This can be selected with -snes_mf_version
2; it does not use MatCreateMFFD() but has similar options that start with -snes_mf_ instead of
-mat_mffd_. Note that this alternative prefix only works for version 2 differencing.
PETSc provides some tools to help approximate the Jacobian matrices efficiently via finite differences. These
tools are intended for use in certain situations where one is unable to compute Jacobian matrices analytically,
and matrix-free methods do not work well without a preconditioner, due to very poor conditioning. The
approximation requires several steps:
• First, one colors the columns of the (not yet built) Jacobian matrix, so that columns of the same color
do not share any common rows.
• Next, one creates a MatFDColoring data structure that will be used later in actually computing the
Jacobian.
• Finally, one tells the nonlinear solvers of SNES to use the SNESComputeJacobianDefaultColor()
routine to compute the Jacobians.
A code fragment that demonstrates this process is given below.
ISColoring iscoloring;
MatFDColoring fdcoloring;
MatColoring coloring;
/*
This initializes the nonzero structure of the Jacobian. This is artificial
because clearly if we had a routine to compute the Jacobian we wouldn't
need to use finite differences.
*/
FormJacobian(snes, x, J, J, &user);
/*
Color the matrix, i.e. determine groups of columns that share no common
rows. These columns in the Jacobian can all be computed simultaneously.
*/
MatColoringCreate(J, &coloring);
MatColoringSetType(coloring,MATCOLORINGSL);
MatColoringSetFromOptions(coloring);
MatColoringApply(coloring, &iscoloring);
MatColoringDestroy(&coloring);
/*
Create the data structure that SNESComputeJacobianDefaultColor() uses
to compute the actual Jacobians via finite differences.
*/
MatFDColoringCreate(J,iscoloring, &fdcoloring);
ISColoringDestroy(&iscoloring);
MatFDColoringSetFunction(fdcoloring,(PetscErrorCode (*)(void))FormFunction, &user);
MatFDColoringSetFromOptions(fdcoloring);
/*
Tell SNES to use the routine SNESComputeJacobianDefaultColor()
to compute Jacobians.
*/
SNESSetJacobian(snes,J,J,SNESComputeJacobianDefaultColor,fdcoloring);
Of course, we are cheating a bit. If we do not have an analytic formula for computing the Jacobian, then how
do we know what its nonzero structure is so that it may be colored? Determining the structure is problem
dependent, but fortunately, for most structured grid problems (the class of problems for which PETSc was
originally designed) if one knows the stencil used for the nonlinear function one can usually fairly easily
obtain an estimate of the location of nonzeros in the matrix. This is harder in the unstructured case, but
one typically knows where the nonzero entries are from the mesh topology and distribution of degrees of
freedom. If using DMPlex (DMPlex: Unstructured Grids) for unstructured meshes, the nonzero locations
will be identified in DMCreateMatrix() and the procedure above can be used. Most external packages for
unstructured meshes have similar functionality.
One need not necessarily use a MatColoring object to determine a coloring. For example, if a grid can be
colored directly (without using the associated matrix), then that coloring can be provided to MatFDCol-
oringCreate(). Note that the user must always preset the nonzero structure in the matrix regardless of
which coloring routine is used.
PETSc provides the following coloring algorithms, which can be selected using MatColoringSetType()
or via the command line argument -mat_coloring_type.
As for the matrix-free computation of Jacobians (Matrix-Free Methods), two parameters affect the accuracy
of the finite difference Jacobian approximation. These are set with the command
MatFDColoringSetParameters(MatFDColoring fdcoloring,PetscReal rerror,PetscReal umin);
The parameter rerror is the square root of the relative error in the function evaluations, erel ; the default
is the square root of machine epsilon (about 10−8 in double precision), which assumes that the functions are
evaluated approximately to floating-point precision accuracy. The second parameter, umin, is a bit more
involved; its default is 10−6 . Column i of the Jacobian matrix (denoted by F:i ) is approximated by the
formula
F′:i ≈ (F(u + h·dxi) − F(u)) / h,
where h is computed via
h = erel · ui                  if |ui| > umin
h = erel · umin · sign(ui)     otherwise
for MATMFFD_DS; MATMFFD_WP (the default) instead bases h on ‖u‖, as described in Matrix-Free Methods. These parameters may be set from the options database with
-mat_fd_coloring_err <err>
-mat_fd_coloring_umin <umin>
-mat_fd_type <htype>
Note that MatColoring type MATCOLORINGSL, MATCOLORINGLF, and MATCOLORINGID are sequential
algorithms. MATCOLORINGJP and MATCOLORINGGREEDY are parallel algorithms, although in practice they
may create more colors than the sequential algorithms. If one computes the coloring iscoloring reason-
ably with a parallel algorithm or by knowledge of the discretization, the routine MatFDColoringCreate()
is scalable. An example of this for 2D distributed arrays is given below that uses the utility routine DMCre-
ateColoring().
DMCreateColoring(da,IS_COLORING_GHOSTED, &iscoloring);
MatFDColoringCreate(J,iscoloring, &fdcoloring);
MatFDColoringSetFromOptions(fdcoloring);
ISColoringDestroy( &iscoloring);
Note that the routine MatFDColoringCreate() currently is only supported for the AIJ and BAIJ matrix
formats.
SNES can also solve (differential) variational inequalities with box (bound) constraints. These are nonlinear
algebraic systems with additional inequality constraints on some or all of the variables: Li ≤ ui ≤ Hi . For
example, the pressure variable cannot be negative. Some, or all, of the lower bounds may be negative infinity
(indicated to PETSc with SNES_VI_NINF) and some, or all, of the upper bounds may be infinity (indicated
by SNES_VI_INF). The commands
SNESVISetVariableBounds(SNES,Vec L,Vec H);
SNESVISetComputeVariableBounds(SNES snes, PetscErrorCode (*compute)(SNES,Vec,Vec))
are used to indicate that one is solving a variational inequality. Problems with box constraints can be solved
with the reduced space, SNESVINEWTONRSLS, and semi-smooth SNESVINEWTONSSLS solvers.
The option -snes_vi_monitor turns on extra monitoring of the active set associated with the bounds,
and -snes_vi_type allows selecting from several VI solvers; the default is preferred.
SNESLineSearchSetPreCheck() and SNESLineSearchSetPostCheck() can also be used to control
properties of the steps selected by SNES.
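A minimal sketch of supplying constant bounds (the bound values and routine name are illustrative; the SNES is assumed to use a VI solver such as SNESVINEWTONRSLS):

#include <petscsnes.h>

/* Impose 0 <= u_i <= +infinity on every variable; x provides the parallel layout. */
static PetscErrorCode SetNonNegativeBounds(SNES snes, Vec x)
{
  Vec L, H;

  PetscFunctionBeginUser;
  PetscCall(VecDuplicate(x, &L));
  PetscCall(VecDuplicate(x, &H));
  PetscCall(VecSet(L, 0.0));         /* lower bound, e.g. a pressure that must stay nonnegative */
  PetscCall(VecSet(H, SNES_VI_INF)); /* no upper bound */
  PetscCall(SNESVISetVariableBounds(snes, L, H));
  PetscCall(VecDestroy(&L));
  PetscCall(VecDestroy(&H));
  PetscFunctionReturn(PETSC_SUCCESS);
}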
A nonlinear preconditioner is itself a SNES instance that is used to modify the steps of an outer SNES; the inner
instance can be retrieved with SNESGetNPC(SNES snes,SNES *npc);
and passed run-time options using the -npc_ prefix. Nonlinear preconditioning comes in two flavors: left
and right. The side may be changed using -snes_npc_side or SNESSetNPCSide(). Left nonlinear
preconditioning redefines the nonlinear function as the action of the nonlinear preconditioner M;
FM (x) = M(x, b) − x.
Right nonlinear preconditioning redefines the nonlinear function as the function on the action of the nonlinear
preconditioner;
F(M(x, b)) = b,
which can be interpreted as putting the preconditioner into “striking distance” of the solution by outer
acceleration.
In addition, basic patterns of solver composition are available with the SNESType SNESCOM-
POSITE. This allows for two or more SNES instances to be combined additively or multiplica-
tively. By command line, a set of SNES types may be given by comma separated list ar-
gument to -snes_composite_sneses. There are additive (SNES_COMPOSITE_ADDITIVE),
additive with optimal damping (SNES_COMPOSITE_ADDITIVEOPTIMAL), and multiplicative
(SNES_COMPOSITE_MULTIPLICATIVE) variants which may be set with
SNESCompositeSetType(SNES,SNESCompositeType);
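For instance, a sketch of composing two solvers programmatically (the particular pairing is illustrative only):

#include <petscsnes.h>

static PetscErrorCode UseCompositeSolver(SNES snes)
{
  PetscFunctionBeginUser;
  PetscCall(SNESSetType(snes, SNESCOMPOSITE));
  PetscCall(SNESCompositeSetType(snes, SNES_COMPOSITE_MULTIPLICATIVE));
  PetscCall(SNESCompositeAddSNES(snes, SNESNEWTONLS)); /* first stage: Newton line search */
  PetscCall(SNESCompositeAddSNES(snes, SNESNGMRES));   /* second stage: nonlinear GMRES */
  PetscFunctionReturn(PETSC_SUCCESS);
}

The equivalent command-line request would combine -snes_type composite with -snes_composite_sneses newtonls,ngmres.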
The TS library provides a framework for the scalable solution of ODEs and DAEs arising from the discretiza-
tion of time-dependent PDEs.
Simple Example: Consider the PDE
ut = uxx
discretized with centered finite differences in space yielding the semi-discrete equation
(ui)t = (ui+1 − 2ui + ui−1) / h²,
which in vector form is
ut = Ãu;
or with a piecewise linear finite element approximation in space, u(x, t) ≐ Σi ξi(t) ϕi(x), yielding the semi-
discrete equation
Bξ ′ (t) = Aξ(t)
in which
u_i^n ≐ ξ_i(t_n) ≐ u(x_i, t_n),    ξ′(t_{n+1}) ≐ (u_i^{n+1} − u_i^n) / dt_n,
A is the stiffness matrix, and B is the identity for finite differences or the mass matrix for the finite element
method.
The PETSc interface for solving time dependent problems assumes the problem is written in the form
F(t, u, u̇) = G(t, u).
In general, this is a differential algebraic equation (DAE)4 . For ODE with nontrivial mass matrices such as
arise in FEM, the implicit/DAE interface significantly reduces overhead to prepare the system for algebraic
solvers (SNES/KSP) by having the user assemble the correctly shifted matrix. Therefore this interface is also
useful for ODE systems.
To solve an ODE or DAE one uses:
• Function F (t, u, u̇) is provided with the function
TSSetIFunction(TS ts,Vec R,PetscErrorCode (*f)(TS,PetscReal,Vec,Vec,Vec,void*),void *funP);
4 If the matrix Fu̇ (t) = ∂F /∂ u̇ is nonsingular then it is an ODE and can be transformed to the standard explicit form,
although this transformation may not lead to efficient algorithms.
The vector R is an optional location to store the residual. The arguments to the function f() are the
timestep context, current time, input state u, input time derivative u̇, and the (optional) user-provided
context funP. If F (t, u, u̇) = u̇ then one need not call this function. (A minimal sketch of these callbacks is given after this list.)
• Function G(t, u), if it is nonzero, is provided with the function
TSSetRHSFunction(TS ts,Vec R,PetscErrorCode (*f)(TS,PetscReal,Vec,Vec,void*),void *funP);
• Jacobian σFu̇ + Fu
If using a fully implicit or semi-implicit (IMEX) method, one also can provide an appropriate (approximate) Jacobian matrix of F() with the function
TSSetIJacobian(TS ts,Mat Amat,Mat Pmat,PetscErrorCode (*fjac)(TS,PetscReal,Vec,Vec,PetscReal,Mat,Mat,void*),void *jacP);
The arguments for the function fjac() are the timestep context, current time, input state u, input
derivative u̇, input shift σ, matrix A, preconditioning matrix B, and the (optional) user-provided
context jacP.
The Jacobian needed for the nonlinear system is, by the chain rule,
dF/du_n = (∂F/∂u̇)|_{u_n} · (∂u̇/∂u)|_{u_n} + (∂F/∂u)|_{u_n}.
For any ODE integration method the approximation of u̇ is linear in u_n, hence (∂u̇/∂u)|_{u_n} = σ, where the
shift σ depends on the ODE integrator and time step but not on the function being integrated. Thus
dF/du_n = σ Fu̇(t_n, u_n, u̇_n) + Fu(t_n, u_n, u̇_n).
This explains why the user-provided Jacobian must have the given form for all integration methods. An
equivalent way to derive the formula is to note that, for example, the backward Euler method with
F(t, u, u̇) = u̇ − f(t, u) approximates u̇_n by (u_n − u_{n−1})/δt, so that
dF/du_n = (1/δt) Fu̇ + Fu(t_n, u_n, u̇_n).
But Fu̇ = 1 in this special case, resulting in the expected Jacobian I/δt − fu(t, u_n).
• Jacobian Gu
If using a fully implicit method and the function G() is provided, one also can provide an appropriate
(approximate) Jacobian matrix of G() with the function
TSSetRHSJacobian(TS ts,Mat Amat,Mat Pmat,PetscErrorCode (*fjac)(TS,PetscReal,Vec,Mat,Mat,void*),void *jacP);
The arguments for the function fjac() are the timestep context, current time, input state u, matrix
A, preconditioning matrix B, and the (optional) user-provided context jacP.
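To make the shifted-Jacobian convention concrete, here is a rough sketch of the callbacks (names illustrative) for the scalar ODE u̇ = −u written in implicit form F(t, u, u̇) = u̇ + u, for which G is zero:

#include <petscts.h>

/* F(t,u,udot) = udot + u for the scalar decay ODE udot = -u. */
static PetscErrorCode IFunction(TS ts, PetscReal t, Vec U, Vec Udot, Vec F, void *ctx)
{
  const PetscScalar *u, *udot;
  PetscScalar       *f;

  PetscFunctionBeginUser;
  PetscCall(VecGetArrayRead(U, &u));
  PetscCall(VecGetArrayRead(Udot, &udot));
  PetscCall(VecGetArray(F, &f));
  f[0] = udot[0] + u[0];
  PetscCall(VecRestoreArrayRead(U, &u));
  PetscCall(VecRestoreArrayRead(Udot, &udot));
  PetscCall(VecRestoreArray(F, &f));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* dF/du_n = sigma*F_udot + F_u = sigma*1 + 1; the shift sigma is supplied by TS. */
static PetscErrorCode IJacobian(TS ts, PetscReal t, Vec U, Vec Udot, PetscReal sigma, Mat A, Mat B, void *ctx)
{
  PetscInt    row = 0;
  PetscScalar v   = sigma + 1.0;

  PetscFunctionBeginUser;
  PetscCall(MatSetValues(B, 1, &row, 1, &row, &v, INSERT_VALUES));
  PetscCall(MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY));
  if (A != B) {
    PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
    PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  }
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* Registered with TSSetIFunction(ts, r, IFunction, NULL) and
   TSSetIJacobian(ts, J, J, IJacobian, NULL). */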
Providing appropriate F () and G() for your problem allows for the easy runtime switching between explicit,
semi-implicit (IMEX), and fully implicit methods.
The solution method is set with
TSSetType(TS ts,TSType type);
Currently supported types are TSEULER, TSRK (Runge-Kutta), TSBEULER, TSCN (Crank-Nicolson),
TSTHETA, TSGLLE (generalized linear), TSPSEUDO, and TSSUNDIALS (only if the Sundials package is
installed), or the command line option
-ts_type euler,rk,beuler,cn,theta,gl,pseudo,sundials,eimex,arkimex,rosw.
Here, “current” refers to the timestep being used to attempt to promote the solution from u^n to u^{n+1}.
One sets the total number of timesteps to run or the total time to run (whichever comes first) with the commands
TSSetMaxSteps(TS ts,PetscInt maxsteps);
TSSetMaxTime(TS ts,PetscReal maxtime);
One then performs the time integration with
TSSolve(TS ts,Vec u);
The solve call implicitly sets up the timestep context; this can be done explicitly with
TSSetUp(TS ts);
One destroys the context with
TSDestroy(TS *ts);
and can advance a single time step with
TSStep(TS ts);
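Putting these calls together, a minimal, self-contained sketch (not one of the TS tutorials; all choices are illustrative) that integrates the scalar decay ODE u̇ = −u from t = 0 to t = 1:

#include <petscts.h>

/* G(t,u) = -u for the scalar decay ODE u_t = -u. */
static PetscErrorCode RHSFunction(TS ts, PetscReal t, Vec U, Vec G, void *ctx)
{
  const PetscScalar *u;
  PetscScalar       *g;

  PetscFunctionBeginUser;
  PetscCall(VecGetArrayRead(U, &u));
  PetscCall(VecGetArray(G, &g));
  g[0] = -u[0];
  PetscCall(VecRestoreArrayRead(U, &u));
  PetscCall(VecRestoreArray(G, &g));
  PetscFunctionReturn(PETSC_SUCCESS);
}

int main(int argc, char **argv)
{
  TS  ts;
  Vec u;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, 1, &u));
  PetscCall(VecSet(u, 1.0)); /* initial condition u(0) = 1 */
  PetscCall(TSCreate(PETSC_COMM_SELF, &ts));
  PetscCall(TSSetRHSFunction(ts, NULL, RHSFunction, NULL));
  PetscCall(TSSetType(ts, TSRK));  /* explicit Runge-Kutta */
  PetscCall(TSSetTimeStep(ts, 0.01));
  PetscCall(TSSetMaxTime(ts, 1.0));
  PetscCall(TSSetExactFinalTime(ts, TS_EXACTFINALTIME_STEPOVER));
  PetscCall(TSSetFromOptions(ts)); /* allow -ts_type, -ts_monitor, ... */
  PetscCall(TSSolve(ts, u));
  PetscCall(TSDestroy(&ts));
  PetscCall(VecDestroy(&u));
  PetscCall(PetscFinalize());
  return 0;
}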
You can find a discussion of DAEs in [AP98] or Scholarpedia. In PETSc, TS deals with the semi-discrete
form of the equations, so that space has already been discretized. If the DAE depends explicitly on the
coordinate x, then this will just appear as any other data for the equation, not as an explicit argument.
Thus we have
F (t, u, u̇) = 0
In this form, only fully implicit solvers are appropriate. However, specialized solvers for restricted forms of
DAE are supported by PETSc. Below we consider an ODE that is augmented with algebraic constraints
on some of the variables:
u̇ = f(t, u, z)
0 = h(t, u).
Differentiating the constraint in time gives
0 = ḣ(t, u) = (dh/du) u̇ + ∂h/∂t = (dh/du) f(t, u, z) + ∂h/∂t.
If the Jacobian (dh/du)(df/dz) is non-singular, then we have precisely a semi-explicit index-1 DAE, and we can once
again use the PETSc IMEX tools to solve it. A common example of an index-2 DAE is the incompressible
Navier-Stokes equations, since the continuity equation ∇ · u = 0 does not involve the pressure. Using PETSc
IMEX with the above conversion then corresponds to the Segregated Runge-Kutta method applied to this
equation [ColomesB16].
For “stiff” problems or those with multiple time scales F () will be treated implicitly using a method suitable
for stiff problems and G() will be treated explicitly when using an IMEX method like TSARKIMEX. F ()
is typically linear or weakly nonlinear while G() may have very strong nonlinearities such as arise in non-
oscillatory methods for hyperbolic PDE. The user provides three pieces of information, the APIs for which
have been described above.
• “Slow” part G(t, u) using TSSetRHSFunction().
• “Stiff” part F (t, u, u̇) using TSSetIFunction().
• Jacobian Fu + σFu̇ using TSSetIJacobian().
The user needs to set TSSetEquationType() to TS_EQ_IMPLICIT or higher if the problem is implicit, e.g.,
F (t, u, u̇) = M u̇ − f (t, u), where M is not the identity matrix; that is, when:
• the problem is an implicit ODE (defined implicitly through TSSetIFunction()) or
• a DAE is being solved.
An IMEX problem representation can be made implicit by setting TSARKIMEXSetFullyImplicit().
In PETSc, DAEs and ODEs are formulated as F (t, u, u̇) = G(t, u), where F () is meant to be integrated
implicitly and G() explicitly. An IMEX formulation such as M u̇ = f (t, u) + g(t, u) requires the user to
provide M −1 g(t, u) or solve g(t, u) − M x = 0 in place of G(t, u). General cases such as F (t, u, u̇) = G(t, u)
are not amenable to IMEX Runge-Kutta, but can be solved by using fully implicit methods. Some use-case
examples for TSARKIMEX are listed in Table 2.12 and a list of methods with a summary of their properties
is given in IMEX Runge-Kutta schemes.
Table 2.13 lists of the currently available IMEX Runge-Kutta schemes. For each method, it gives the
-ts_arkimex_type name, the reference, the total number of stages/implicit stages, the order/stage-order,
the implicit stability properties (IM), stiff accuracy (SA), the existence of an embedded scheme, and dense
output (DO).
ROSW are linearized implicit Runge-Kutta methods known as Rosenbrock W-methods. They can accom-
modate inexact Jacobian matrices in their formulation. A series of methods available in PETSc is listed
in Table 2.14 below. For each method, it gives the reference, the total number of stages and implicit stages,
the scheme order and stage order, the implicit stability properties (IM), stiff accuracy (SA), the existence
of an embedded scheme, dense output (DO), the capacity to use inexact Jacobian matrices (-W), and high
order integration of differential algebraic equations (PDAE).
In this section, we describe explicit and implicit time stepping methods with global error estimation that
are introduced in [Con16]. The solution vector for a GLEE method is either [y, ỹ] or [y,ε], where y is the
solution, ỹ is the “auxiliary solution,” and ε is the error. The working vector that TSGLEE uses is Y = [y,ỹ],
or [y,ε]. A GLEE method is defined by
• (p, r, s): (order, steps, and stages),
• γ: factor representing the global error ratio,
• A, U, B, V : method coefficients,
• S: starting method to compute the working vector from the solution (say at the beginning of time
integration) so that Y = Sy,
• F : finalizing method to compute the solution from the working vector,y = F Y .
• Fembed : coefficients for computing the auxiliary solution ỹ from the working vector (ỹ = Fembed Y ),
• Ferror : coefficients to compute the estimated error vector from the working vector (ε = Ferror Y ).
• Serror : coefficients to initialize the auxiliary solution (ỹ or ε) from a specified error vector (ε). It is
currently implemented only for r = 2. We have yaux = Serror [0] ∗ ε + Serror [1] ∗ y, where yaux is the
2nd component of the working vector Y .
The methods can be described in two mathematically equivalent forms: propagate two components (“y ỹ
form”) and propagating the solution and its estimated error (“yε form”). The two forms are not explicitly
specified in TSGLEE; rather, the specific values of B, U, S, F, Fembed , and Ferror characterize whether the
method is in y ỹ or yε form.
The API used by this TS method includes:
• TSGetSolutionComponents: Get all the solution components of the working vector
ierr = TSGetSolutionComponents(TS,int*,Vec*)
Call with NULL as the last argument to get the total number of components in the working vector Y
(this is r (not r − 1)), then call to get the i-th solution component; see the sketch after this list.
• TSGetAuxSolution: Returns the auxiliary solution ỹ (computed as Fembed Y )
ierr = TSGetAuxSolution(TS,Vec*)
• TSSetTimeError: Initializes the auxiliary solution (ỹ or ε) for a specified initial error.
ierr = TSSetTimeError(TS,Vec)
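A minimal usage sketch of these calls (a hedged example; ts is an already configured TSGLEE solver and the
variable names are placeholders):

PetscInt r, i = 0;
Vec      y0, yaux;
TSGetSolutionComponents(ts,&r,NULL);   /* number of components in the working vector Y */
TSGetSolutionComponents(ts,&i,&y0);    /* the 0th solution component                    */
TSGetAuxSolution(ts,&yaux);            /* auxiliary solution ytilde = Fembed Y          */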
The local error is estimated as ε(n + 1) − ε(n). This is to be used in the error control. The error in y ỹ GLEE
is ε(n) = 1/(1 − γ) ∗ (ỹ(n) − y(n)).
Note that y and ỹ are reported to TSAdapt basic (TSADAPTBASIC), and thus it computes the local error
as εloc = (ỹ − y). However, the actual local error is εloc = εn+1 − εn = 1/(1 − γ) ∗ [(ỹ − y)n+1 − (ỹ − y)n ].
Table 2.15 lists currently available GL schemes with global error estimation [Con16].
To use a fully implicit method like TSTHETA or TSGL, either provide the Jacobian of F () (and G() if G()
is provided) or use a DM that provides a coloring so the Jacobian can be computed efficiently via finite
differences.
The explicit Euler and Runge-Kutta methods require the ODE be in the form
u̇ = G(u, t).
The user can either call TSSetRHSFunction() and/or they can call TSSetIFunction() (so long as
the function provided to TSSetIFunction() is equivalent to u̇ + F̃ (t, u)) but the Jacobians need not be
provided.5
The Explicit Runge-Kutta timestepper with variable timesteps is an implementation of the standard Runge-
Kutta with an embedded method. The error in each timestep is calculated using the solutions from the
Runge-Kutta method and its embedded method (the 2-norm of the difference is used). The default method
is the 3rd-order Bogacki-Shampine method with a 2nd-order embedded method (TSRK3BS). Other available
methods are the 5th-order Fehlberg RK scheme with a 4th-order embedded method (TSRK5F), the 5th-
order Dormand-Prince RK scheme with a 4th-order embedded method (TSRK5DP), the 5th-order Bogacki-
Shampine RK scheme with a 4th-order embedded method (TSRK5BS), and the 6th-, 7th-, and 8th-order robust
Verner RK schemes with a 5th-, 6th-, and 7th-order embedded method, respectively (TSRK6VR, TSRK7VR,
TSRK8VR). Variable timesteps cannot be used with RK schemes that do not have an embedded method
(TSRK1FE - 1st-order, 1-stage forward Euler, TSRK2A - 2nd-order, 2-stage RK scheme, TSRK3 - 3rd-order,
3-stage RK scheme, TSRK4 - 4-th order, 4-stage RK scheme).
• u̇ = Au, where A is a constant matrix. Use
TSSetProblemType(ts,TS_LINEAR);
TSSetRHSFunction(ts,NULL,TSComputeRHSFunctionLinear,NULL);
TSSetRHSJacobian(ts,A,A,TSComputeRHSJacobianConstant,NULL);
or
TSSetProblemType(ts,TS_LINEAR);
TSSetIFunction(ts,NULL,TSComputeIFunctionLinear,NULL);
TSSetIJacobian(ts,A,A,TSComputeIJacobianConstant,NULL);
• u̇ = A(t)u. Use
TSSetProblemType(ts,TS_LINEAR);
TSSetRHSFunction(ts,NULL,TSComputeRHSFunctionLinear,NULL);
TSSetRHSJacobian(ts,A,A,YourComputeRHSJacobian, &appctx);
or
TSSetProblemType(ts,TS_LINEAR);
TSSetIFunction(ts,NULL,TSComputeIFunctionLinear,NULL);
TSSetIJacobian(ts,A,A,YourComputeIJacobian, &appctx);
5 PETSc will automatically translate the function provided to the appropriate form.
Most of the time stepping methods available in PETSc have an error estimation and error control mech-
anism. This mechanism is implemented by changing the step size in order to maintain user-specified abso-
lute and relative tolerances. The PETSc object responsible for error control is TSAdapt. The available
TSAdapt types are listed in the following table.
When using TSADAPTBASIC (the default), the user typically provides a desired absolute TolA or a relative
TolR error tolerance by invoking TSSetTolerances() or at the command line with options -ts_atol
and -ts_rtol. The error estimate is based on the local truncation error, so for every step the algorithm
verifies that the estimated local truncation error satisfies the tolerances provided by the user and computes
a new step size to be taken. For multistage methods, the local truncation error is obtained by comparing the
solution y to a lower-order p̂ = p − 1 approximation, ŷ, where p is the order of the method and p̂ the order
of ŷ.
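For example, scalar tolerances can be set as follows (a hedged sketch; the values are arbitrary and the Vec
arguments, which allow per-component tolerances, may be NULL):

TSSetTolerances(ts,1.e-6,NULL,1.e-4,NULL);   /* atol, per-component atol, rtol, per-component rtol */

or, equivalently, at the command line with -ts_atol 1.e-6 -ts_rtol 1.e-4.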
The adaptive controller at step n computes a tolerance level
where the errors are computed componentwise, m is the dimension of y and -ts_adapt_wnormtype is 2
(default). If -ts_adapt_wnormtype is infinity (max norm), then
where αmin =-ts_adapt_clip[0] and αmax =-ts_adapt_clip[1] keep the change in ∆t to within a
certain factor, and β < 1 is chosen through -ts_adapt_safety so that there is some margin to which the
tolerances are satisfied and so that the probability of rejection is decreased.
This adaptive controller works in the following way. After completing step k, if wltek+1 ≤ 1.0, then the step
is accepted and the next step is modified according to (2.5); otherwise, the step is rejected and retaken
with the step length computed in (2.5).
TSADAPTGLEE is an extension of the basic adaptor to treat TolA and TolR as separate criteria. It can also
control global errors if the integrator (e.g., TSGLEE) provides this information.
For problems that involve discontinuous right hand sides, one can set an “event” function g(t, u) for PETSc
to detect and locate the times of discontinuities (zeros of g(t, u)). Events can be defined through the event
monitoring routine
TSSetEventHandler(TS ts,PetscInt nevents,PetscInt direction[],PetscBool terminate[],
    PetscErrorCode (*eventhandler)(TS,PetscReal,Vec,PetscScalar[],void* eventP),
    PetscErrorCode (*postevent)(TS,PetscInt,PetscInt[],PetscReal,Vec,PetscBool,void* eventP),void *eventP);
Here, nevents denotes the number of events, direction sets the type of zero crossing to be detected for
an event (+1 for positive zero-crossing, -1 for negative zero-crossing, and 0 for both), terminate conveys
whether the time-stepping should continue or halt when an event is located, eventhandler is a user-
defined routine that specifies the event description, and postevent is an optional user-defined routine to take
specific actions following an event.
The arguments to eventhandler() are the timestep context, current time, input state u, array of event
function value, and the (optional) user-provided context eventP.
The arguments to the postevent() routine are the timestep context, the number of events that occurred, the indices
of the events that occurred, the current time, the input state u, a boolean flag indicating forward solve (1) or adjoint solve (0),
and the (optional) user-provided context eventP.
The event monitoring functionality is only available with PETSc’s implicit time-stepping solvers TSTHETA,
TSARKIMEX, and TSROSW.
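As an illustration, a hedged sketch of registering a single event that fires when the first solution component
crosses zero follows; EventFunction and appctx are placeholder names, and no postevent routine is supplied.

PetscErrorCode EventFunction(TS ts,PetscReal t,Vec U,PetscScalar fvalue[],void *ctx)
{
  const PetscScalar *u;
  VecGetArrayRead(U,&u);
  fvalue[0] = u[0];                  /* the event is located where this value crosses zero */
  VecRestoreArrayRead(U,&u);
  return 0;
}
...
PetscInt  direction[1] = {0};        /* detect both positive and negative crossings */
PetscBool terminate[1] = {PETSC_FALSE};
TSSetEventHandler(ts,1,direction,terminate,EventFunction,NULL,&appctx);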
Discretized finite element problems often have the form M u̇ = G(t, u) where M is the mass matrix. Such
problems can be solved using DMTSSetIFunction() with implicit integrators. When M is nonsingu-
lar (i.e., the problem is an ODE, not a DAE), explicit integrators can be applied to u̇ = M −1 G(t, u) or
u̇ = M̂ −1 G(t, u), where M̂ is the lumped mass matrix. While the true mass matrix generally has a dense in-
verse and thus must be solved iteratively, the lumped mass matrix is diagonal (e.g., computed via collocated
quadrature or row sums of M ). To have PETSc create and apply a (lumped) mass matrix automatically,
first use DMTSSetRHSFunction() to specify G and set a PetscFE using DMAddField() and DMCreateDS(),
then call either DMTSCreateRHSMassMatrix() or DMTSCreateRHSMassMatrixLumped() to automatically
create the mass matrix and a KSP that will be used to apply M −1 . This KSP can be customized using the
"mass_" prefix.
The TS library provides a framework based on discrete adjoint models for sensitivity analysis for ODEs and
DAEs. The ODE/DAE solution process (henceforth called the forward run) can be obtained by using either
explicit or implicit solvers in TS, depending on the problem properties. Currently supported method types
are TSRK (Runge-Kutta) explicit methods and TSTHETA implicit methods, which include TSBEULER and
TSCN.
∂Ψi /∂y0 = λi
and
∂Ψi /∂p = µi + λi (∂y0 /∂p).
To perform the discrete adjoint sensitivity analysis one first sets up the TS object for a regular forward run
but with one extra function call
TSSetSaveTrajectory(TS ts),
and, before the adjoint run, one sets the gradient vectors (the adjoint variables) with
TSSetCostGradients(TS ts,PetscInt numcost,Vec *lambda,Vec *mu),
where numcost denotes ncost . If F () is a function of p one needs to also provide the Jacobian −Fp with
or
and provide the ODE RHS function (which evaluates the integrand r) with
Similar to the settings for the original ODE, Jacobians of the integrand can be provided with
where drdyf = dr/dy, drdpf = dr/dp. Since the integral term is additive to the cost function, its gradient
information will be included in λ and µ.
Lastly, one starts the backward run by calling
TSAdjointSolve(TS ts).
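Putting these calls together, a minimal sketch of the forward-plus-adjoint workflow might look as follows;
ts, u, lambda, and mu are user-created objects, and the terminal values of lambda[0] and mu[0] must be set
by the user before the adjoint run.

TSSetSaveTrajectory(ts);              /* store the trajectory needed by the adjoint run          */
TSSolve(ts,u);                        /* forward run                                             */
TSSetCostGradients(ts,1,lambda,mu);   /* terminal gradients of the cost w.r.t. y(tF) and p       */
TSAdjointSolve(ts);                   /* backward run; lambda and mu now hold the sensitivities  */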
equations and inequality constraints for bounds on the trajectory. The governing ODE system is
ϕ′ = ωB (ω − ωS )
(2H/ωS ) ω′ = pm − pmax sin(ϕ) − D(ω − ωS ),    t0 ≤ t ≤ tF ,
pmax is a positive number when the system operates normally. At an event such as fault incidence/removal,
pmax will change to 0 temporarily and back to the original value after the fault is fixed. The objective is
to maximize pm subject to the above ODE constraints and ϕ < ϕS during all times. To accommodate the
inequality constraint, we want to compute the sensitivity of the cost function
Ψ(pm , ϕ) = −pm + c ∫_{t0}^{tF} (max(0, ϕ − ϕS ))² dt
In the adjoint run, λ and µ are initialized as [0 0]T and [−1] at the final time tF . After TSAdjointSolve(),
the sensitivity of the cost function w.r.t. initial conditions is given by the sensitivity variable λ (at time t0 )
directly, and the sensitivity of the cost function w.r.t. the parameter pm can be computed (by users) as
dΨ/dpm = µ(t0 ) + λ(t0 ) · d[ϕ(t0 ) ω(t0 )]ᵀ /dpm .
For explicit methods, where one does not need to provide the Jacobian Fu for the forward solve, one still
needs it for the backward solve and thus must call TSSetRHSJacobian().
Examples include:
• discrete adjoint sensitivity using explicit and implicit time stepping methods for an ODE problem TS
Tutorial ex20adj,
• an optimization problem using the discrete adjoint models of the ERK (for nonstiff ODEs) and the
Theta methods (for stiff DAEs) TS Tutorial ex20opt_ic and TS Tutorial ex20opt_p,
• an ODE-constrained optimization using the discrete adjoint models of the Theta methods for cost
function with an integral term TS Power Grid Tutorial ex3opt,
• discrete adjoint sensitivity using the Crank-Nicolson methods for DAEs with discontinuities TS Power
Grid Stability Tutorial ex9busadj,
• a DAE-constrained optimization problem using the discrete adjoint models of the Crank-Nicolson
methods for cost function with an integral term TS Power Grid Tutorial ex9busopt,
• discrete adjoint sensitivity using the Crank-Nicolson methods for a PDE problem TS Advection-
Diffusion-Reaction Tutorial ex5adj.
Checkpointing
The discrete adjoint model requires the states (and stage values in the context of multistage timestepping
methods) to evaluate the Jacobian matrices during the adjoint (backward) run. By default, PETSc stores
the whole trajectory to disk as binary files, each of which contains the information for a single time step
including state, time, and stage values (optional). One can also make PETSc store the trajectory to memory
with the option -ts_trajectory_type memory. However, there might not be sufficient memory capacity
especially for large-scale problems and long-time integration.
A so-called checkpointing scheme is needed to solve this problem. The scheme stores checkpoints at selective
time steps and recomputes the missing information. The revolve library is used by PETSc TSTrajectory
to generate an optimal checkpointing schedule that minimizes the recomputations given a limited
number of available checkpoints. One can specify the number of available checkpoints with the option
-ts_trajectory_max_cps_ram [maximum number of checkpoints in RAM]. Note that one
checkpoint corresponds to one time step.
The revolve library also provides an optimal multistage checkpointing scheme that uses both
RAM and disk for storage. This scheme is automatically chosen if one uses both the option
-ts_trajectory_max_cps_ram [maximum number of checkpoints in RAM] and the option
-ts_trajectory_max_cps_disk [maximum number of checkpoints on disk].
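For example, a run that allows at most 30 checkpoints in RAM and 500 checkpoints on disk might be launched
as follows (the executable name is hypothetical):

./app -ts_trajectory_max_cps_ram 30 -ts_trajectory_max_cps_disk 500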
Some other useful options are listed below.
• -ts_trajectory_view prints the total number of recomputations,
• -ts_monitor and -ts_adjoint_monitor allow users to monitor the progress of the adjoint work
flow,
• -ts_trajectory_type visualization may be used to save the whole trajectory for visualiza-
tion. It stores the solution and the time, but no stage values. The binary files generated can be read into
MATLAB via the script $PETSC_DIR/share/petsc/matlab/PetscReadBinaryTrajectory.m.
Sundials is a parallel ODE solver developed by Hindmarsh et al. at LLNL. The TS library provides an
interface to use the CVODE component of Sundials directly from PETSc. (To configure PETSc to use
Sundials, see the installation guide, installation/index.htm.)
To use the Sundials integrators, call TSSetType(ts, TSSUNDIALS) or use the command line option -ts_type sundials.
The user can then directly set preconditioner options; alternatively, the usual runtime options can be em-
ployed via -pc_xxx.
Finally, one can set the Sundials tolerances via
TSSundialsSetTolerance(TS ts,double abs,double rel);
where abs denotes the absolute tolerance and rel the relative tolerance.
Other PETSc-Sundials options include
TSSundialsSetGramSchmidtType(TS ts,TSSundialsGramSchmidtType type);
where type is either SUNDIALS_MODIFIED_GS or SUNDIALS_UNMODIFIED_GS. This may be set via the
options database with -ts_sundials_gramschmidt_type <modified,unmodified>.
The routine
TSSundialsSetMaxl(TS ts,PetscInt restart);
sets the number of vectors in the Krylov subspace used by GMRES. This may be set in the options database
with -ts_sundials_maxl maxl.
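A hedged sketch combining these calls (the numerical values are arbitrary):

TSSetType(ts,TSSUNDIALS);
TSSundialsSetTolerance(ts,1.e-8,1.e-6);                 /* absolute and relative tolerances  */
TSSundialsSetGramSchmidtType(ts,SUNDIALS_MODIFIED_GS);
TSSundialsSetMaxl(ts,30);                               /* size of the GMRES Krylov subspace */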
TChem6 is a package originally developed at Sandia National Laboratory that can read in CHEMKIN7
data files and compute the right hand side function and its Jacobian for a reaction ODE system. To
utilize PETSc’s ODE solvers for these systems, first install PETSc with the additional configure option
--download-tchem. We currently provide two examples of its use; one for single cell reaction and one for
an “artificial” one dimensional problem with periodic boundary conditions and diffusion of all species. The
self-explanatory examples are the TS tutorial extchem and the TS tutorial extchemfield.
Simple Example: TS provides a general code for performing pseudo timestepping with a variable timestep
at each physical node point. For example, instead of directly attacking the steady-state problem
G(u) = 0,
one can solve the associated pseudo-transient problem
ut = G(u).
Using the time differencing
ut ≈ (un+1 − un ) / dtn
with the backward Euler method, we obtain nonlinear equations at a series of pseudo-timesteps
(1/dtn ) B(un+1 − un ) = G(un+1 ).
6 bitbucket.org/jedbrown/tchem
7 en.wikipedia.org/wiki/CHEMKIN
For this problem the user must provide G(u), the time steps dtn and the left-hand-side matrix B (or
optionally, if the timestep is position independent and B is the identity matrix, a scalar timestep), as well
as optionally the Jacobian of G(u).
More generally, this can be applied to implicit ODE and DAE for which the transient form is
F (u, u̇) = 0.
• Provide the function G(u) with the command
TSSetRHSFunction(TS ts,Vec r,PetscErrorCode (*f)(TS,PetscReal,Vec,Vec,void*),void *fP);
The arguments to the function f() are the timestep context, the current time, the input for the
function, the output for the function, and the (optional) user-provided context variable fP.
• Provide the (approximate) Jacobian matrix of G(u) and a function to compute it at each Newton
iteration. This is done with the command
TSSetRHSJacobian(TS ts,Mat Amat, Mat Pmat,PetscErrorCode (*f)(TS,PetscReal,Vec,
,→Mat,Mat,void*),void *fP);
The arguments for the function f() are the timestep context, the current time, the location where the
Jacobian is to be computed, the (approximate) Jacobian matrix, an alternative approximate Jacobian
matrix used to construct the preconditioner, and the optional user-provided context, passed in as fP.
The user must provide the Jacobian as a matrix; thus, if using a matrix-free approach, one must create
a MATSHELL matrix.
In addition, the user must provide a routine that computes the pseudo-timestep. This is slightly different
depending on whether one is using a constant timestep over the entire grid or it varies with location.
• For location-independent pseudo-timestepping, one uses the routine
TSPseudoSetTimeStep(TS ts,PetscInt(*dt)(TS,PetscReal*,void*),void* dtctx);
The function dt is a user-provided function that computes the next pseudo-timestep. As a default one
can use TSPseudoTimeStepDefault(TS,PetscReal*,void*) for dt. This routine updates the
pseudo-timestep with one of two strategies: the default
dtn = dtincrement ∗ dtn−1 ∗ ||F (un−1 )|| / ||F (un )||
or, the alternative,
dtn = dtincrement ∗ dt0 ∗ ||F (u0 )|| / ||F (un )||
which can be set with the call
TSPseudoIncrementDtFromInitialDt(TS ts);
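A minimal sketch of a pseudo-timestepping setup follows; FormFunction, FormJacobian, J, u, and appctx are
placeholder names for user-provided callbacks and data.

TSCreate(PETSC_COMM_WORLD,&ts);
TSSetType(ts,TSPSEUDO);
TSSetRHSFunction(ts,NULL,FormFunction,&appctx);         /* evaluates G(u)                 */
TSSetRHSJacobian(ts,J,J,FormJacobian,&appctx);          /* (approximate) Jacobian of G(u) */
TSPseudoSetTimeStep(ts,TSPseudoTimeStepDefault,NULL);   /* default timestep strategy      */
TSSetFromOptions(ts);
TSSolve(ts,u);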
The Toolkit for Advanced Optimization (TAO) focuses on algorithms for the solution of large-scale op-
timization problems on high-performance architectures. Methods are available for unconstrained, bound-
constrained and generally constrained optimization, nonlinear least squares problems, variational inequalities
and complementarity constraints.
To help the user start using TAO immediately, we introduce here a simple uniprocessor example. Please
read TAO Algorithms for a more in-depth discussion on using the TAO solvers. The code presented below
minimizes the extended Rosenbrock function f : Rn → R defined by
f (x) = Σ_{i=0}^{m−1} [ α(x_{2i+1} − x_{2i}²)² + (1 − x_{2i})² ],
where n = 2m is the number of variables. Note that while we use the C language to introduce the TAO
software, the package is fully usable from C++ and Fortran. PETSc for Fortran Users discusses additional
issues concerning Fortran usage.
The code in the example contains many of the components needed to write most TAO programs and thus
is illustrative of the features present in complex optimization problems. Note that for display purposes
we have omitted some nonessential lines of code as well as the (essential) code required for the rou-
tine FormFunctionGradient, which evaluates the function and gradient, and the code for FormHes-
sian, which evaluates the Hessian matrix for Rosenbrock’s function. The complete code is available in
$TAO_DIR/src/unconstrained/tutorials/rosenbrock1.c. The following sections annotate the lines of code in
the example.
Listing: src/tao/unconstrained/tutorials/rosenbrock1.c
#include <petsctao.h>
typedef struct {
PetscInt n; /* dimension */
PetscReal alpha; /* condition parameter */
PetscBool chained;
} AppCtx;
} else {
PetscCall(PetscPrintf(PetscObjectComm((PetscObject)tao), "error between LMVM MatMult and MatSolve: %e\n", (double)mult_solve_dist));
}
PetscCall(VecDestroy(&in));
PetscCall(VecDestroy(&out));
PetscCall(VecDestroy(&out2));
}
PetscCall(TaoDestroy(&tao));
PetscCall(VecDestroy(&x));
PetscCall(MatDestroy(&H));
PetscCall(PetscFinalize());
return ierr;}
Many TAO applications will follow an ordered set of procedures for solving an optimization problem: The
user creates a Tao context and selects a default algorithm. Call-back routines as well as vector (Vec) and
matrix (Mat) data structures are then set. These call-back routines will be used for evaluating the objective
function, gradient, and perhaps the Hessian matrix. The user then invokes TAO to solve the optimization
problem and finally destroys the Tao context. A list of the necessary functions for performing these steps
using TAO is shown below.
Note that the solver algorithm selected through the function TaoSetType() can be overridden at runtime
by using an options database. Through this database, the user not only can select a minimization method
(e.g., limited-memory variable metric, conjugate gradient, Newton with line search or trust region) but
also can prescribe the convergence tolerance, set various monitoring routines, set iterative methods and
preconditioners for solving the linear systems, and so forth. See TAO Algorithms for more information on the
solver methods available in TAO.
Header File
#include <petsctao.h>
A TAO solver can be created by calling the
TaoCreate(MPI_Comm, Tao*);
routine. Much like creating PETSc vector and matrix objects, the first argument is an MPI communicator.
An MPI1 communicator indicates a collection of processors that will be used to evaluate the objective
function, compute constraints, and provide derivative information. When only one processor is being used,
the communicator PETSC_COMM_SELF can be used with no understanding of MPI. Even parallel users need
to be familiar with only the basic concepts of message passing and distributed-memory computing. Most
applications running TAO in parallel environments can employ the communicator PETSC_COMM_WORLD to
indicate all processes known to PETSc in a given run.
The routine
TaoSetType(Tao, TaoType);
can be used to set the algorithm TAO uses to solve the application. The various types of TAO solvers and the
flags that identify them will be discussed in the following sections. The solution method should be carefully
chosen depending on the problem being solved. Some solvers, for instance, are meant for problems with no
constraints, whereas other solvers acknowledge constraints in the problem and handle them accordingly. The
user must also be aware of the derivative information that is available. Some solvers require second-order
information, while other solvers require only gradient or function information. The command line option
-tao_type followed by a TAO method will override any method specified by the second argument. The
command line option -tao_type bqnls, for instance, will specify the limited-memory quasi-Newton line
search method for bound-constrained problems. Note that the TaoType variable is a string that requires
quotation marks in an application program, but quotation marks are not required at the command line.
Each TAO solver that has been created should also be destroyed by using the
TaoDestroy(Tao tao);
command. This routine frees the internal data structures used by the solver.
Command-line Options
Additional options for the TAO solver can be set from the command line by using the
TaoSetFromOptions(Tao)
routine. This command also provides information about runtime options when the user includes the -help
option on the command line.
In addition to common command line options shared by all TAO solvers, each TAO method also implements
its own specialized options. Please refer to the documentation for individual methods for more details.
1 For more on MPI and PETSc, see Running PETSc Programs.
Defining Variables
In all the optimization solvers, the application must provide a Vec object of appropriate dimension to
represent the variables. This vector will be cloned by the solvers to create additional work space within the
solver. If this vector is distributed over multiple processors, it should have a parallel distribution that allows
for efficient scaling, inner products, and function evaluations. This vector can be passed to the application
object by using the
TaoSetSolution(Tao, Vec);
routine. When using this routine, the application should initialize the vector with an approximate solution
of the optimization problem before calling the TAO solver. This vector will be used by the TAO solver to
store the solution. Elsewhere in the application, this solution vector can be retrieved from the application
object by using the
TaoGetSolution(Tao, Vec*);
routine. This routine takes the address of a Vec in the second argument and sets it to the solution vector
used in the application.
Users of TAO are required to provide routines that perform function evaluations. Depending on the solver
chosen, they may also have to write routines that evaluate the gradient vector and Hessian matrix.
Application Context
Writing a TAO application may require use of an application context. An application context is a structure
or object defined by an application developer, passed into a routine also written by the application developer,
and used within the routine to perform its stated task.
For example, a routine that evaluates an objective function may need parameters, work vectors, and other
information. This information, which may be specific to an application and necessary to evaluate the
objective, can be collected in a single structure and used as one of the arguments in the routine. The address
of this structure will be cast as type (void*) and passed to the routine in the final argument. Many
examples of these structures are included in the TAO distribution.
This technique offers several advantages. In particular, it allows for a uniform interface between TAO and
the applications. The fundamental information needed by TAO appears in the arguments of the routine,
while data specific to an application and its implementation is confined to an opaque pointer. The routines
can access information created outside the local scope without the use of global variables. The TAO solvers
and application objects will never access this structure, so the application developer has complete freedom
to define it. If no such structure is needed by the application, then a NULL pointer can be used.
TAO solvers that minimize an objective function require the application to evaluate the objective function.
Some solvers may also require the application to evaluate derivatives of the objective function. Routines
that perform these computations must be identified to the application object and must follow a strict calling
sequence.
Routines should follow the form
in order to evaluate an objective function f : Rn → R. The first argument is the TAO Solver object, the
second argument is the n-dimensional vector that identifies where the objective should be evaluated, and
the fourth argument is an application context. This routine should use the third argument to return the
objective value evaluated at the point specified by the vector in the second argument.
This routine, and the application context, should be passed to the application object by using the TaoSetObjective()
routine. The first argument in this routine is the TAO solver object, the second argument is a function
pointer to the routine that evaluates the objective, and the third argument is the pointer to an appropriate
application context. Although the final argument may point to anything, it must be cast as a (void*)
type. This pointer will be passed back to the developer in the fourth argument of the routine that evaluates
the objective. In this routine, the pointer can be cast back to the appropriate type. Examples of these
structures and their usage are provided in the distribution.
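As an illustration, a hedged sketch of an objective call-back and its registration follows; FormObjective is a
user-chosen name, tao and user are assumed to already exist, and the quadratic objective is purely illustrative.

PetscErrorCode FormObjective(Tao tao,Vec X,PetscReal *f,void *ptr)
{
  PetscScalar d;
  VecDot(X,X,&d);            /* toy objective f(x) = x^T x */
  *f = PetscRealPart(d);
  return 0;
}
...
TaoSetObjective(tao,FormObjective,&user);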
Many TAO solvers also require gradient information from the application. The gradient of the objective
function is specified in a similar manner. Routines that evaluate the gradient should have the calling
sequence
where the first argument is the TAO solver object, the second argument is the variable vector, the third
argument is the gradient vector, and the fourth argument is the user-defined application context. Only the
third argument in this routine is different from the arguments in the routine for evaluating the objective
function. The numbers in the gradient vector have no meaning when passed into this routine, but they
should represent the gradient of the objective at the specified point at the end of the routine. This routine,
and the user-defined pointer, can be passed to the application object by using the TaoSetGradient()
routine. In this routine, the first argument is the Tao object, the second argument is the optional vector
to hold the computed gradient, the third argument is the function pointer, and the fourth object is the
application context, cast to (void*).
Instead of evaluating the objective and its gradient in separate routines, TAO also allows the user to evaluate
the function and the gradient in the same routine. In fact, some solvers are more efficient when both function
and gradient information can be computed in the same routine. These routines should follow the form
where the first argument is the TAO solver and the second argument points to the input vector for use in
evaluating the function and gradient. The third argument should return the function value, while the fourth
argument should return the gradient vector. The fifth argument is a pointer to a user-defined context. This
context and the name of the routine should be set with the call TaoSetObjectiveAndGradient(),
where the arguments are the TAO application, the optional vector to be used to hold the computed gradient,
a function pointer, and a pointer to a user-defined context.
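A hedged sketch of such a combined routine and its registration follows; FormFunctionGradient is a user-chosen
name and the quadratic objective is again purely illustrative.

PetscErrorCode FormFunctionGradient(Tao tao,Vec X,PetscReal *f,Vec G,void *ptr)
{
  PetscScalar d;
  VecDot(X,X,&d);            /* f(x) = x^T x    */
  *f = PetscRealPart(d);
  VecCopy(X,G);              /* grad f(x) = 2 x */
  VecScale(G,2.0);
  return 0;
}
...
TaoSetObjectiveAndGradient(tao,NULL,FormFunctionGradient,&user);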
The TAO example problems demonstrate the use of these application contexts as well as specific instances
of function, gradient, and Hessian evaluation routines. All these routines should return the integer 0 after
successful completion and a nonzero integer if the function is undefined at that point or an error occurred.
Hessian Evaluation
Some optimization routines also require a Hessian matrix from the user. The routine that evaluates the
Hessian should have the form
where the first argument of this routine is a TAO solver object. The second argument is the point at which
the Hessian should be evaluated. The third argument is the Hessian matrix, and the sixth argument is
a user-defined context. Since the Hessian matrix is usually used in solving a system of linear equations,
a preconditioner for the matrix is often needed. The fourth argument is the matrix that will be used for
preconditioning the linear system; in most cases, this matrix will be the same as the Hessian matrix. The fifth
argument is the flag used to set the Hessian matrix and linear solver in the routine KSPSetOperators().
One can set the Hessian evaluation routine by calling the TaoSetHessian()
routine. The first argument is the TAO Solver object. The second and third arguments are, respectively, the
Mat object where the Hessian will be stored and the Mat object that will be used for the preconditioning
(they may be the same). The fourth argument is the function that evaluates the Hessian, and the fifth
argument is a pointer to a user-defined context, cast to (void*).
Finite Differences
Finite-difference approximations can be used to compute the gradient and the Hessian of an objective func-
tion. These approximations will slow the solve considerably and are recommended primarily for checking
the accuracy of hand-coded gradients and Hessians. These routines are
and
respectively. They can be set by using TaoSetGradient() and TaoSetHessian() or through the
options database with the options -tao_fdgrad and -tao_fd, respectively.
The efficiency of the finite-difference Hessian can be improved if the coloring of the matrix is known. If the
application programmer creates a PETSc MatFDColoring object, it can be applied to the finite-difference
approximation by setting the Hessian evaluation routine to
and using the MatFDColoring object as the last (void *) argument to TaoSetHessian().
One also can use finite-difference approximations to directly check the correctness of the gradient and/or
Hessian evaluation routines. This process can be initiated from the command line by using the special TAO
solver tao_fd_test together with the option -tao_test_gradient or -tao_test_hessian.
Matrix-Free Methods
TAO fully supports matrix-free methods. The matrices specified in the Hessian evaluation routine need not
be conventional matrices; instead, they can point to the data required to implement a particular matrix-free
method. The matrix-free variant is allowed only when the linear systems are solved by an iterative method in
combination with no preconditioning (PCNONE or -pc_type none), a user-provided preconditioner matrix,
or a user-provided preconditioner shell (PCSHELL). In other words, matrix-free methods cannot be used if a
direct solver is to be employed. Details about using matrix-free methods are provided in the User-Guide.
Constraints
Some optimization problems also impose constraints on the variables or intermediate application states. The
user defines these constraints through the appropriate TAO interface functions and call-back routines where
necessary.
Variable Bounds
The simplest type of constraint on an optimization problem puts lower or upper bounds on the variables.
Vectors that represent lower and upper bounds for each variable can be set with the TaoSetVariableBounds()
command. The first vector and second vector should contain the lower and upper bounds, respectively.
When no upper or lower bound exists for a variable, the bound may be set to PETSC_INFINITY or
PETSC_NINFINITY. After the two bound vectors have been set, they may be accessed with the command
TaoGetVariableBounds().
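For example, a hedged sketch of imposing x ≥ 0 with no upper bound (XL and XU are user-created vectors of
the same dimension as the solution vector):

VecSet(XL,0.0);               /* lower bound of zero on every variable */
VecSet(XU,PETSC_INFINITY);    /* no upper bound                        */
TaoSetVariableBounds(tao,XL,XU);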
Since not all solvers recognize the presence of bound constraints on variables, the user must be careful to
select a solver that acknowledges these bounds.
General Constraints
Some TAO algorithms also support general constraints as a linear or nonlinear function of the optimization
variables. These constraints can be imposed either as equalities or inequalities. TAO currently does not
make any distinctions between linear and nonlinear constraints, and implements them through the same
software interfaces.
In the equality constrained case, TAO assumes that the constraints are formulated as ce (x) = 0 and requires
the user to implement a call-back routine for evaluating ce (x) at a given vector of optimization variables.
As in the previous call-back routines, the first argument is the TAO solver object. The second and third
arguments are the vector of optimization variables (input) and vector of equality constraints (output),
respectively. The final argument is a pointer to the user-defined application context, cast into (void*).
Generally constrained TAO algorithms also require a second user call-back function to compute the constraint
Jacobian matrix ∇x ce (x),
where the first and last arguments are the TAO solver object and the application context pointer as before.
The second argument is the vector of optimization variables at which the computation takes place. The
third and fourth arguments are the constraint Jacobian and its pseudo-inverse (optional), respectively. The
pseudoinverse is optional, and if not available, the user can simply set it to the constraint Jacobian itself.
These call-back functions are then given to the TAO solver using the interface functions
TaoSetEqualityConstraintsRoutine() and TaoSetJacobianEqualityRoutine().
Inequality constraints are assumed to be formulated as ci (x) ≤ 0 and follow the same workflow as equality
constraints using the TaoSetInequalityConstraintsRoutine() and TaoSetJacobianInequalityRoutine()
interfaces.
Some TAO algorithms may adopt an alternative double-sided cl ≤ ci (x) ≤ cu formulation and require
the lower and upper bounds cl and cu to be set using the TaoSetInequalityBounds(Tao,Vec,Vec)
interface. Please refer to the documentation for each TAO algorithm for further details.
Solving
Once the application and solver have been set up, the solver can be invoked by calling
TaoSolve(Tao);
Convergence
Although TAO and its solvers set default parameters that are useful for many problems, the user may need
to modify these parameters in order to change the behavior and convergence of various algorithms.
One convergence criterion for most algorithms concerns the number of digits of accuracy needed in the
solution. In particular, the convergence test employed by TAO attempts to stop when the error in the
constraints is less than ϵcrtol and either
||g(X)|| ≤ ϵgatol ,
||g(X)||/|f (X)| ≤ ϵgrtol , or
||g(X)||/|g(X0 )| ≤ ϵgttol ,
where X is the current approximation to the true solution X ∗ and X0 is the initial guess. X ∗ is unknown, so
TAO estimates f (X)−f (X ∗ ) with either the square of the norm of the gradient or the duality gap. A relative
tolerance of ϵf rtol = 0.01 indicates that two significant digits are desired in the objective function. Each solver
sets its own convergence tolerances, but they can be changed by using the routine TaoSetTolerances().
Another set of convergence tolerances terminates the solver when the norm of the gradient function (or
Lagrangian function for bound-constrained problems) is sufficiently close to zero.
Other stopping criteria include a minimum trust-region radius or a maximum number of iterations. These
parameters can be set with the routines TaoSetTrustRegionTolerance() and TaoSetMaximumIterations().
Similarly, a maximum number of function evaluations can be set with the command
TaoSetMaximumFunctionEvaluations(). The corresponding command line options are -tao_max_it and -tao_max_funcs.
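For example (the values are arbitrary):

TaoSetTolerances(tao,1.e-8,1.e-6,0.0);      /* gatol, grtol, gttol */
TaoSetMaximumIterations(tao,500);
TaoSetMaximumFunctionEvaluations(tao,2000);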
Viewing Status
To see parameters and performance statistics for the solver, the routine
TaoView(Tao tao)
can be used. This routine will display to standard output the number of function evaluations needed by the
solver and other information specific to the solver. This same output can be produced by using the command
line option -tao_view.
The progress of the optimization solver can be monitored with the runtime option -tao_monitor. Although
monitoring routines can be customized, the default monitoring routine will print out several relevant statistics
to the screen.
The user also has access to information about the current solution. The current iteration number, objec-
tive function value, gradient norm, infeasibility norm, and step length can be retrieved with the
TaoGetSolutionStatus() command.
The last argument returns a code that indicates the reason that the solver terminated. Positive numbers
indicate that a solution has been found, while negative numbers indicate a failure. A list of reasons can be
found in the manual page for TaoGetConvergedReason().
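A hedged sketch of such a query (the output variable names are placeholders):

PetscInt           its;
PetscReal          f,gnorm,cnorm,xdiff;
TaoConvergedReason reason;
TaoGetSolutionStatus(tao,&its,&f,&gnorm,&cnorm,&xdiff,&reason);
if (reason <= 0) PetscPrintf(PETSC_COMM_WORLD,"TAO has not yet converged\n");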
Obtaining a Solution
After exiting the TaoSolve() function, the solution and the gradient can be recovered with the following
routines.
TaoGetSolution(Tao, Vec*);
TaoGetGradient(Tao, Vec*, NULL, NULL);
Note that the Vec returned by TaoGetSolution() will be the same vector passed to TaoSetSolu-
tion(). This information can be obtained during user-defined routines, such as a function evaluation or a
customized monitoring routine, or after the solver has terminated.
Certain special classes of problems solved with TAO utilize specialized code interfaces that are described
below per problem type.
PDE-constrained Optimization
min_{u,v} f (u, v)   subject to   g(u, v) = 0,
where the state variable u is the solution to the discretized partial differential equation defined by g and
parametrized by the design variable v, and f is an objective function. The Lagrange multipliers on the
constraint are denoted by y. This method is set by using the linearly constrained augmented Lagrangian
TAO solver tao_lcl.
We make two main assumptions when solving these problems: the objective function and PDE constraints
have been discretized so that we can treat the optimization problem as finite dimensional and ∇u g(u, v) is
invertible for all u and v.
Unlike other TAO solvers where the solution vector contains only the optimization variables, PDE-
constrained problems solved with tao_lcl combine the design and state variables together in a monolithic
solution vector xT = [uT , v T ]. Consequently, the user must provide index sets to separate the two,
where the first IS is a PETSc IndexSet containing the indices of the state variables and the second IS the
design variables.
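A hedged sketch, assuming the first nstate entries of the combined vector are the state variables and the
remaining ndesign entries are the design variables (the index sets and names are placeholders):

ISCreateStride(PETSC_COMM_WORLD,nstate,0,1,&is_state);         /* indices of u */
ISCreateStride(PETSC_COMM_WORLD,ndesign,nstate,1,&is_design);  /* indices of v */
TaoSetStateDesignIS(tao,is_state,is_design);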
PDE constraints have the general form g(x) = 0, where g : Rn → Rm . These constraints should be specified
in a routine, written by the user, that evaluates g(x). The routine that evaluates the constraint equations
should have the form
The first argument of this routine is a TAO solver object. The second argument is the variable vector at
which the constraint function should be evaluated. The third argument is the vector of function values g(x),
and the fourth argument is a pointer to a user-defined context. This routine and the user-defined context
should be set in the TAO solver with the TaoSetConstraintsRoutine()
command. In this function, the first argument is the TAO solver object, the second argument is a vector
in which to store the constraints, the third argument is a function pointer to the routine for evaluating the
constraints, and the fourth argument is a pointer to a user-defined context.
The Jacobian of g(x) is the matrix in Rm×n such that each column contains the partial derivatives of g(x)
with respect to one variable. The evaluation of the Jacobian of g should be performed by calling the
routines. In these functions, the first argument is the TAO solver object. The second argument is the
variable vector at which to evaluate the Jacobian matrix, the third argument is the Jacobian matrix, and
the last argument is a pointer to a user-defined context. The fourth and fifth arguments of the Jacobian
evaluation with respect to the state variables are for providing PETSc matrix objects for the preconditioner
and for applying the inverse of the state Jacobian, respectively. This inverse matrix may be PETSC_NULL,
in which case TAO will use a PETSc Krylov subspace solver to solve the state system. These evaluation
routines should be registered with TAO by using the
routines. The first argument is the TAO solver object, and the second argument is the matrix in which the
Jacobian information can be stored. For the state Jacobian, the third argument is the matrix that will be
used for preconditioning, and the fourth argument is an optional matrix for the inverse of the state Jacobian.
One can use PETSC_NULL for this inverse argument and let PETSc apply the inverse using a KSP method,
but faster results may be obtained by manipulating the structure of the Jacobian and providing an inverse.
The fifth argument is the function pointer, and the sixth argument is an optional user-defined context. Since
no solve is performed with the design Jacobian, there is no need to provide preconditioner or inverse matrices.
For nonlinear least squares applications, we are solving the optimization problem
min_x (1/2) ||r(x)||_2^2 .
For these problems, the objective function value should be computed as a vector of residuals, r(x), computed
with a user call-back function registered via TaoSetResidualRoutine(). If required by the algorithm, the
Jacobian of the residual, J = ∂r(x)/∂x, should be computed with
a user call-back function registered via TaoSetJacobianResidualRoutine().
Complementarity
Complementarity applications have equality constraints in the form of nonlinear equations C(X) = 0, where
C : Rn → Rm . These constraints should be specified in a routine written by the user with the form
that evaluates C(X). The first argument of this routine is a TAO Solver object. The second argument is the
variable vector X at which the constraint function should be evaluated. The third argument is the output
vector of function values C(X), and the fourth argument is a pointer to a user-defined context.
This routine and the user-defined context must be registered with TAO by using the TaoSetConstraintsRoutine()
command. In this command, the first argument is the TAO Solver object, the second argument is the vector in
which to store the function values, the third argument is the user-defined routine that evaluates C(X), and
the fourth argument is a pointer to a user-defined context that will be passed back to the user.
The Jacobian of the function is the matrix in Rm×n such that each column contains the partial derivatives
of C with respect to one variable. The evaluation of the Jacobian of C should be performed in a routine of
the form
In this function, the first argument is the TAO Solver object and the second argument is the variable vector at
which to evaluate the Jacobian matrix. The third argument is the Jacobian matrix, and the sixth argument
is a pointer to a user-defined context. Since the Jacobian matrix may be used in solving a system of linear
equations, a preconditioner for the matrix may be needed. The fourth argument is the matrix that will
be used for preconditioning the linear system; in most cases, this matrix will be the same as the Jacobian
matrix. The fifth argument is the flag used to set the Jacobian matrix and linear solver in the routine
KSPSetOperators().
This routine should be specified to TAO by using the
command. The first argument is the TAO Solver object; the second and third arguments are the Mat objects
in which the Jacobian will be stored and the Mat object that will be used for the preconditioning (they may
be the same), respectively. The fourth argument is the function pointer; and the fifth argument is an optional
user-defined context. The Jacobian matrix should be created in a way such that the product of it and the
variable vector can be stored in the constraint vector.
TAO includes a variety of optimization algorithms for several classes of problems (unconstrained, bound-
constrained, and PDE-constrained minimization, nonlinear least-squares, and complementarity). The TAO
algorithms for solving these problems are detailed in this section; a particular algorithm can be chosen by
using the TaoSetType() function or the command line argument -tao_type <name>. For those
interested in extending these algorithms or using new ones, please see Adding a Solver for more information.
Unconstrained Minimization
Unconstrained minimization is used to minimize a function of many variables without any constraints on
the variables, such as bounds. The methods available in TAO for solving these problems can be classified
according to the amount of derivative information required:
1. Function evaluation only – Nelder-Mead method (tao_nm)
2. Function and gradient evaluations – limited-memory, variable-metric method (tao_lmvm) and non-
linear conjugate gradient method (tao_cg)
3. Function, gradient, and Hessian evaluations – Newton Krylov methods: Newton line search (tao_nls),
Newton trust-region (tao_ntr), and Newton trust-region line-search (tao_ntl)
The best method to use depends on the particular problem being solved and the accuracy required in the
solution. If a Hessian evaluation routine is available, then the Newton line search and Newton trust-region
methods will likely perform best. When a Hessian evaluation routine is not available, then the limited-
memory, variable-metric method is likely to perform best. The Nelder-Mead method should be used only as
a last resort when no gradient information is available.
Each solver has a set of options associated with it that can be set with command line arguments. These
algorithms and the associated options are briefly discussed in this section.
Newton-Krylov Methods
TAO features three Newton-Krylov algorithms, separated by their globalization methods for unconstrained
optimization: line search (NLS), trust region (NTR), and trust region with a line search (NTL). They are
available via the TAO solvers TAONLS, TAONTR and TAONTL, respectively, or the -tao_type nls/ntr/ntl
flag.
The Newton line search method solves the symmetric system of equations
Hk dk = −gk
to obtain a step dk , where Hk is the Hessian of the objective function at xk and gk is the gradient of the
objective function at xk . For problems where the Hessian matrix is indefinite, the perturbed system of
equations
(Hk + ρk I) dk = −gk
is solved to obtain the direction, where ρk is a positive constant. If the direction computed is not a descent
direction, the (scaled) steepest descent direction is used instead. Having obtained the direction, a Moré-
Thuente line search is applied to obtain a step length, τk , that approximately solves the one-dimensional
optimization problem
min f (xk + τ dk ).
τ
The Newton line search method can be selected by using the TAO solver tao_nls. The options available
for this solver are listed in Table 2.17. For the best efficiency, function and gradient evaluations should be
performed simultaneously when using this algorithm.
The system of equations is approximately solved by applying the conjugate gradient method, Nash conjugate
gradient method, Steihaug-Toint conjugate gradient method, generalized Lanczos method, or an alternative
Krylov subspace method supplied by PETSc. The method used to solve the systems of equations is specified
with the command line argument -tao_nls_ksp_type <cg,nash,stcg,gltr,gmres,...>; stcg is
the default. See the PETSc manual for further information on changing the behavior of the linear system
solvers.
A good preconditioner reduces the number of iterations required to solve the linear system of equations. For
the conjugate gradient methods and generalized Lanczos method, this preconditioner must be symmetric
and positive definite. The available options are to use no preconditioner, the absolute value of the diagonal
of the Hessian matrix, a limited-memory BFGS approximation to the Hessian matrix, or one of the other
preconditioners provided by the PETSc package. These preconditioners are specified by the command line
arguments -tao_nls_pc_type <none,jacobi,icc,ilu,lmvm>, respectively. The default is the lmvm
preconditioner, which uses a BFGS approximation of the inverse Hessian. See the PETSc manual for further
information on changing the behavior of the preconditioners.
The perturbation ρk is added when the direction returned by the Krylov subspace method is not a descent
direction, the Krylov method diverged due to an indefinite preconditioner or matrix, or a direction of negative
curvature was found. In the last two cases, if the step returned is a descent direction, it is used during the line
search. Otherwise, a steepest descent direction is used during the line search. The perturbation is decreased
as long as the Krylov subspace method reports success and increased if further problems are encountered.
There are three cases: initializing, increasing, and decreasing the perturbation. These cases are described
below.
1. If ρk is zero and a problem was detected with either the direction or the Krylov subspace method, the
perturbation is initialized to
where g(xk ) is the gradient of the objective function and imin is set with the command line argument
-tao_nls_imin <real> with a default value of 10−4 , imfac by -tao_nls_imfac with a default
value of 0.1, and imax by -tao_nls_imax with a default value of 100. When using the gltr method
to solve the system of equations, an estimate of the minimum eigenvalue λ1 of the Hessian matrix is
available. This value is used to initialize the perturbation to ρk+1 = max {ρk+1 , −λ1 } in this case.
2. If ρk is nonzero and a problem was detected with either the direction or Krylov subspace method, the
perturbation is increased to
where g(xk ) is the gradient of the objective function and pgfac is set with the command line argument
-tao_nls_pgfac with a default value of 10, pmgfac by -tao_nls_pmgfac with a default value
of 0.1, and pmax by -tao_nls_pmax with a default value of 100.
3. If ρk is nonzero and no problems were detected with either the direction or Krylov subspace method,
the perturbation is decreased to
where g(xk ) is the gradient of the objective function, psfac is set with the command line argument
-tao_nls_psfac with a default value of 0.4, and pmsfac is set by -tao_nls_pmsfac with a
default value of 0.1. Moreover, if ρk+1 < pmin, then ρk+1 = 0, where pmin is set with the command
line argument -tao_nls_pmin and has a default value of 10−12 .
Near a local minimizer to the unconstrained optimization problem, the Hessian matrix will be positive-
semidefinite; the perturbation will shrink toward zero, and one would eventually observe a superlinear
convergence rate.
When using nash, stcg, or gltr to solve the linear systems of equation, a trust-region radius needs to be
initialized and updated. This trust-region radius simultaneously limits the size of the step computed and
reduces the number of iterations of the conjugate gradient method. The method for initializing the trust-
region radius is set with the command line argument -tao_nls_init_type <constant,direction,
interpolation>; interpolation, which chooses an initial value based on the interpolation scheme
found in [CGT00], is the default. This scheme performs a number of function and gradient evaluations to
determine a radius such that the reduction predicted by the quadratic model along the gradient direction
coincides with the actual reduction in the nonlinear function. The iterate obtaining the best objective
function value is used as the starting point for the main line search algorithm. The constant method
initializes the trust-region radius by using the value specified with the -tao_trust0 <real> command line
argument, where the default value is 100. The direction technique solves the first quadratic optimization
problem by using a standard conjugate gradient method and initializes the trust region to ||s0 ||.
The method for updating the trust-region radius is set with the command line argument
-tao_nls_update_type <step,reduction,interpolation>; step is the default. The step
method updates the trust-region radius based on the value of τk . In particular,
∆k+1 = ω1 min(∆k , ||dk ||)   if τk ∈ [0, ν1 ),
∆k+1 = ω2 min(∆k , ||dk ||)   if τk ∈ [ν1 , ν2 ),
∆k+1 = ω3 ∆k                  if τk ∈ [ν2 , ν3 ),
∆k+1 = max(∆k , ω4 ||dk ||)   if τk ∈ [ν3 , ν4 ),
∆k+1 = max(∆k , ω5 ||dk ||)   if τk ∈ [ν4 , ∞),
where 0 < ω1 < ω2 < ω3 = 1 < ω4 < ω5 and 0 < ν1 < ν2 < ν3 < ν4 are constants. The reduction
method computes the ratio of the actual reduction in the objective function to the reduction predicted by
the quadratic model for the full step, κk = [f (xk ) − f (xk + dk )] / [q(xk ) − q(xk + dk )], where qk is the
quadratic model. The radius is
then updated as
∆k+1 = α1 min(∆k , ||dk ||)   if κk ∈ (−∞, η1 ),
∆k+1 = α2 min(∆k , ||dk ||)   if κk ∈ [η1 , η2 ),
∆k+1 = α3 ∆k                  if κk ∈ [η2 , η3 ),
∆k+1 = max(∆k , α4 ||dk ||)   if κk ∈ [η3 , η4 ),
∆k+1 = max(∆k , α5 ||dk ||)   if κk ∈ [η4 , ∞),
where 0 < α1 < α2 < α3 = 1 < α4 < α5 and 0 < η1 < η2 < η3 < η4 are constants. The interpolation
method uses the same interpolation mechanism as in the initialization to compute a new value for the
trust-region radius.
This algorithm will be deprecated in the next version and replaced by the Bounded Newton Line Search
(BNLS) algorithm that can solve both bound constrained and unconstrained problems.
The Newton trust-region method solves the constrained quadratic programming problem
min_d (1/2) dT Hk d + gkT d
subject to ||d|| ≤ ∆k
to obtain a direction dk , where Hk is the Hessian of the objective function at xk , gk is the gradient of the
objective function at xk , and ∆k is the trust-region radius. If xk + dk sufficiently reduces the nonlinear
objective function, then the step is accepted, and the trust-region radius is updated. However, if xk + dk
does not sufficiently reduce the nonlinear objective function, then the step is rejected, the trust-region radius
is reduced, and the quadratic program is re-solved by using the updated trust-region radius. The Newton
trust-region method can be set by using the TAO solver tao_ntr. The options available for this solver are
listed in Table 2.18. For the best efficiency, function and gradient evaluations should be performed separately
when using this algorithm.
The quadratic optimization problem is approximately solved by applying the Nash or Steihaug-Toint
conjugate gradient methods or the generalized Lanczos method to the symmetric system of equations
Hk d = −gk . The method used to solve the system of equations is specified with the command line ar-
gument -tao_ntr_ksp_type <nash,stcg,gltr>; stcg is the default. See the PETSc manual for
further information on changing the behavior of these linear system solvers.
A good preconditioner reduces the number of iterations required to compute the direction. For the Nash
and Steihaug-Toint conjugate gradient methods and generalized Lanczos method, this preconditioner must
be symmetric and positive definite. The available options are to use no preconditioner, the absolute value
of the diagonal of the Hessian matrix, a limited-memory BFGS approximation to the Hessian matrix, or
one of the other preconditioners provided by the PETSc package. These preconditioners are specified by
the command line argument -tao_ntr_pc_type <none,jacobi,icc,ilu,lmvm>, respectively. The
default is the lmvm preconditioner. See the PETSc manual for further information on changing the behavior
of the preconditioners.
The method for computing an initial trust-region radius is set with the command line arguments
-tao_ntr_init_type <constant,direction,interpolation>; interpolation, which chooses
an initial value based on the interpolation scheme found in [CGT00], is the default. This scheme performs a
number of function and gradient evaluations to determine a radius such that the reduction predicted by the
quadratic model along the gradient direction coincides with the actual reduction in the nonlinear function.
The iterate obtaining the best objective function value is used as the starting point for the main trust-region
algorithm. The constant method initializes the trust-region radius by using the value specified with the
-tao_trust0 <real> command line argument, where the default value is 100. The direction tech-
nique solves the first quadratic optimization problem by using a standard conjugate gradient method and
initializes the trust region to ||s0 ||.
The method for updating the trust-region radius is set with the command line arguments
-tao_ntr_update_type <reduction,interpolation>; reduction is the default. The reduc-
tion method computes the ratio of the actual reduction in the objective function to the reduction predicted
by the quadratic model for the full step,

    κk = (f(xk) − f(xk + dk)) / (q(xk) − q(xk + dk)),

where qk is the quadratic model. The radius is then updated as

    ∆k+1 = α1 min(∆k, ‖dk‖)   if κk ∈ (−∞, η1)
           α2 min(∆k, ‖dk‖)   if κk ∈ [η1, η2)
           α3 ∆k              if κk ∈ [η2, η3)
           max(∆k, α4 ‖dk‖)   if κk ∈ [η3, η4)
           max(∆k, α5 ‖dk‖)   if κk ∈ [η4, ∞),
where 0 < α1 < α2 < α3 = 1 < α4 < α5 and 0 < η1 < η2 < η3 < η4 are constants. The interpolation
method uses the same interpolation mechanism as in the initialization to compute a new value for the
trust-region radius.
This algorithm will be deprecated in the next version and replaced by the Bounded Newton Trust Region
(BNTR) algorithm that can solve both bound constrained and unconstrained problems.
NTL safeguards the trust-region globalization such that a line search is used in the event that the step is
initially rejected by the predicted versus actual decrease comparison. If the line search fails to find a viable
step length for the Newton step, it falls back onto a scaled gradient or a gradient descent step. The trust
radius is then modified based on the line search step length.
This algorithm will be deprecated in the next version and replaced by the Bounded Newton Trust Region
with Line Search (BNTL) algorithm that can solve both bound constrained and unconstrained problems.
The limited-memory, variable-metric method (LMVM) computes a positive definite approximation to the
Hessian matrix from a limited number of previous iterates and gradient evaluations. A direction is then
obtained by solving the system of equations
Hk dk = −∇f (xk ),
where Hk is the Hessian approximation obtained by using the BFGS update formula. The inverse of Hk can
readily be applied to obtain the direction dk . Having obtained the direction, a Moré-Thuente line search is
applied to compute a step length, τk , that approximately solves the one-dimensional optimization problem
    min_τ  f(xk + τ dk).
The current iterate and Hessian approximation are updated, and the process is repeated until the method
converges. This algorithm is the default unconstrained minimization solver and can be selected by using
the TAO solver tao_lmvm. For best efficiency, function and gradient evaluations should be performed
simultaneously when using this algorithm.
The primary factors determining the behavior of this algorithm are the type of Hessian approximation used,
the number of vectors stored for the approximation and the initialization/scaling of the approximation.
These options can be configured using the -tao_lmvm_mat_lmvm prefix. For further detail, we refer the
reader to the MATLMVM matrix type definitions in the PETSc Manual.
The LMVM algorithm also allows the user to define a custom initial Hessian matrix H0,k through the
interface function TaoLMVMSetH0(). This user-provided initialization overrides any other scalar or diagonal
initialization inherent to the LMVM approximation. The provided H0,k must be a PETSc Mat type object
that represents a positive-definite matrix. The approximation prefers MatSolve() if the provided matrix
has MATOP_SOLVE implemented. Otherwise, MatMult() is used in a KSP solve to perform the inversion
of the user-provided initial Hessian.
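A minimal sketch of supplying such an initial Hessian is shown below; the choice H0 = 2I is purely illustrative, and the assumption that the solver retains its own reference to the matrix should be verified against the installed version.

/* Sketch: supply a simple positive-definite initial Hessian H0 = 2I to the
   LMVM approximation; tao and the solution vector x are assumed to exist. */
Mat      H0;
PetscInt nlocal, nglobal;

PetscCall(VecGetLocalSize(x, &nlocal));
PetscCall(VecGetSize(x, &nglobal));
PetscCall(MatCreateConstantDiagonal(PetscObjectComm((PetscObject)x), nlocal, nlocal,
                                    nglobal, nglobal, 2.0, &H0));
PetscCall(TaoLMVMSetH0(tao, H0));
PetscCall(MatDestroy(&H0));   /* assumes the solver keeps its own reference */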
In applications where TaoSolve() on the LMVM algorithm is repeatedly called to solve similar or related
problems, the -tao_lmvm_recycle flag can be used to prevent resetting the LMVM approximation between
subsequent solutions. This recycling also avoids one extra function and gradient evaluation, instead re-using
the values already computed at the end of the previous solution.
This algorithm will be deprecated in the next version and replaced by the Bounded Quasi-Newton Line
Search (BQNLS) algorithm that can solve both bound constrained and unconstrained problems.
The nonlinear conjugate gradient method can be viewed as an extension of the conjugate gradient method
for solving symmetric, positive-definite linear systems of equations. This algorithm requires only function
and gradient evaluations as well as a line search. The TAO implementation uses a Moré-Thuente line search
to obtain the step length. The nonlinear conjugate gradient method can be selected by using the TAO solver
tao_cg. For the best efficiency, function and gradient evaluations should be performed simultaneously when
using this algorithm.
Five variations are currently supported by the TAO implementation: the Fletcher-Reeves method, the
Polak-Ribière method, the Polak-Ribière-Plus method [NW99], the Hestenes-Stiefel method, and the Dai-
Yuan method. These conjugate gradient methods can be specified by using the command line argument
-tao_cg_type <fr,pr,prp,hs,dy>, respectively. The default value is prp.
The conjugate gradient method incorporates automatic restarts when successive gradients are not sufficiently
orthogonal. TAO measures the orthogonality by dividing the inner product of the gradient at the current
point and the gradient at the previous point by the square of the Euclidean norm of the gradient at the
current point. When the absolute value of this ratio is greater than η, the algorithm restarts using the gradient
direction. The parameter η can be set by using the command line argument -tao_cg_eta <real>; 0.1 is
the default value.
This algorithm will be deprecated in the next version and replaced by the Bounded Nonlinear Conjugate
Gradient (BNCG) algorithm that can solve both bound constrained and unconstrained problems.
The Nelder-Mead algorithm [NM65] is a direct search method for finding a local minimum of a function f (x).
This algorithm does not require any gradient or Hessian information of f and therefore has some expected
advantages and disadvantages compared to the other TAO solvers. The obvious advantage is that it is easier
to write an application when no derivatives need to be calculated. The downside is that this algorithm can
be slow to converge or can even stagnate, and it performs poorly for large numbers of variables.
This solver keeps a set of N + 1 sorted vectors x1 , x2 , . . . , xN +1 and their corresponding objective function
values f1 ≤ f2 ≤ . . . ≤ fN +1 . At each iteration, xN +1 is removed from the set and replaced with
    x(µ) = (1 + µ) (1/N) Σ_{i=1}^{N} xi − µ xN+1,

where µ can be one of µ0, 2µ0, µ0/2, or −µ0/2, depending on the values of each possible f(x(µ)).
The algorithm terminates when the residual fN +1 − f1 becomes sufficiently small. Because of the way new
vectors can be added to the sorted set, the minimum function value and/or the residual may not be impacted
at each iteration.
Two options can be set specifically for the Nelder-Mead algorithm:
-tao_nm_lambda <value>
sets the initial set of vectors (x0 plus value in each coordinate direction); the default value is 1.
-tao_nm_mu <value>
sets the value of µ0 ; the default is µ0 = 1.
Bound-Constrained Optimization
Bound-constrained solvers address problems of the form

    min_x  f(x)
    subject to  l ≤ x ≤ u.
These solvers use the bounds on the variables as well as objective function, gradient, and possibly Hessian
information.
For any unbounded variables, the bound value for the associated index can be set to PETSC_INFINITY for
the upper bound and PETSC_NINFINITY for the lower bound. If all bounds are set to infinity, then the
bounded algorithms are equivalent to their unconstrained counterparts.
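For example, a problem with a lower bound of zero on every variable and no upper bounds could set its bounds as in the following sketch (tao and the solution vector x are assumed to already exist).

/* Sketch: nonnegativity constraints l = 0, u = +infinity on every variable */
Vec lb, ub;

PetscCall(VecDuplicate(x, &lb));
PetscCall(VecDuplicate(x, &ub));
PetscCall(VecSet(lb, 0.0));                 /* lower bounds l_i = 0 */
PetscCall(VecSet(ub, PETSC_INFINITY));      /* no upper bounds */
PetscCall(TaoSetVariableBounds(tao, lb, ub));
PetscCall(VecDestroy(&lb));                 /* the solver keeps its own references */
PetscCall(VecDestroy(&ub));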
Before introducing specific methods, we will first define two projection operations used by all bound con-
strained algorithms.
• Gradient projection:

      P(g)i = 0    if (xi ≤ li ∧ gi > 0) ∨ (xi ≥ ui ∧ gi < 0)
              gi   otherwise

• Bound projection:

      B(x)i = li   if xi < li
              ui   if xi > ui
              xi   otherwise
TAO features a class of three bounded Newton-Krylov (BNK) algorithms, distinguished by their globalization
methods: projected line search (BNLS), trust region (BNTR), and trust region with a projected line search
fall-back (BNTL). They are available via the TAO solvers TAOBNLS, TAOBNTR, and TAOBNTL, respectively,
or the -tao_type bnls/bntr/bntl flag.
The BNK class of methods use an active-set approach to solve the symmetric system of equations,
Hk pk = −gk ,
only for inactive variables in the interior of the bounds. The active-set estimation is based on Bertsekas
[Ber82] with the following variable index categories:
    lower bounded: L(x) = {i : xi ≤ li + ϵ ∧ g(x)i > 0},
    upper bounded: U(x) = {i : xi ≥ ui − ϵ ∧ g(x)i < 0},
    fixed:         F(x) = {i : li = ui},
    active-set:    A(x) = L(x) ∪ U(x) ∪ F(x),
    inactive-set:  I(x) = {1, 2, . . . , n} \ A(x).
At each iteration, the bound tolerance is estimated as ϵk+1 = min(ϵk , ||wk ||2 ) with wk = xk − B(xk −
βDk gk ), where the diagonal matrix Dk is an approximation of the Hessian inverse Hk−1 . The initial bound
tolerance ϵ0 and the step length β have default values of 0.001 and can be adjusted using -tao_bnk_as_tol
and -tao_bnk_as_step flags, respectively. The active-set estimation can be disabled using the option
-tao_bnk_as_type none, in which case the algorithm simply uses the current iterate with no bound
tolerances to determine which variables are actively bounded and which are free.
BNK algorithms invert the reduced Hessian using a Krylov iterative method. Trust-region conjugate gradient
methods (KSPNASH, KSPSTCG, and KSPGLTR) are required for the BNTR and BNTL algorithms, and recom-
mended for the BNLS algorithm. The preconditioner type can be changed using the -tao_bnk_pc_type
none/ilu/icc/jacobi/lmvm. The lmvm option, which is also the default, preconditions the Krylov so-
lution with a MATLMVM matrix. The remaining supported preconditioner types are default PETSc types. If
Jacobi is selected, the diagonal values are safeguarded to be positive. icc and ilu options produce good
results for problems with dense Hessians. The LMVM and Jacobi preconditioners are also used as the ap-
proximate inverse-Hessian in the active-set estimation. If neither are available, or if the Hessian matrix does
not have MATOP_GET_DIAGONAL defined, then the active-set estimation falls back onto using an identity
matrix in place of Dk (this is equivalent to estimating the active-set using a gradient descent step).
A special option is available to accelerate the convergence of the BNK algorithms by taking a finite number of
BNCG iterations at each Newton iteration. By default, the number of BNCG iterations is set to zero and the
algorithms do not take any BNCG steps. This can be changed using the option flag -tao_bnk_max_cg_its
<i>. While this reduces the number of Newton iterations, in practice it simply trades off the Hessian
evaluations in the BNK solver for more function and gradient evaluations in the BNCG solver. However,
it may be useful for certain types of problems where the Hessian evaluation is disproportionately more
expensive than the objective function or its gradient.
BNLS safeguards the Newton step by falling back onto a BFGS, scaled gradient, or gradient steps based on
descent direction verifications. For problems with indefinite Hessian matrices, the step direction is calculated
using a perturbed system of equations,
(Hk + ρk I)pk = −gk ,
where ρk is a dynamically adjusted positive constant. The step is globalized using a projected Moré-Thuente
line search. If a trust-region conjugate gradient method is used for the Hessian inversion, the trust radius is
modified based on the line search step length.
BNTR globalizes the Newton step using a trust region method based on the predicted versus actual reduction
in the cost function. The trust radius is increased only if the accepted step is at the trust region boundary.
The reduction check features a safeguard for numerical values below machine epsilon, scaled by the latest
function value, where the full Newton step is accepted without modification.
BNTL safeguards the trust-region globalization such that a line search is used in the event that the step is
initially rejected by the predicted versus actual decrease comparison. If the line search fails to find a viable
step length for the Newton step, it falls back onto a scaled gradient or a gradient descent step. The trust
radius is then modified based on the line search step length.
The BQNLS algorithm uses the BNLS infrastructure, but replaces the step calculation with a direct inverse
application of the approximate Hessian based on quasi-Newton update formulas. No Krylov solver is used
in the solution, and therefore the quasi-Newton method chosen must guarantee a positive-definite Hessian
approximation. This algorithm is available via tao_type bqnls.
Bounded Quasi-Newton-Krylov
BQNK algorithms use the BNK infrastructure, but replace the exact Hessian with a quasi-Newton approx-
imation. The matrix-free forward product operation based on quasi-Newton update formulas are used in
conjunction with Krylov solvers to compute step directions. The quasi-Newton inverse application is used to
precondition the Krylov solution, and typically helps converge to a step direction in O(10) iterations. This
approach is most useful with quasi-Newton update types such as Symmetric Rank-1 that cannot strictly
guarantee positive-definiteness. The BNLS framework with Hessian shifting, or the BNTR framework with
trust region safeguards, can successfully compensate for the Hessian approximation becoming indefinite.
Similar to the full Newton-Krylov counterpart, BQNK algorithms come in three forms separated by the
globalization technique: line search (BQNKLS), trust region (BQNKTR) and trust region w/ line search
fall-back (BQNKTL). These algorithms are available via tao_type <bqnkls, bqnktr, bqnktl>.
BNCG extends the unconstrained nonlinear conjugate gradient algorithm to bound constraints via gradient
projections and a bounded Moré-Thuente line search.
Like its unconstrained counterpart, BNCG offers gradient descent and a variety of CG updates: Fletcher-
Reeves, Polak-Ribiére, Polak-Ribiére-Plus, Hestenes-Stiefel, Dai-Yuan, Hager-Zhang, Dai-Kou, Kou-Dai,
and the Self-Scaling Memoryless (SSML) BFGS, DFP, and Broyden methods. These methods can be
specified by using the command line argument -tao_bncg_type <gd,fr,pr,prp,hs,dy,hz,dk,kd,
ssml_bfgs,ssml_dfp,ssml_brdn>, respectively. The default value is ssml_bfgs. We have scalar
preconditioning for these methods, and it is controlled by the flag tao_bncg_alpha. To disable rescaling,
use α = −1.0, otherwise α ∈ [0, 1]. BNCG is available via the TAO solver TAOBNCG or the -tao_type
bncg flag.
Some individual methods also contain their own parameters. The Hager-Zhang and Dai-Kou methods have
a parameter that determines the minimum amount of contribution the previous search direction gives to
the next search direction. The flags are -tao_bncg_hz_eta and -tao_bncg_dk_eta, and by default
are set to 0.4 and 0.5, respectively. The Kou-Dai method has multiple parameters. -tao_bncg_zeta
serves the same purpose as the previous two; set to 0.1 by default. There is also a parameter to scale the
contribution of yk ≡ ∇f (xk ) − ∇f (xk−1 ) in the search direction update. It is controlled by -tao_bncg_xi,
and is equal to 1.0 by default. There are also times where we want to maximize the descent as measured
by ∇f (xk )T dk , and that may be done by using a negative value of ξ; this achieves better performance
when not using the diagonal preconditioner described next. This is enabled by default, and is controlled
by -tao_bncg_neg_xi. Finally, the Broyden method has its convex combination parameter, set with
-tao_bncg_theta. We have this as 1.0 by default, i.e. it is by default the BFGS method. One can also
individually tweak the BFGS and DFP contributions using the multiplicative constants -tao_bncg_scale;
both are set to 1 by default.
All methods can be scaled using the parameter -tao_bncg_alpha, which continuously varies in [0, 1]. The
default value is set depending on the method from initial testing.
BNCG also offers a special type of method scaling. It employs Broyden diagonal scaling as an option for its
CG methods, turned on with the flag -tao_bncg_diag_scaling. Formulations for both the forward (reg-
ular) and inverse Broyden methods are developed, controlled by the flag -tao_bncg_mat_lmvm_forward.
It is set to True by default. Whether one uses the forward or inverse formulations depends on the method
being used. For example, in our preliminary computations, the forward formulation works better for the
SSML_BFGS method, but the inverse formulation works better for the Hestenes-Stiefel method. The con-
vex combination parameter for the Broyden scaling is controlled by -tao_bncg_mat_lmvm_theta, and
is 0 by default. We also employ rescaling of the Broyden diagonal, which aids the linesearch immensely.
The rescaling parameter is controlled by -tao_bncg_mat_lmvm_alpha, and should be ∈ [0, 1]. One can
disable rescaling of the Broyden diagonal entirely by setting -tao_bncg_mat_lmvm_sigma_hist 0.
One can also supply their own preconditioner, serving as a Hessian initialization to the above diagonal scaling.
The appropriate user function in the code is TaoBNCGSetH0(tao, H0) where H0 is the user-defined Mat
object that serves as a preconditioner. For an example of similar usage, see tao/tutorials/ex3.c.
The active set estimation uses the Bertsekas-based method described in Bounded Newton-Krylov Methods,
which can be deactivated using -tao_bncg_as_type none, in which case the algorithm will use the
current iterate to determine the bounded variables with no tolerances and no look-ahead step. As in the
BNK algorithm, the initial bound tolerance and estimator step length used in the Bertsekas method can be
set via -tao_bncg_as_tol and -tao_bncg_as_step, respectively.
In addition to automatic scaled gradient descent restarts under certain local curvature conditions, we
also employ restarts based on a check on the descent direction such that ∇f(xk)ᵀ dk ∈ [−10¹¹, −10⁻⁹].
Furthermore, we allow for a variety of alternative restart strategies, all disabled by default. The
-tao_bncg_unscaled_restart flag allows one to disable rescaling of the gradient for gradient descent
steps. The -tao_bncg_spaced_restart flag tells the solver to restart every M n iterations, where n is
the problem dimension and M is a constant determined by -tao_bncg_min_restart_num and is 6 by
default. We also have dynamic restart strategies based on checking if a function is locally quadratic; if so,
go do a gradient descent step. The flag is -tao_bncg_dynamic_restart, disabled by default since the
CG solver usually does better in those cases anyway. The minimum number of quadratic-like steps before a
restart is set using -tao_bncg_min_quad and is 6 by default.
Constrained solvers solve optimization problems that incorporate either or both equality and inequality
constraints, and may optionally include bounds on solution variables.
The TAOADMM algorithm is intended to blend the decomposability of dual ascent with the superior con-
vergence properties of the method of multipliers [BPC+11]. The algorithm solves problems of the form

    min_{x,z}  f(x) + g(z)
    subject to  Ax + Bz = c,

where x ∈ R^n, z ∈ R^m, A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ R^p. Essentially, ADMM is a wrapper over two TAO
solvers, one for f(x) and one for g(z). With the method of multipliers, one forms the augmented Lagrangian
of this problem; the two subsolvers then alternately minimize it over x and over z, followed by an update of
the dual variable associated with the coupling constraint.

TAOALMM solves generally constrained problems, that is, problems of the form

    min_x  f(x)
    subject to  g(x) = 0,
                h(x) ≥ 0,
                l ≤ x ≤ u,

where g(x) are equality constraints, h(x) are inequality constraints, and l and u are lower and upper bounds
on the optimization variables, respectively.
TAOALMM converts the above general constrained problem into a sequence of bound constrained problems
at each outer iteration k = 1, 2, . . .
    min_x  L(x, λk)
    subject to  l ≤ x ≤ u,

where L(x, λk) is the augmented Lagrangian merit function and λk are the Lagrange multiplier estimates at
outer iteration k.
TAOALMM offers two versions of the augmented Lagrangian formulation: the canonical Hestenes-Powell
augmented Lagrangian [Hes69] [Pow69] with inequality constraints converted to equality constraints via
slack variables, and the slack-less Powell-Hestenes-Rockafellar formulation [Roc74] that utilizes a pointwise
max() on the inequality constraints. For most applications, the canonical Hestenes-Powell formulation is
likely to perform better. However, the PHR formulation may be desirable for problems featuring very large
numbers of inequality constraints as it avoids inflating the dimension of the subproblem with slack variables.
The inner subproblem is solved using a nested bound-constrained first-order TAO solver. By default,
TAOALMM uses a quasi-Newton-Krylov trust-region method (TAOBQNKTR). Other first-order methods
such as TAOBNCG and TAOBQNLS are also appropriate, but a trust-region globalization is strongly rec-
ommended for most applications.
The TAOPDIPM method (-tao_type pdipm) implements a primal-dual interior point method for solving
general nonlinear programming problems of the form
    min_x  f(x)
    subject to  g(x) = 0,                    (2.6)
                h(x) ≥ 0,
                x− ≤ x ≤ x+
Here, f (x) is the nonlinear objective function, g(x), h(x) are the equality and inequality constraints, and x−
and x+ are the lower and upper bounds on decision variables x.
PDIPM converts the inequality constraints to equalities using slack variables z and a log-barrier term, which
transforms (2.6) to
    min  f(x) − µ Σ_{i=1}^{nci} ln zi
    s.t.  ce(x) = 0,                          (2.7)
          ci(x) − z = 0
Here, ce(x) is the set of equality constraints that includes g(x) and the fixed decision variables, i.e., x− = x = x+.
Similarly, ci(x) is the set of inequality constraints that includes h(x) and the lower/upper/box constraints on x. µ is a
parameter that is driven to zero as the optimization progresses.
The Lagrangian for (2.7) is

    Lµ(x, λce, λci, z) = f(x) + λceᵀ ce(x) − λciᵀ (ci(x) − z) − µ Σ_{i=1}^{nci} ln zi,     (2.8)

where λce and λci are the Lagrange multipliers for the equality and inequality constraints, respectively.
The first-order KKT conditions for optimality are

    ∇Lµ(x, λce, λci, z) =  [ ∇f(x) + ∇ce(x)ᵀ λce − ∇ci(x)ᵀ λci ]
                           [ ce(x)                             ]  =  0,     (2.9)
                           [ ci(x) − z                         ]
                           [ ZΛci e − µe                       ]

where Z and Λci are diagonal matrices with the entries of z and λci on their diagonals, and e is the vector of ones.
Equation (2.9) is solved iteratively with Newton's method using PETSc's SNES object. After each Newton iteration, a
line search is performed to update x and enforce z, λci ≥ 0. The barrier parameter µ is also updated after each
Newton iteration. The Newton update is obtained by solving the second-order KKT system Hd = −∇Lµ.
Here, H is the Hessian matrix of the KKT system. For interior-point methods such as PDIPM, the Hessian
matrix tends to be ill-conditioned, thus necessitating the use of a direct solver. We recommend using an LU
preconditioner (-pc_type lu) with a direct linear solver package such as SuperLU_DIST or MUMPS.
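The sketch below shows one way to request these settings from the application code instead of the command line; it simply sets the corresponding options programmatically before TaoSetFromOptions() is called, and it assumes PETSc was configured with MUMPS.

/* Equivalent to passing these options on the command line; must be set
   before TaoSetFromOptions() is called. MUMPS is assumed to be available. */
PetscCall(PetscOptionsSetValue(NULL, "-tao_type", "pdipm"));
PetscCall(PetscOptionsSetValue(NULL, "-pc_type", "lu"));
PetscCall(PetscOptionsSetValue(NULL, "-pc_factor_mat_solver_type", "mumps"));
PetscCall(TaoSetFromOptions(tao));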
PDE-Constrained Optimization
    min_{u,v}  f(u, v)
    subject to  g(u, v) = 0,
where the state variable u is the solution to the discretized partial differential equation defined by g and
parametrized by the design variable v, and f is an objective function. The Lagrange multipliers on the
constraint are denoted by y. This method is set by using the linearly constrained augmented Lagrangian
TAO solver tao_lcl.
We make two main assumptions when solving these problems: the objective function and PDE constraints
have been discretized so that we can treat the optimization problem as finite dimensional and ∇u g(u, v) is
invertible for all u and v.
Given the current iterate (uk , vk , yk ), the linearly constrained augmented Lagrangian method approximately
solves the optimization problem
Newton Step
The Newton direction is obtained by fixing the design variables at their current value and solving the
linearized constraint for the state variables. In particular, we solve the system of equations
Ak du = −gk
to obtain a direction du. We need a direction that provides sufficient descent for the merit function
    (1/2) ‖g(u, v)‖².

That is, we require gkᵀ Ak du < 0.
If the Newton direction is a descent direction, then we choose a penalty parameter ρk so that du is also a
sufficient descent direction for the augmented Lagrangian merit function. We then find α to approximately
minimize the augmented Lagrangian merit function along the Newton direction.
We can enforce either the sufficient decrease condition or the Wolfe conditions during the search procedure.
The new point,

    uk+1/2 = uk + αk du,
    vk+1/2 = vk,

approximately satisfies the linearized constraint,

    Ak (uk+1/2 − uk) + Bk (vk+1/2 − vk) + αk gk = 0.
If the Newton direction computed does not provide descent for the merit function, then we can use the
steepest descent direction du = −Akᵀ gk during the search procedure. However, the implication that the
intermediate point approximately satisfies the linear constraint is no longer true.
We are now ready to compute a reduced-space step for the modified optimization problem, in which the
state variables are eliminated through the linearized constraint,

    du = −Ak⁻¹ (Bk dv + αk gk).
We apply one step of a limited-memory quasi-Newton method to this problem. The direction is obtained by
solving the quadratic problem

    min_dv  (1/2) dvᵀ H̃k dv + g̃k+1/2ᵀ dv,

where H̃k is the limited-memory quasi-Newton approximation to the reduced Hessian matrix, a positive-
definite matrix, and g̃k+1/2 is the reduced gradient.
    yk+1/2 = Ak⁻ᵀ ck+1/2
Because the Hessian approximation is positive definite and we know its inverse, we obtain the direction

    dv = −H̃k⁻¹ g̃k+1/2

and recover the full-space direction from one linearized forward solve,

    du = −Ak⁻¹ Bk dv.
Having the full-space direction, which satisfies the linear constraint, we now approximately minimize the
augmented Lagrangian merit function along the direction.
We enforce the Wolfe conditions during the search procedure. The new point is
    uk+1 = uk+1/2 + βk du
    vk+1 = vk+1/2 + βk dv.

The multipliers and the reduced gradient are then updated as

    yk+1 = Ak⁻ᵀ ck+1
    g̃k+1 = dk+1 − Bkᵀ yk+1,
where ck+1 = ∇u f̃k(uk+1, vk+1) and dk+1 = ∇v f̃k(uk+1, vk+1). The multipliers yk+1 become the multipliers
used in the next iteration of the code. The quantities vk+1/2, vk+1, g̃k+1/2, and g̃k+1 are used to update Hk
to obtain the limited-memory quasi-Newton approximation to the reduced Hessian matrix used in the next
iteration of the code. The update is skipped if it cannot be performed.
Nonlinear Least-Squares
The TAOBRGN algorithm is a Gauss-Newton method that iteratively solves the nonlinear least-squares
problem with iterations of the form

    xk+1 = xk − αk (Jkᵀ Jk)⁻¹ Jkᵀ r(xk),

where r(x) is the least-squares residual vector, Jk = ∂r(xk)/∂x is the Jacobian of the residual, and αk is the
step length parameter. In other words, the Gauss-Newton method approximates the Hessian of the objective
as Hk ≈ Jkᵀ Jk and the gradient of the objective as gk ≈ Jkᵀ r(xk). The least-squares Jacobian, J, should
be provided to Tao using the TaoSetJacobianResidual() routine.
The BRGN (-tao_type brgn) implementation adds a regularization term β(x) such that

    min_x  (1/2) ‖R(x)‖₂² + λ β(x),

where λ is the scalar weight of the regularizer. BRGN provides three default implementations for β(x):
• L2-norm: β(x) = (1/2) ‖xk‖₂²
• L2-norm proximal point: β(x) = (1/2) ‖xk − xk−1‖₂²
• L1-norm with dictionary: β(x) = ‖Dx‖₁ ≈ Σᵢ ( √(yᵢ² + ϵ²) − ϵ ), where y = Dx and ϵ is the smooth
  approximation parameter.
The regularizer weight can be controlled with either TaoBRGNSetRegularizerWeight() or
-tao_brgn_regularizer_weight command line option, while the smooth approximation parameter
can be set with either TaoBRGNSetL1SmoothEpsilon() or -tao_brgn_l1_smooth_epsilon. For
the L1-norm term, the user can supply a dictionary matrix with TaoBRGNSetDictionaryMatrix(). If
no dictionary is provided, the dictionary is assumed to be an identity matrix and the regularizer reduces to
a sparse solution term.
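The sketch below shows how these pieces might be combined for an ℓ1-regularized problem; the dictionary matrix D is a hypothetical user-provided object, and the exact calling sequences of the TaoBRGNSet* routines should be checked against the installed version.

/* Sketch: configure BRGN with an L1-dictionary regularizer (assumed interfaces). */
PetscCall(TaoSetType(tao, TAOBRGN));
/* equivalent to -tao_brgn_regularization_type l1dict on the command line */
PetscCall(PetscOptionsSetValue(NULL, "-tao_brgn_regularization_type", "l1dict"));
PetscCall(TaoBRGNSetDictionaryMatrix(tao, D));        /* D: user dictionary (hypothetical) */
PetscCall(TaoBRGNSetRegularizerWeight(tao, 1.0e-3));  /* lambda */
PetscCall(TaoBRGNSetL1SmoothEpsilon(tao, 1.0e-6));    /* epsilon for the smoothed |.| */
PetscCall(TaoSetFromOptions(tao));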
The regularization selection can be made using the command line option
-tao_brgn_regularization_type <l2pure, l2prox, l1dict, user> where the user
option allows the user to define a custom C2-continuous regularization term. This custom term can be
defined by using the interface functions:
• TaoBRGNSetRegularizerObjectiveAndGradientRoutine() - provides a user call-back for evaluating
  the function value and gradient of the regularization term.
• TaoBRGNSetRegularizerHessianRoutine() - provides a user call-back for evaluating the Hessian
  of the regularization term.
POUNDERS
One algorithm for solving the least squares problem (2.10) when the Jacobian of the residual vector F
is unavailable is the model-based POUNDERS (Practical Optimization Using No Derivatives for sums of
Squares) algorithm (tao_pounders). POUNDERS employs a derivative-free trust-region framework as
described in [CSV09] in order to converge to local minimizers. An example of this version of POUNDERS
applied to a practical least-squares problem can be found in [KortelainenLesinskiMore+10].
In each iteration k, the algorithm maintains a model mk (x), described below, of the nonlinear least squares
function f centered about the current iterate xk .
If one assumes that the maximum number of function evaluations has not been reached and that
‖∇mk(xk)‖₂ > gtol, the next point x+ to be evaluated is obtained by solving the trust-region subproblem (2.11),
where ∆k is the current trust-region radius. By default we use a trust-region norm with p = ∞ and solve
(2.11) with the BLMVM method described in Bound-constrained Limited-Memory Variable-Metric Method
(BLMVM). While the subproblem is a bound-constrained quadratic program, it may not be convex and the
BQPIP and GPCG methods may not solve the subproblem. Therefore, a bounded Newton-Krylov Method
should be used; the default is the BNTR algorithm. Note: BNTR uses its own internal trust region that
may interfere with the infinity-norm trust region used in the model problem (2.11).
The residual vector is then evaluated to obtain F (x+ ) and hence f (x+ ). The ratio of actual decrease to
predicted decrease,

    ρk = (f(xk) − f(x+)) / (mk(xk) − mk(x+)),

as well as an indicator, valid, on the model’s quality of approximation on the trust region is then used to
update the iterate,

    xk+1 = x+   if ρk ≥ η1
           x+   if 0 < ρk < η1 and valid = true
           xk   otherwise,
where 0 < η1 < 1, 0 < γ0 < 1 < γ1 , 0 < ω1 < 1, and ∆max are constants.
If ρk ≤ 0 and valid is false, the iterate and trust-region radius remain unchanged after the above updates,
and the algorithm tests whether the direction x+ − xk improves the model. If not, the algorithm performs
an additional evaluation to obtain F (xk + dk ), where dk is a model-improving direction.
The iteration counter is then updated, and the next model mk is obtained as described next.
In each iteration, POUNDERS uses a subset of the available evaluated residual vectors {F(y1), F(y2), · · · }
to form an interpolatory quadratic model of each residual component. The gradients gk^(i) and Hessians Hk^(i)
of these m component models are then used to construct the main model,

    mk(x) = f(xk) + 2(x − xk)ᵀ Σ_{i=1}^m Fi(xk) gk^(i)
            + (x − xk)ᵀ Σ_{i=1}^m ( gk^(i) (gk^(i))ᵀ + Fi(xk) Hk^(i) ) (x − xk).     (2.13)
The process of forming these models also computes the indicator valid of the model’s local quality.
Parameters
POUNDERS supports the following parameters that can be set from the command line or PETSc options
file:
-tao_pounders_delta <delta>
The initial trust-region radius (> 0, real). This is used to determine the size of the initial neighborhood
within which the algorithm should look.
-tao_pounders_npmax <npmax>
The maximum number of interpolation points used (n + 2 ≤ npmax ≤ 0.5(n + 1)(n + 2)). This input
is made available to advanced users. We recommend the default value (npmax= 2n + 1) be used by
others.
-tao_pounders_gqt
Use the gqt algorithm to solve the subproblem (2.11) (uses p = 2) instead of BQPIP.
-pounders_subsolver
If the default BQPIP algorithm is used to solve the subproblem (2.11), the parameters of the sub-
problem solver can be accessed using the command line options prefix -pounders_subsolver_. For
example,
-pounders_subsolver_tao_gatol 1.0e-5
Convergence Notes
Because the gradient function is not provided to POUNDERS, the norm of the gradient of the objective
function is not available. Therefore, for convergence criteria, this norm is approximated by the norm of
the model gradient and used only when the model gradient is deemed to be a reasonable approximation of
the gradient of the objective. In practice, the typical grounds for termination for expensive derivative-free
problems is the maximum number of function evaluations allowed.
Complementarity
Mixed complementarity problems, or box-constrained variational inequalities, are related to nonlinear sys-
tems of equations. They are defined by a continuously differentiable function, F : Rn → Rn , and bounds,
ℓ ∈ {R ∪ {−∞}}n and u ∈ {R ∪ {∞}}n , on the variables such that ℓ ≤ u. Given this information, x∗ ∈ [ℓ, u]
is a solution to MCP(F , ℓ, u) if for each i ∈ {1, . . . , n} we have at least one of the following:
Fi (x∗ ) ≥ 0 if x∗i = ℓi
Fi (x∗ ) = 0 if ℓi < x∗i < ui
Fi (x∗ ) ≤ 0 if x∗i = ui .
Note that when ℓ = {−∞}n and u = {∞}n , we have a nonlinear system of equations, and ℓ = {0}n and
u = {∞}n correspond to the nonlinear complementarity problem [Cot64].
Simple complementarity conditions arise from the first-order optimality conditions from optimization [Kar39]
[KT51]. In the simple bound-constrained optimization case, these conditions correspond to MCP(∇f , ℓ, u),
where f : Rn → R is the objective function. In a one-dimensional setting these conditions are intuitive. If
the solution is at the lower bound, then the function must be increasing and ∇f ≥ 0. If the solution is at
the upper bound, then the function must be decreasing and ∇f ≤ 0. If the solution is strictly between the
bounds, we must be at a stationary point and ∇f = 0. Other complementarity problems arise in economics
and engineering [FP97], game theory [Nas50], and finance [HP98].
Evaluation routines for F and its Jacobian must be supplied prior to solving the application. The bounds,
[ℓ, u], on the variables must also be provided. If no starting point is supplied, a default starting point of all
zeros is used.
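In outline, setting up a complementarity problem might look like the sketch below; the callbacks EvaluateF() and EvaluateJacobian(), the vectors x, f, l, u, and the matrix J are hypothetical user objects, and the constraint-setting interfaces used here are assumptions that should be checked against the TAO complementarity examples.

/* Hypothetical user routines for F(x) and its Jacobian */
extern PetscErrorCode EvaluateF(Tao, Vec, Vec, void *);
extern PetscErrorCode EvaluateJacobian(Tao, Vec, Mat, Mat, void *);

PetscCall(TaoCreate(PETSC_COMM_WORLD, &tao));
PetscCall(TaoSetType(tao, TAOSSILS));               /* or TAOSSFLS, TAOASILS, TAOASFLS */
PetscCall(TaoSetSolution(tao, x));                  /* starting point (defaults to zeros if omitted) */
PetscCall(TaoSetVariableBounds(tao, l, u));         /* the bounds [l,u] of the MCP */
PetscCall(TaoSetConstraintsRoutine(tao, f, EvaluateF, &user));        /* assumed interface */
PetscCall(TaoSetJacobianRoutine(tao, J, J, EvaluateJacobian, &user)); /* assumed interface */
PetscCall(TaoSetFromOptions(tao));
PetscCall(TaoSolve(tao));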
Semismooth Methods
TAO has two implementations of semismooth algorithms [MFF+01] [DeLucaFK96] [FFK97] for solving mixed
complementarity problems. Both are based on a reformulation of the mixed complementarity problem as a
nonsmooth system of equations using the Fischer-Burmeister function [Fis92]. A nonsmooth Newton method
is applied to the reformulated system to calculate a solution. The theoretical properties of such methods are
detailed in the aforementioned references.
The Fischer-Burmeister function, ϕ : R² → R, is defined as

    ϕ(a, b) := √(a² + b²) − a − b.

This function has the key property,

    ϕ(a, b) = 0  ⇔  a ≥ 0, b ≥ 0, ab = 0,

used when reformulating the mixed complementarity problem as the system of equations Φ(x) = 0, where
Φ : Rn → Rn . The reformulation is defined componentwise as
    Φi(x) :=  ϕ(xi − li, Fi(x))                 if −∞ < li < ui = ∞,
              −ϕ(ui − xi, −Fi(x))               if −∞ = li < ui < ∞,
              ϕ(xi − li, ϕ(ui − xi, −Fi(x)))    if −∞ < li < ui < ∞,
              −Fi(x)                            if −∞ = li < ui = ∞,
              li − xi                           if −∞ < li = ui < ∞.
We note that Φ is not differentiable everywhere but satisfies a semismoothness property [Mif77] [Qi93] [QS93].
Furthermore, the natural merit function, Ψ(x) := 12 kΦ(x)k22 , is continuously differentiable.
The two semismooth TAO solvers both solve the system Φ(x) = 0 by applying a nonsmooth Newton method
with a line search. We calculate a direction, dk , by solving the system H k dk = −Φ(xk ), where H k is an
element of the B-subdifferential [QS93] of Φ at xk . If the direction calculated does not satisfy a suitable
descent condition, then we use the negative gradient of the merit function, −∇Ψ(xk ), as the search direction.
A standard Armijo search [Arm66] is used to find the new iteration. Nonmonotone searches [GLL86] are
also available by setting appropriate runtime options. See Line Searches for further details.
The first semismooth algorithm available in TAO is not guaranteed to remain feasible with respect to the
bounds, [ℓ, u], and is termed an infeasible semismooth method. This method can be specified by using the
tao_ssils solver. In this case, the descent test used is that
    ∇Ψ(xk)ᵀ dk ≤ −δ ‖dk‖^ρ.
Both δ > 0 and ρ > 2 can be modified by using the runtime options -tao_ssils_delta <delta> and
-tao_ssils_rho <rho>, respectively. By default, δ = 10−10 and ρ = 2.1.
An alternative is to remain feasible with respect to the bounds by using a projected Armijo line search. This
method can be specified by using the tao_ssfls solver. The descent test used is the same as above where
the direction in this case corresponds to the first part of the piecewise linear arc searched by the projected
line search. Both δ > 0 and ρ > 2 can be modified by using the runtime options -tao_ssfls_delta
<delta> and -tao_ssfls_rho <rho> respectively. By default, δ = 10−10 and ρ = 2.1.
The recommended algorithm is the infeasible semismooth method, tao_ssils, because of its strong global
and local convergence properties. However, if it is known that F is not defined outside of the box, [ℓ, u], per-
haps because of the presence of log functions, the feasibility-enforcing version of the algorithm, tao_ssfls,
is a reasonable alternative.
Active-Set Methods
TAO also contains two active-set semismooth methods for solving complementarity problems. These meth-
ods solve a reduced system constructed by block elimination of active constraints. The subdifferential in
these cases enables this block elimination.
The first active-set semismooth algorithm available in TAO is not guaranteed to remain feasible with respect
to the bounds, [ℓ, u], and is termed an infeasible active-set semismooth method. This method can be specified
by using the tao_asils solver.
An alternative is to remain feasible with respect to the bounds by using a projected Armijo line search. This
method can be specified by using the tao_asfls solver.
Quadratic Solvers
Quadratic solvers address bound-constrained problems with a quadratic objective, that is, problems in which
the gradient and the Hessian of the objective are both constant.
The GPCG [MoreT91] algorithm is much like the TRON algorithm, discussed in Section Trust-Region Newton
Method (TRON), except that it assumes that the objective function is quadratic and convex. Therefore,
it evaluates the function, gradient, and Hessian only once. Since the objective function is quadratic, the
algorithm does not use a trust region. All the options that apply to TRON except for trust-region options
also apply to GPCG. It can be set by using the TAO solver tao_gpcg or via the option flag -tao_type
gpcg.
The BQPIP algorithm is an interior-point method for bound constrained quadratic optimization. It can be
set by using the TAO solver tao_bqpip or via the option flag -tao_type bqpip. Since it assumes
the objective function is quadratic, it evaluates the function, gradient, and Hessian only once. This method
also requires the solution of systems of linear equations, whose solver can be accessed and modified with the
command TaoGetKSP().
BMRM is a numerical approach to optimizing an unconstrained objective in the form of f(x) + 0.5 λ‖x‖².
Here f is a convex function that is finite on the whole space, λ is a positive weight parameter, and ‖x‖ is
the Euclidean norm of x. The algorithm only requires a routine which, given an x, returns the value of f(x)
and the gradient of f at x.
OWLQN [AG07] is a numerical approach to optimizing an unconstrained objective in the form of f(x) + λ‖x‖₁.
Here f is a convex and differentiable function, λ is a positive weight parameter, and ‖x‖₁ is the ℓ1 norm of
x: Σᵢ |xᵢ|. The algorithm only requires evaluating the value of f and its gradient.
The TRON [LMore99] algorithm is an active-set method that uses a combination of gradient projections
and a preconditioned conjugate gradient method to minimize an objective function. Each iteration of the
TRON algorithm requires function, gradient, and Hessian evaluations. In each iteration, the algorithm first
applies several conjugate gradient iterations. After these iterations, the TRON solver momentarily ignores the
variables that equal one of their bounds and applies a preconditioned conjugate gradient method to a quadratic
model of the remaining set of free variables.
The TRON algorithm solves a reduced linear system defined by the rows and columns corresponding to the
variables that lie between the upper and lower bounds. The TRON algorithm applies a trust region to the
conjugate gradients to ensure convergence. The initial trust-region radius can be set by using the command
TaoSetInitialTrustRegionRadius(), and the current trust region size can be found by using the
command TaoGetCurrentTrustRegionRadius(). The initial trust region can significantly alter the
rate of convergence for the algorithm and should be tuned and adjusted for optimal performance.
This algorithm will be deprecated in the next version in favor of the Bounded Newton Trust Region (BNTR)
algorithm.
BLMVM is a limited-memory, variable-metric method and is the bound-constrained variant of the LMVM
method for unconstrained optimization. It uses projected gradients to approximate the Hessian, eliminating
the need for Hessian evaluations. The method can be set by using the TAO solver tao_blmvm. For
more details, please see the LMVM section in the unconstrained algorithms as well as the LMVM matrix
documentation in the PETSc manual.
This algorithm will be deprecated in the next version in favor of the Bounded Quasi-Newton Line Search
(BQNLS) algorithm.
This section discusses options and routines that apply to most TAO solvers and problem classes. In particular,
we focus on linear solvers, convergence tests, and line searches.
Linear Solvers
One of the most computationally intensive phases of many optimization algorithms involves the solution of
linear systems of equations. The performance of the linear solver may be critical to an efficient computation
of the solution. Since linear equation solvers often have a wide variety of options associated with them, TAO
allows the user to access the linear solver with the TaoGetKSP() command. With access to the KSP object,
users can customize it for their application to achieve improved
performance. Additional details on the KSP options in PETSc can be found in the User-Guide.
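For example, the following sketch retrieves the KSP from a TAO solver and adjusts its tolerance and preconditioner; any other KSP or PC setting could be changed in the same way.

/* Sketch: customize the linear solver used inside a TAO algorithm */
KSP ksp;
PC  pc;

PetscCall(TaoGetKSP(tao, &ksp));
if (ksp) {                                   /* not every TAO algorithm creates a KSP */
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(KSPSetTolerances(ksp, 1.e-8, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT));
  PetscCall(PCSetType(pc, PCJACOBI));
  PetscCall(KSPSetFromOptions(ksp));         /* still allow -ksp_* / -pc_* overrides */
}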
Monitors
By default the TAO solvers run silently without displaying information about the iterations. The user can
initiate monitoring with the TaoSetMonitor() command. The routine mon indicates a user-defined monitoring
routine, and the void* argument denotes an optional user-defined context for private data for the monitor
routine.
The routine set by TaoSetMonitor() is called once during each iteration of the optimization solver.
Hence, the user can employ this routine for any application-specific computations that should be done after
the solution update.
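A minimal sketch of a user-defined monitor is shown below; the callback signature and the TaoGetSolutionStatus() argument list follow the TAO manual pages and should be verified against the installed version.

/* Sketch of a user monitor that prints the iteration, objective value, and
   gradient norm. Registered with no private context and no destroy routine. */
static PetscErrorCode MyMonitor(Tao tao, void *ctx)
{
  PetscInt           its;
  PetscReal          f, gnorm, cnorm, xdiff;
  TaoConvergedReason reason;

  PetscFunctionBeginUser;
  PetscCall(TaoGetSolutionStatus(tao, &its, &f, &gnorm, &cnorm, &xdiff, &reason));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "iter %" PetscInt_FMT ": f = %g, ||g|| = %g\n",
                        its, (double)f, (double)gnorm));
  PetscFunctionReturn(PETSC_SUCCESS);
}

PetscCall(TaoSetMonitor(tao, MyMonitor, NULL, NULL));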
Convergence Tests
Convergence of a solver can be defined in many ways. The methods TAO uses by default are mentioned
in Convergence. These methods include absolute and relative convergence tolerances as well as a maximum
number of iterations or function evaluations. If these choices are not sufficient, the user can specify a
customized test.
Users can set their own customized convergence tests of the form PetscErrorCode conv(Tao tao, void *ctx).
The second argument is a pointer to a structure defined by the user. Within this routine, the solver can be
queried for the solution vector, gradient vector, or other statistics at the current iteration through routines
such as TaoGetSolutionStatus() and TaoGetTolerances().
To use this convergence test within a TAO solver, one uses the command TaoSetConvergenceTest().
The second argument of this command is the convergence routine, and the final argument of the convergence
test routine denotes an optional user-defined context for private data. The convergence routine receives the
TAO solver and this private data structure. The termination flag can be set by using the routine
TaoSetConvergedReason(Tao, TaoConvergedReason);
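Putting these pieces together, a custom convergence test might be sketched as follows; the tolerances used here are arbitrary, and the assumed callback signature should be checked against the TaoSetConvergenceTest() manual page.

/* Sketch: stop when the gradient norm drops below 1e-10 or after 500 iterations. */
static PetscErrorCode MyConvergenceTest(Tao tao, void *ctx)
{
  PetscInt           its;
  PetscReal          f, gnorm, cnorm, xdiff;
  TaoConvergedReason reason;

  PetscFunctionBeginUser;
  PetscCall(TaoGetSolutionStatus(tao, &its, &f, &gnorm, &cnorm, &xdiff, &reason));
  if (gnorm < 1.e-10)  PetscCall(TaoSetConvergedReason(tao, TAO_CONVERGED_GATOL));
  else if (its >= 500) PetscCall(TaoSetConvergedReason(tao, TAO_DIVERGED_MAXITS));
  else                 PetscCall(TaoSetConvergedReason(tao, TAO_CONTINUE_ITERATING));
  PetscFunctionReturn(PETSC_SUCCESS);
}

PetscCall(TaoSetConvergenceTest(tao, MyConvergenceTest, NULL));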
Line Searches
The line search used by a TAO solver can be selected by using the command line option -tao_ls_type.
Available line searches include Moré-Thuente [MoreT92], Armijo, gpcg, and unit.
The line search routines involve several parameters, which are set to defaults that are reasonable for many
applications. The user can override the defaults by using the following options
• -tao_ls_max_funcs <max>
• -tao_ls_stepmin <min>
• -tao_ls_stepmax <max>
• -tao_ls_ftol <ftol>
• -tao_ls_gtol <gtol>
• -tao_ls_rtol <rtol>
One should run a TAO program with the option -help for details. Users may write their own customized
line search codes by modeling them after one of the defaults provided.
Recycling History
Some TAO algorithms can re-use information accumulated in the previous TaoSolve() call to hot-start
the new solution. This can be enabled using the -tao_recycle_history flag, or in code via the
TaoSetRecycleHistory() interface.
For the nonlinear conjugate gradient solver (TAOBNCG), this option re-uses the latest search direction from
the previous TaoSolve() call to compute the initial search direction of a new TaoSolve(). By default,
the feature is disabled and the algorithm sets the initial direction as the negative gradient.
For the quasi-Newton family of methods (TAOBQNLS, TAOBQNKLS, TAOBQNKTR, TAOBQNKTL), this option
re-uses the accumulated quasi-Newton Hessian approximation from the previous TaoSolve() call. By
default, the feature is disabled and the algorithm will reset the quasi-Newton approximation to the identity
matrix at the beginning of every new TaoSolve().
The option flag has no effect on other TAO solvers.
One of the strengths of both TAO and PETSc is the ability to allow users to extend the built-in solvers with
new user-defined algorithms. It is certainly possible to develop new optimization algorithms outside of the
TAO framework, but using TAO to implement a solver has many advantages:
1. TAO includes other optimization solvers with an identical interface, so application problems may
conveniently switch solvers to compare their effectiveness.
2. TAO provides support for function evaluations and derivative information. It allows for the direct
evaluation of this information by the application developer, contains limited support for finite difference
approximations, and allows the use of matrix-free methods. The solvers can obtain this function and
derivative information through a simple interface while the details of its computation are handled
within the toolkit.
3. TAO provides line searches, convergence tests, monitoring routines, and other tools that are helpful in
an optimization algorithm. The availability of these tools means that the developers of the optimization
solver do not have to write these utilities.
4. PETSc offers vectors, matrices, index sets, and linear solvers that can be used by the solver. These ob-
jects are standard mathematical constructions that have many different implementations. The objects
may be distributed over multiple processors, restricted to a single processor, have a dense representa-
tion, use a sparse data structure, or vary in many other ways. TAO solvers do not need to know how
these objects are represented or how the operations defined on them have been implemented. Instead,
the solvers apply these operations through an abstract interface that leaves the details to PETSc and
external libraries. This abstraction allows solvers to work seamlessly with a variety of data structures
while allowing application developers to select data structures tailored for their purposes.
5. PETSc provides the user a convenient method for setting options at runtime, performance profiling,
and debugging.
Header File
TAO solver implementation files must include the TAO implementation file taoimpl.h:
#include "petsc/private/taoimpl.h"
This file contains data elements that are generally kept hidden from application programmers, but may be
necessary for solver implementations to access.
TAO solvers must be written in C or C++ and include several routines with a particular calling sequence.
Two of these routines are mandatory: one that initializes the TAO structure with the appropriate information
and one that applies the algorithm to a problem instance. Additional routines may be written to set options
within the solver, view the solver, setup appropriate data structures, and destroy these data structures. In
order to implement the conjugate gradient algorithm, for example, the following structure is useful.
typedef struct {
  PetscReal beta;
  PetscReal eta;
  PetscInt  ngradsteps;
  PetscInt  nresetsteps;
  Vec       X_old;
  Vec       G_old;
} TAO_CG;
This structure contains two parameters, two counters, and two work vectors. Vectors for the solution and
gradient are not needed here because the TAO structure has pointers to them.
Solver Routine
All TAO solvers have a routine that accepts a TAO structure and computes a solution. TAO will call this
routine when the application program uses the routine TaoSolve() and will pass to the solver information
about the objective function and constraints, pointers to the variable vector and gradient vector, and support
for line searches, linear solvers, and convergence monitoring. As an example, consider the following code
that solves an unconstrained minimization problem using the conjugate gradient method.
PetscErrorCode TaoSolve_CG(Tao tao)
{
TAO_CG *cg = (TAO_CG *) tao->data;
Vec x = tao->solution;
Vec g = tao->gradient;
Vec s = tao->stepdirection;
PetscInt iter=0;
PetscReal gnormPrev,gdx,f,gnorm,steplength=0;
TaoLineSearchConvergedReason lsflag=TAO_LINESEARCH_CONTINUE_ITERATING;
TaoConvergedReason reason=TAO_CONTINUE_ITERATING;
PetscFunctionBegin;
PetscCall(TaoComputeObjectiveAndGradient(tao,x,&f,g));
PetscCall(VecNorm(g,NORM_2,&gnorm));
PetscCall(VecSet(s,0));
cg->beta=0;
gnormPrev = gnorm;
/* Enter loop */
while (1){
cg->beta=(gnorm*gnorm)/(gnormPrev*gnormPrev);
PetscCall(VecScale(s,cg->beta));
PetscCall(VecAXPY(s,-1.0,g));
PetscCall(VecDot(s,g,&gdx));
if (gdx>=0){ /* If not a descent direction, use gradient */
PetscCall(VecCopy(g,s));
PetscCall(VecScale(s,-1.0));
gdx=-gnorm*gnorm;
}
    /* Line search, monitoring, and convergence check elided here */
  }
PetscFunctionReturn(PETSC_SUCCESS);
}
The first line of this routine casts the solver context, tao->data, to a pointer to a TAO_CG data structure,
which holds the parameters, counters, and work vectors needed by the algorithm.
After declaring and initializing several variables, the solver lets TAO evaluate the function and gradient at the
current point using the routine TaoComputeObjectiveAndGradient(). Other routines may be
used to evaluate the Hessian matrix or evaluate constraints. TAO may obtain this information using direct
evaluation or other means, but these details do not affect our implementation of the algorithm.
The norm of the gradient is a standard measure used by unconstrained minimization solvers to define
convergence. This quantity is always nonnegative and equals zero at the solution. The solver will pass this
quantity, the current function value, the current iteration number, and a measure of infeasibility to TAO
with the routine
Most optimization algorithms are iterative, and solvers should include this command somewhere in each
iteration. This routine records this information, and applies any monitoring routines and convergence tests
set by default or the user. In this routine, the second argument is the current iteration number, and the
third argument is the current function value. The fourth argument is a nonnegative error measure associated
with the distance between the current solution and the optimal solution. Examples of this measure are the
norm of the gradient or the square root of a duality gap. The fifth argument is a nonnegative error that
usually represents a measure of the infeasibility such as the norm of the constraints or violation of bounds.
This number should be zero for unconstrained solvers. The sixth argument is a nonnegative steplength, or
the multiple of the step direction added to the previous iterate. The results of the convergence test are
returned in the last argument. If the termination reason is TAO_CONTINUE_ITERATING, the algorithm
should continue.
After this monitoring routine, the solver computes a step direction using the conjugate gradient algorithm
and computations using Vec objects. These methods include adding vectors together and computing an
inner product. A full list of these methods can be found in the manual pages.
Nonlinear conjugate gradient algorithms also require a line search. TAO provides several line searches and
support for using them. The routine
passes the current solution, gradient, and objective value to the line search and returns a new solution,
gradient, and objective value. More details on line searches can be found in Line Searches. The details of
the line search applied are specified elsewhere, when the line search is created.
TAO also includes support for linear solvers using PETSc KSP objects. Although this algorithm does not
require one, linear solvers are an important part of many algorithms. Details on the use of these solvers can
be found in the PETSc users manual.
Creation Routine
The TAO solver is initialized for a particular algorithm in a separate routine. This routine sets default
convergence tolerances, creates a line search or linear solver if needed, and creates structures needed by this
solver. For example, the routine that creates the nonlinear conjugate gradient algorithm shown above can
be implemented as follows.
EXTERN_C_BEGIN
PetscErrorCode TaoCreate_CG(Tao tao)
{
  TAO_CG *cg;

  PetscFunctionBegin;
  PetscCall(PetscNew(&cg));
  tao->data = (void*)cg;
cg->eta = 0.1;
cg->delta_min = 1e-7;
cg->delta_max = 100;
cg->cg_type = CG_PolakRibierePlus;
tao->max_it = 2000;
tao->max_funcs = 4000;
tao->ops->setup = TaoSetUp_CG;
tao->ops->solve = TaoSolve_CG;
tao->ops->view = TaoView_CG;
tao->ops->setfromoptions = TaoSetFromOptions_CG;
tao->ops->destroy = TaoDestroy_CG;
PetscCall(TaoLineSearchCreate(((PetscObject)tao)->comm, &tao->linesearch));
PetscCall(TaoLineSearchSetType(tao->linesearch, morethuente_type));
PetscCall(TaoLineSearchUseTaoRoutines(tao->linesearch, tao));
PetscFunctionReturn(PETSC_SUCCESS);
}
EXTERN_C_END
This routine declares some variables and then allocates memory for the TAO_CG data structure. Notice that
the Tao object now has a pointer to this data structure (tao->data) so it can be accessed by the other
functions written for this solver implementation.
This routine also sets some default parameters particular to the conjugate gradient algorithm, sets default
convergence tolerances, and creates a particular line search. These defaults could be specified in the routine
that solves the problem, but specifying them here gives the user the opportunity to modify these parameters
either by using direct calls setting parameters or by using options.
Finally, this solver passes to TAO the names of all the other routines used by the solver.
Note that the lines EXTERN_C_BEGIN and EXTERN_C_END surround this routine. These macros are re-
quired to preserve the name of this function without any name-mangling from the C++ compiler (if used).
Destroy Routine
Another routine needed by most solvers destroys the data structures created by earlier routines. For the
nonlinear conjugate gradient method discussed earlier, the following routine destroys the two work vectors
and the TAO_CG structure.
PetscErrorCode TaoDestroy_CG(Tao tao)
{
  TAO_CG *cg = (TAO_CG *)tao->data;

  PetscFunctionBegin;
PetscCall(VecDestroy(&cg->X_old));
PetscCall(VecDestroy(&cg->G_old));
PetscFree(tao->data);
tao->data = NULL;
PetscFunctionReturn(PETSC_SUCCESS);
}
This routine is called from within the TaoDestroy() routine. Only algorithm-specific data objects are
destroyed in this routine; any objects indexed by TAO (tao->linesearch, tao->ksp, tao->gradient,
etc.) will be destroyed by TAO immediately after the algorithm-specific destroy routine completes.
SetUp Routine
If the SetUp routine has been set by the initialization routine, TAO will call it during the execution of
TaoSolve(). While this routine is optional, it is often provided to allocate the gradient vector, work
vectors, and other data structures required by the solver. It should have the following form.
PetscErrorCode TaoSetUp_CG(Tao tao)
{
  TAO_CG *cg = (TAO_CG *)tao->data;

  PetscFunctionBegin;
  PetscCall(VecDuplicate(tao->solution,&tao->gradient));
PetscCall(VecDuplicate(tao->solution,&tao->stepdirection));
PetscCall(VecDuplicate(tao->solution,&cg->X_old));
PetscCall(VecDuplicate(tao->solution,&cg->G_old));
PetscFunctionReturn(PETSC_SUCCESS);
}
SetFromOptions Routine
The SetFromOptions routine should be used to check for any algorithm-specific options set by the user and
will be called when the application makes a call to TaoSetFromOptions(). It should have the following
form.
PetscErrorCode TaoSetFromOptions_CG(Tao tao, PetscOptionItems *PetscOptionsObject)
{
  PetscFunctionBegin;
  /* Query any algorithm-specific options (e.g., with PetscOptionsReal()) here */
  PetscFunctionReturn(PETSC_SUCCESS);
}
View Routine
The View routine should be used to output any algorithm-specific information or statistics at the end of a
solve. This routine will be called when the application makes a call to TaoView() or when the command
line option -tao_view is used. It should have the following form.
PetscErrorCode TaoView_CG(Tao tao, PetscViewer viewer)
{
  TAO_CG *cg = (TAO_CG *)tao->data;

  PetscFunctionBegin;
PetscCall(PetscViewerASCIIPushTab(viewer));
PetscCall(PetscViewerASCIIPrintf(viewer,"Grad. steps: %d\n",cg->ngradsteps));
PetscCall(PetscViewerASCIIPrintf(viewer,"Reset steps: %d\n",cg->nresetsteps));
PetscCall(PetscViewerASCIIPopTab(viewer));
PetscFunctionReturn(PETSC_SUCCESS);
}
Once a new solver is implemented, TAO needs to know the name of the solver and what function to use to create the solver. To this end, one can use the routine TaoRegister(), where name is the name of the solver (e.g., tao_blmvm), path is the path to the library containing the solver, cname is the name of the routine that creates the solver (in our case, TaoCreate_CG), and create is a pointer to that creation routine. If one is using dynamic loading, then the fourth argument will be ignored.
Once the solver has been registered, the new solver can be selected either by using the TaoSetType()
function or by using the -tao_type command line option.
THREE
DM: INTERFACING BETWEEN SOLVERS AND MODELS/DISCRETIZATIONS
3.1 DM Basics
The previous chapters have focused on the core numerical solvers in PETSc. However, numerical solvers will not be widely used unless there are efficient ways (in both human and machine time) to connect them to the mathematical models and discretizations on which people wish to build their simulations. Thus PETSc provides a set of abstractions, represented by the DM object, that provide a powerful, comprehensive mechanism for translating the problem specification of a model and its discretization into the language and API of the solvers.
Some of the model classes DM currently supports are PDEs on structured and staggered grids with finite
difference methods (DMDA and DMSTAG – DMSTAG: Staggered, Structured Grid), PDEs on unstructured
grids with finite element and finite volume methods (DMPLEX – DMPlex: Unstructured Grids), PDEs on
quad and octree-grids (DMFOREST), models on networks (graphs) such as the power grid or river networks
(DMNETWORK – Networks), and particle-in-cell simulations (DMSWARM).
In previous chapters, we have demonstrated some simple usage of DM to provide the input for the solvers.
In this chapter, and those that follow, we will dive deep into the capabilities of DM.
It is possible to create a DM with
DM dm;
DMCreate(MPI_Comm comm, DM *dm);
DMSetType(DM dm, DMType type);
but more commonly, a DM is created with a type-specific constructor; the construction process for each type
of DM is discussed in the sections on each DMType. This chapter focuses on commonalities between all the
DM so we assume the DM already exists and we wish to work with it.
As discussed earlier, a DM can construct vectors and matrices appropriate for a model and discretization and
provide the mapping between the global and local vector representations.
The matrices produced may support MatSetValuesLocal(), allowing one to work with the local numbering on each MPI rank. For DMDA one can also use MatSetValuesStencil(), and for DMSTAG one can use DMStagMatSetValuesStencil().
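For example, a global vector, a local (ghosted) vector, and a properly preallocated matrix can be obtained from any DM; a minimal sketch (the variable names are illustrative):
Vec global, local;
Mat A;

PetscCall(DMCreateGlobalVector(dm, &global));
PetscCall(DMCreateLocalVector(dm, &local));
/* communicate ghost values from the global to the local representation */
PetscCall(DMGlobalToLocalBegin(dm, global, INSERT_VALUES, local));
PetscCall(DMGlobalToLocalEnd(dm, global, INSERT_VALUES, local));
/* a matrix with the sparsity pattern implied by the DM */
PetscCall(DMCreateMatrix(dm, &A));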
A given DM can be refined for certain DMTypes with DMRefine() or coarsened with DMCoarsen(). Map-
pings between DMs may be obtained with routines such as DMCreateInterpolation(), DMCreateRe-
striction() and DMCreateInjection().
One attaches a DM to a solver object with KSPSetDM(), SNESSetDM(), or TSSetDM().
Once the DM is attached, the solver can utilize it to create and process much of the data that the solver needs
to set up and implement its solve. For example, with PCMG simply providing a DM can allow it to create all
the data structures needed to run geometric multigrid on your problem.
SNES Tutorial ex19 demonstrates how this may be done with DMDA.
3.2 DMPlex: Unstructured Grids
This section introduces the DMPLEX subclass of DM, which allows the user to handle unstructured grids using
the generic DM interface for hierarchy and multi-physics. DMPLEX was created to remedy a huge problem in
all current PDE simulation codes, namely that the discretization was so closely tied to the data layout and
solver that switching discretizations in the same code was not possible. Not only does this preclude the kind
of comparison that is necessary for scientific investigation, but it makes library (as opposed to monolithic
application) development impossible.
The main advantage of DMPLEX in representing topology is that it treats all the different pieces of a mesh,
e.g. cells, faces, edges, and vertices, in the same way. This allows the interface to be small and simple, while
remaining flexible and general. This also allows “dimension independent programming”, which means that
the same algorithm can be used unchanged for meshes of different shapes and dimensions.
All pieces of the mesh (vertices, edges, faces, and cells) are treated as points, which are each identified by a
PetscInt. A mesh is built by relating points to other points, in particular specifying a “covering” relation
among the points. For example, an edge is defined by being covered by two vertices, and a triangle can
be defined by being covered by three edges (or even by three vertices). This structure is known as a Hasse
Diagram, which is a Directed Acyclic Graph (DAG) representing a cell complex using the covering relation.
The graph edges represent the relation, which also encodes a partially ordered set (poset).
For example, we can encode the doublet mesh as in Fig. 3.1,
which can also be represented as the DAG in Fig. 3.2.
To use the PETSc API, we consecutively number the mesh pieces. The PETSc convention in 3 dimensions is
to number first cells, then vertices, then faces, and then edges. In 2 dimensions the convention is to number
faces, vertices, and then edges. In terms of the labels in Fig. 3.1, these numberings are
f0 ↦ 0, f1 ↦ 1, v0 ↦ 2, v1 ↦ 3, v2 ↦ 4, v3 ↦ 5, e0 ↦ 6, e1 ↦ 7, e2 ↦ 8, e3 ↦ 9, e4 ↦ 10
DMPlexSetChart(dm, 0, 11);
Note that a chart here corresponds to a semi-closed interval (e.g. [0, 11) = {0, 1, . . . , 10}) specifying the range of indices we’d like to use to define points on the current rank.
Fig. 3.2: The Hasse diagram for our 2D doublet mesh, expressed as a DAG.
We then define the covering relation, which we call the cone, and which gives the in-edges of the DAG. In order to preallocate correctly, we first provide cone sizes with DMPlexSetConeSize(), and then the point values with DMPlexSetCone() (recall each point is an integer that represents a single geometric entity: a cell, face, edge, or vertex).
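A sketch of both steps for the doublet follows; the particular point values placed in each cone are assumptions here and should be read off Fig. 3.1.
const PetscInt cone0[3] = {6, 7, 8};  /* cell 0 is covered by three edges (values assumed)  */
const PetscInt cone1[3] = {7, 9, 10}; /* cell 1 shares edge 7 with cell 0 (values assumed)  */
const PetscInt cone6[2] = {2, 3};     /* edge 6 is covered by two vertices (values assumed) */
PetscInt       p;

/* sizes: cells have 3 edges in their cone, edges have 2 vertices, vertices have none */
PetscCall(DMPlexSetConeSize(dm, 0, 3));
PetscCall(DMPlexSetConeSize(dm, 1, 3));
for (p = 6; p < 11; ++p) PetscCall(DMPlexSetConeSize(dm, p, 2));
PetscCall(DMSetUp(dm)); /* allocate cone storage */
/* point values */
PetscCall(DMPlexSetCone(dm, 0, cone0));
PetscCall(DMPlexSetCone(dm, 1, cone1));
PetscCall(DMPlexSetCone(dm, 6, cone6));
/* ... the remaining edges are set analogously ... */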
There is also an API for providing the dual relation, using DMPlexSetSupportSize() and DMPlexSet-
Support(), but this can be calculated automatically using the provided DMPlexSetConeSize() and
DMPlexSetCone() information and then calling
DMPlexSymmetrize(dm);
The “symmetrization” is in the sense of the DAG. Each point knows its covering (cone) and each point
knows what it covers (support). Note that when using automatic symmetrization, cones will be ordered but
supports will not. The user can enforce an ordering on supports by rewriting them after symmetrization
using DMPlexSetSupport().
In order to support efficient queries, we construct fast search structures and indices for the different types
of points using
DMPlexStratify(dm);
Plex allows you to represent periodic domains in two ways. Using the default scheme, periodic topology can be represented directly. This ensures that all topological queries can be satisfied, but care must then be taken in representing functions over the mesh, such as the coordinates. The second method is to use a non-periodic topology, but connect certain mesh points using the local-to-global map for that DM. This allows a more general set of mappings to be implemented, such as partial twists, but topological queries on the periodic boundary cease to function.
For the default scheme, a call to DMLocalizeCoordinates() (which usually happens automatically on mesh
creation) creates a second, discontinuous coordinate field. These values can be accessed using DMGetCell-
Coordinates() and DMGetCellCoordinatesLocal(). Plex provides a convenience method, DMPlexGetCellCo-
ordinates(), that extracts cell coordinates correctly, depending on the periodicity of the mesh. An example
of its use is shown below:
const PetscScalar *array;
PetscScalar       *coords = NULL;
PetscInt           numCoords, d;
PetscBool          isDG;

PetscCall(DMPlexGetCellCoordinates(dm, cell, &isDG, &numCoords, &array, &coords));
PetscCall(PetscPrintf(PETSC_COMM_SELF, "("));
for (d = 0; d < numCoords; ++d) PetscCall(PetscPrintf(PETSC_COMM_SELF, " %g", (double)PetscRealPart(coords[d])));
PetscCall(PetscPrintf(PETSC_COMM_SELF, ")\n"));
PetscCall(DMPlexRestoreCellCoordinates(dm, cell, &isDG, &numCoords, &array, &coords));
Specific entries (or collections of entries) in a Vec (or a simple array) can be associated with a “location” on a mesh (or other types of data structure) using the PetscSection object. A point is a PetscInt that serves as an abstract “index” into arrays from iterable sets, such as points on a mesh.
PetscSection has two modes of operation.
Mode 1:
A PetscSection associates a set of degrees of freedom (dof), a small space $\{e_k\}_{0 \le k < d_p}$, with every point. The number of dof and their meaning may be different for different points. For example, the dof on a cell point may represent pressure while a dof on a face point may represent velocity. Though points must be contiguously numbered, they can be in any range [pStart, pEnd), which is called a chart. A PetscSection in mode 1 may be thought of as defining a two-dimensional array indexed by point in the outer dimension with a variable-length inner dimension indexed by the dof at that point, v[pStart <= point < pEnd][0 <= dof < d_p]¹.
The sequence for constructing a PetscSection in mode 1 is the following: create the section with PetscSectionCreate(), set the chart with PetscSectionSetChart(), set the number of dof for each point with PetscSectionSetDof(), and finally call PetscSectionSetUp().
¹ A PetscSection can be thought of as a generalization of PetscLayout, in the same way that a fiber bundle is a generalization of the normal Euclidean basis used in linear algebra. With PetscLayout, we associate a unit vector ($e_i$) with every point in the space, and just divide up points between processes.
Mode 2:
The offset for a point is obtained with PetscSectionGetOffset(); the value in the array is then accessed with array[offset]. If there are multiple dof at a point (and field in mode 2), then array[offset + 1], etc., give access to each of those dof.
Using the mesh from Fig. 3.1, we provide an example of creating a PetscSection using mode 1. We can
lay out data for a continuous Galerkin P3 finite element method,
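A minimal sketch of such a layout, matching the P3 dof counts described next (one dof per vertex, two per edge, one per cell); the variable names are illustrative:
PetscSection s;
PetscInt     pStart, pEnd, cStart, cEnd, eStart, eEnd, vStart, vEnd, p;

PetscCall(DMPlexGetChart(dm, &pStart, &pEnd));
PetscCall(DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd)); /* cells    */
PetscCall(DMPlexGetHeightStratum(dm, 1, &eStart, &eEnd)); /* edges    */
PetscCall(DMPlexGetHeightStratum(dm, 2, &vStart, &vEnd)); /* vertices */
PetscCall(PetscSectionCreate(PetscObjectComm((PetscObject)dm), &s));
PetscCall(PetscSectionSetChart(s, pStart, pEnd));
for (p = cStart; p < cEnd; ++p) PetscCall(PetscSectionSetDof(s, p, 1)); /* cell bubble dof    */
for (p = eStart; p < eEnd; ++p) PetscCall(PetscSectionSetDof(s, p, 2)); /* two dof per edge   */
for (p = vStart; p < vEnd; ++p) PetscCall(PetscSectionSetDof(s, p, 1)); /* one dof per vertex */
PetscCall(PetscSectionSetUp(s));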
DMPlexGetHeightStratum() returns all the points of the requested height in the DAG. Since this prob-
lem is in two dimensions the edges are at height 1 and the vertices at height 2 (the cells are always at height
0). One can also use DMPlexGetDepthStratum() to use the depth in the DAG to select the points.
DMPlexGetDepth(dm,&depth) returns the depth of the DAG, hence DMPlexGetDepthStratum(dm,
depth-1-h,) returns the same values as DMPlexGetHeightStratum(dm,h,).
For P3 elements there is one degree of freedom at each vertex, 2 along each edge (resulting in a total of 4
degrees of freedom along each edge including the vertices, thus being able to reproduce a cubic function)
and 1 degree of freedom within the cell (the bubble function which is zero along all edges).
Now a PETSc local vector can be created manually using this layout,
PetscInt n;
PetscSectionGetStorageSize(s, &n);
VecCreate(PETSC_COMM_SELF, &localVec);
VecSetSizes(localVec, n, PETSC_DETERMINE);
VecSetFromOptions(localVec);
When working with DMPLEX and PetscFE (see below) one can simply get the sections (and related vectors)
with
DMSetLocalSection(dm, s);
DMGetLocalVector(dm, &localVec);
DMGetGlobalVector(dm, &globalVec);
A global vector is missing both the shared dofs which are not owned by this process, as well as constrained
dofs. These constraints represent essential (Dirichlet) boundary conditions. They are dofs that have a given
fixed value, so they are present in local vectors for assembly purposes, but absent from global vectors since
they are never solved for during algebraic solves.
We can indicate constraints in a local section using PetscSectionSetConstraintDof(), to set the
number of constrained dofs for a given point, and PetscSectionSetConstraintIndices() which
indicates which dofs on the given point are constrained. Once we have this information, a global section can
be created using PetscSectionCreateGlobalSection(), and this is done automatically by the DM. A
global section returns −(dof + 1) for the number of dofs on an unowned point, and −(of f + 1) for its offset
on the owning process. This can be used to create global vectors, just as the local section is used to create
local vectors.
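A minimal sketch of marking a constrained dof on a point v in the local section (the point and the constrained index are illustrative); the number of constrained dof is set before PetscSectionSetUp() while the indices are set afterwards:
const PetscInt cind[1] = {0}; /* constrain the first dof on point v (illustrative) */

PetscCall(PetscSectionSetConstraintDof(s, v, 1));
PetscCall(PetscSectionSetUp(s));
PetscCall(PetscSectionSetConstraintIndices(s, v, cind));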
A DM can automatically create the local section if given a description of the discretization, for example using
a PetscFE object. We demonstrate this by creating a PetscFE that can be configured from the command
line. It is a single, scalar field, and is added to the DM using DMSetField(). When a local or global vector
is requested, the DM builds the local and global sections automatically.
DMGetDimension(dm, &dim);
DMPlexIsSimplex(dm, &simplex);
PetscFECreateDefault(PETSC_COMM_SELF, dim, 1, simplex, NULL, -1, &fe);
DMSetField(dm, 0, NULL, (PetscObject) fe);
DMCreateDS(dm);
Here the call to DMSetField() declares the discretization will have one field with the integer label 0 that
has one degree of freedom at each point on the DMPlex. To get the P3 section above, we can either give the
option -petscspace_degree 3, or call PetscFECreateLagrange() and set the degree directly.
In the same way that MatPartitioning and MatGetOrdering() return the results of a partitioning or ordering of a graph defined by a sparse matrix, the results of PetscPartitionerDMPlexPartition() or DMPlexPermute() are encoded in an IS. However, the graph is not the adjacency graph of a matrix but the mesh itself. Once the mesh is partitioned and reordered, the data layout from a PetscSection can be used to automatically derive a problem partitioning/ordering.
The Jacobian of a problem represents the influence of some variable on other variables in the problem. Very
often, this influence pattern is determined jointly by the computational mesh and discretization. DMCre-
ateMatrix() must compute this pattern when it automatically creates the properly preallocated Jacobian
matrix. In DMDA the influence pattern, or what we will call variable adjacency, depends only on the stencil
since the topology is Cartesian and the discretization is implicitly finite difference.
In DMPLEX, we allow the user to specify the adjacency topologically, while maintaining good defaults.
The pattern is controlled by two flags. The first flag, useCone, indicates whether variables couple first
to their boundary2 and then to neighboring entities, or the reverse. For example, in finite elements, the
variables couple to the set of neighboring cells containing the mesh point, and we set the flag to useCone
= PETSC_FALSE. By contrast, in finite volumes, cell variables first couple to the cell boundary, and then
to the neighbors, so we set the flag to useCone = PETSC_TRUE. The second flag, useClosure, indicates
whether we consider the transitive closure of the neighbor relation above, or just a single level. For example,
in finite elements, the entire boundary of any cell couples to the interior, and we set the flag to useClosure
= PETSC_TRUE. By contrast, in most finite volume methods, cells couple only across faces, and not through
vertices, so we set the flag to useClosure = PETSC_FALSE. However, the power of this method is its
flexibility. If we wanted a finite volume method that coupled all cells around a vertex, we could easily
prescribe that by changing to useClosure = PETSC_TRUE.
2 The boundary of a cell is its faces, the boundary of a face is its edges and the boundary of an edge is the two vertices.
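The two flags can be set on the DM with DMSetBasicAdjacency(); for example (a sketch):
/* finite element style adjacency: through the closure of the star */
PetscCall(DMSetBasicAdjacency(dm, PETSC_FALSE, PETSC_TRUE));
/* finite volume style adjacency: cells couple only across faces   */
PetscCall(DMSetBasicAdjacency(dm, PETSC_TRUE, PETSC_FALSE));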
The evaluation of a residual or Jacobian has, for most discretizations, the following general form:
• Traverse the mesh, picking out pieces (which in general overlap),
• Extract some values from the current solution vector, associated with this piece,
• Calculate some values for the piece, and
• Insert these values into the residual vector
DMPlex separates these different concerns by passing sets of points from mesh traversal routines to data
extraction routines and back. In this way, the PetscSection which structures the data inside a Vec does
not need to know anything about the mesh inside a DMPLEX.
The most common mesh traversal is the transitive closure of a point, which is exactly the transitive closure
of a point in the DAG using the covering relation. In other words, the transitive closure consists of all
points that cover the given point (generally a cell) plus all points that cover those points, etc. So in 2d the
transitive closure for a cell consists of edges and vertices while in 3d it consists of faces, edges, and vertices.
Note that this closure can be calculated orienting the arrows in either direction. For example, in a finite
element calculation, we calculate an integral over each element, and then sum up the contributions to the
basis function coefficients. The closure of the element can be expressed discretely as the transitive closure
of the element point in the mesh DAG, where each point also has an orientation. Then we can retrieve the
data using PetscSection methods,
const PetscScalar *a;
PetscInt           numPoints, *points = NULL, p;
VecGetArrayRead(u, &a);
DMPlexGetTransitiveClosure(dm, cell, PETSC_TRUE, &numPoints, &points);
for (p = 0; p < numPoints*2; p += 2) {
  PetscInt dof, off, d;
  PetscSectionGetDof(section, points[p], &dof);
  PetscSectionGetOffset(section, points[p], &off);
  for (d = 0; d < dof; ++d) { /* use the values a[off+d] */ }
}
DMPlexRestoreTransitiveClosure(dm, cell, PETSC_TRUE, &numPoints, &points);
VecRestoreArrayRead(u, &a);
This operation is so common that we have built a convenience method around it which returns the values
in a contiguous array, correctly taking into account the orientations of various mesh points:
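The convenience method is DMPlexVecGetClosure(); a minimal sketch of retrieving and restoring the closure values of a cell, using the same section and vector names as above:
PetscScalar *values = NULL;
PetscInt     csize;

PetscCall(DMPlexVecGetClosure(dm, section, u, cell, &csize, &values));
/* ... use the csize values, e.g. in an element integral ... */
PetscCall(DMPlexVecRestoreClosure(dm, section, u, cell, &csize, &values));
The excerpt below uses the same traversal in the support direction (useCone = PETSC_FALSE) and accesses the data for the two retrieved points directly through the PetscSection: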
PetscScalar *a;
PetscInt points[2*2], numPoints, p, dofA, offA, dofB, offB;
VecGetArray(u, &a);
DMPlexGetTransitiveClosure(dm, cell, PETSC_FALSE, &numPoints, &points);
assert(numPoints == 2);
PetscSectionGetDof(section, points[0*2], &dofA);
PetscSectionGetDof(section, points[1*2], &dofB);
assert(dofA == dofB);
PetscSectionGetOffset(section, points[0*2], &offA);
PetscSectionGetOffset(section, points[1*2], &offB);
myfunc(a[offA], a[offB]);
VecRestoreArray(u, &a);
PETSc allows users to save/load DMPLEXs representing meshes, PetscSections representing data layouts on the meshes, and Vecs defined on the data layouts to/from an HDF5 file in parallel, where one can use a different number of processes for saving and for loading.
Saving
The simplest way to save DM data is to use options for configuration. This requires only the code
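A sketch, assuming the conventional -dm_view and -vec_view option names:
PetscCall(DMViewFromOptions(dm, NULL, "-dm_view"));
PetscCall(VecViewFromOptions(vec, NULL, "-vec_view"));
The output target is then chosen at run time, for example with -dm_view hdf5:example.h5 and -vec_view hdf5:example.h5::append.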
Options prefixes can be used to separately control the saving and loading of various fields. However, the
user can have finer grained control by explicitly creating the PETSc objects involved. To save data to
“example.h5” file, we can first create a PetscViewer of type PETSCVIEWERHDF5 in FILE_MODE_WRITE
mode as:
PetscViewer viewer;
PetscViewerHDF5Open(PETSC_COMM_WORLD, "example.h5", FILE_MODE_WRITE, &viewer);
As dm is a DMPLEX object representing a mesh, we first give it a mesh name, “plexA”, and save it as:
PetscObjectSetName((PetscObject)dm, "plexA");
PetscViewerPushFormat(viewer, PETSC_VIEWER_HDF5_PETSC);
DMView(dm, viewer);
PetscViewerPopFormat(viewer);
Alternatively, the individual parts of the mesh can be saved with:
DMPlexTopologyView(dm, viewer);
DMPlexCoordinatesView(dm, viewer);
DMPlexLabelsView(dm, viewer);
If the mesh name is not explicitly set, a default name is used. In the above, the PETSC_VIEWER_HDF5_PETSC format was used to save the entire representation of the mesh. This format also saves global point numbers attached to the mesh points. In this example the set of all global point numbers is X = [0, 11).
The data layout, s, needs to be wrapped in a DM object for it to be saved. Here, we create the wrapping DM,
sdm, with DMClone(), give it a dm name, “dmA”, attach s to sdm, and save it as:
DMClone(dm, &sdm);
PetscObjectSetName((PetscObject)sdm, "dmA");
DMSetLocalSection(sdm, s);
DMPlexSectionView(dm, viewer, sdm);
If the dm name is not explicitly set, a default name is used. In the above, instead of using DMClone(),
one could also create a new DMSHELL object to attach s to. The first argument of DMPlexSectionView()
is a DMPLEX object that represents the mesh, and the third argument is a DM object that carries the data
layout that we would like to save. They are, in general, two different objects, and the former carries a mesh
name, while the latter carries a dm name. These names are used to construct a group structure in the
HDF5 file. Note that the data layout points are associated with the mesh points, so each of them can also
be tagged with a global point number in X; DMPlexSectionView() saves these tags along with the data
layout itself, so that, when the mesh and the data layout are loaded separately later, one can associate the
points in the former with those in the latter by comparing their global point numbers.
We now create a local vector associated with sdm, e.g., as:
Vec vec;
DMGetLocalVector(sdm, &vec);
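We then name the vector and save it; a sketch using DMPlexLocalVectorView() (the name “vecA” matches the one used when loading later):
PetscCall(PetscObjectSetName((PetscObject)vec, "vecA"));
PetscCall(DMPlexLocalVectorView(dm, viewer, sdm, vec));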
A global vector can be saved in the exact same way with trivial changes.
After saving, we destroy the PetscViewer with:
PetscViewerDestroy(&viewer);
Since PETSc 3.19, we offer a new format which supports parallel loading. To write in this format, you
currently need to specify it explicitly using the option
-dm_plex_view_hdf5_storage_version 3.0.0
The storage version is stored in the file and set automatically when loading (described below). You can
check the storage version of the HDF5 file with
To allow for a simple and efficient implementation, and good load balancing, the 3.0.0 format changes the way the mesh topology is stored. Different strata (sets of mesh entities of equal dimension, i.e., vertices, edges, faces, or cells) are now stored separately. The new structure of /topologies/<mesh_name>/topology is as follows:
Loading
To load, we open a PetscViewer for “example.h5” in FILE_MODE_READ mode, create a DMPLEX object, give it a mesh name, “plexA”, and load the mesh as:
DMCreate(PETSC_COMM_WORLD, &dm);
DMSetType(dm, DMPLEX);
PetscObjectSetName((PetscObject)dm, "plexA");
PetscViewerPushFormat(viewer, PETSC_VIEWER_HDF5_PETSC);
DMLoad(dm, viewer);
PetscViewerPopFormat(viewer);
where the PETSC_VIEWER_HDF5_PETSC format was again used. The user can have more control by replacing the single load call with
PetscSF sfO;
DMCreate(PETSC_COMM_WORLD, &dm);
DMSetType(dm, DMPLEX);
PetscObjectSetName((PetscObject)dm, "plexA");
PetscViewerPushFormat(viewer, PETSC_VIEWER_HDF5_PETSC);
DMPlexTopologyLoad(dm, viewer, &sfO);
DMPlexCoordinatesLoad(dm, viewer, sfO);
PetscViewerPopFormat(viewer);
The object returned by DMPlexTopologyLoad(), sfO, is a PetscSF that pushes forward X to the loaded
mesh, dm; this PetscSF is constructed with the global point number tags that we saved along with the
mesh points.
As the DMPLEX mesh just loaded might not have a desired distribution, it is common to redistribute the
mesh for a better distribution using DMPlexDistribute(), e.g., as:
DM distributedDM;
PetscInt overlap = 1;
PetscSF sfDist, sf;
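A sketch of the redistribution and of the composition of the two PetscSFs described next (the cleanup calls are illustrative):
PetscCall(DMPlexDistribute(dm, overlap, &sfDist, &distributedDM));
if (distributedDM) {
  PetscCall(DMDestroy(&dm));
  dm = distributedDM;
  PetscCall(PetscObjectSetName((PetscObject)dm, "plexA"));
}
PetscCall(PetscSFCompose(sfO, sfDist, &sf));
PetscCall(PetscSFDestroy(&sfO));
PetscCall(PetscSFDestroy(&sfDist));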
Note that the new DMPLEX does not automatically inherit the mesh name, so we need to name it “plexA”
once again. sfDist is a PetscSF that pushes forward the loaded mesh to the redistributed mesh, so,
composed with sfO, it makes the PetscSF that pushes forward X directly to the redistributed mesh, which
we call sf.
We then create a new DM, sdm, with DMClone(), give it a dm name, “dmA”, and load the on-disk data
layout into sdm as:
DMClone(dm, &sdm);
PetscObjectSetName((PetscObject)sdm, "dmA");
DMPlexSectionLoad(dm, viewer, sdm, sf, &globalDataSF, &localDataSF);
where we could also create a new DMSHELL object instead of using DMClone(). Each point in the on-disk
data layout being tagged with a global point number in X, DMPlexSectionLoad() internally constructs
a PetscSF that pushes forward the on-disk data layout to X. Composing this with sf, DMPlexSection-
Load() internally constructs another PetscSF that pushes forward the on-disk data layout directly to the
redistributed mesh. It then reconstructs the data layout s on the redistributed mesh and attaches it to sdm.
The objects returned by this function, globalDataSF and localDataSF, are PetscSFs that can be used
to migrate the on-disk vector data into local and global Vecs defined on sdm.
We now create a local vector associated with sdm, e.g., as:
Vec vec;
DMGetLocalVector(sdm, &vec);
We then name vec “vecA” and load the on-disk vector into vec as:
PetscObjectSetName((PetscObject)vec, "vecA");
DMPlexLocalVectorLoad(dm, viewer, sdm, localDataSF, vec);
where localDataSF knows how to migrate the on-disk vector data into a local Vec defined on sdm. The
on-disk vector can be loaded into a global vector associated with sdm in the exact same way with trivial
changes.
After loading, we destroy the PetscViewer with:
PetscViewerDestroy(&viewer);
The above infrastructure works seamlessly in distributed-memory parallel settings, in which one can even use a different number of processes for saving and for loading; a more comprehensive example is found in DMPlex Tutorial ex12.
DMPlex supports mesh adaptation using the Riemannian metric framework. The idea is to use a Riemannian
metric space within the mesher. The metric space dictates how mesh resolution should be distributed across
the domain. Using this information, the remesher transforms the mesh such that it is a unit mesh when
viewed in the metric space. That is, the image of each of its elements under the mapping from Euclidean
space into the metric space has edges of unit length.
One of the main advantages of metric-based mesh adaptation is that it allows for fully anisotropic remeshing.
That is, it provides a means of controlling the shape and orientation of elements in the adapted mesh, as well
as their size. This can be particularly useful for advection-dominated and directionally-dependent problems.
A metric must be symmetric positive-definite, so that distances may be properly defined. This can be checked using DMPlexMetricEnforceSPD(). This routine may also be used to enforce minimum and maximum tolerated metric magnitudes (i.e. cell sizes), as well as maximum anisotropy. These quantities can be specified using
-dm_plex_metric_h_min <h_min>
-dm_plex_metric_h_max <h_max>
-dm_plex_metric_a_max <a_max>
One simple way to combine two metrics is by simply averaging them entry-by-entry. Another is to intersect them, which amounts to choosing the greatest level of refinement in each direction. These operations are available in PETSc through the DMPlexMetricAverage() and DMPlexMetricIntersection() families of routines.
However, before combining metrics, it is important that they are scaled in the same way. Scaling also allows the user to control the number of vertices in the adapted mesh (in an approximate sense). This is achieved using the Lp normalization framework, with the routine DMPlexMetricNormalize().
There are two important parameters for the normalization: the normalization order p and the target metric
complexity, which is analogous to the vertex count. They are controlled using
-dm_plex_metric_p <p>
-dm_plex_metric_target_complexity <target>
Mmg is a serial package, whereas ParMmg is the MPI version. Note that surface meshing is not currently
supported and that ParMmg works only in three dimensions. Mmg can be used for both two and three
dimensional problems. Pragmatic, Mmg and ParMmg may be specified by the command line arguments
-dm_adaptor pragmatic
-dm_adaptor mmg
-dm_adaptor parmmg
Once a metric has been constructed, it can be used to perform metric-based mesh adaptation using the routine DMAdaptMetric(dm, metric, bdLabel, rgLabel, &dmAdapt), where bdLabel and rgLabel are boundary and interior tags to be preserved under adaptation, respectively.
3.3 DMSTAG: Staggered, Structured Grid
For structured (aka “regular”) grids with staggered data (living on elements, faces, edges, and/or vertices),
the DMSTAG object is available. This can be useful for problems in many domains, including fluid flow,
MHD, and seismology.
It is possible, though cumbersome, to implement a staggered-grid code using multiple DMDA objects, or a
single multi-component DMDA object where some degrees of freedom are unused. DMSTAG was developed for
two main purposes:
1. To help manage some of the burden of choosing and adhering to the complex indexing conventions
needed for staggered grids (in parallel)
2. To provide a uniform abstraction for which scalable solvers and preconditioners may be developed (in
particular, using PCFIELDSPLIT and PCMG).
DMSTAG is designed to behave much like DMDA, with a couple of important distinctions, and borrows some terminology from DMPLEX.
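A DMSTAG object is typically created with a dimension-specific constructor such as DMStagCreate2d(); a minimal 2D sketch (the grid sizes, dof counts, and stencil settings are illustrative):
DM dm;

PetscCall(DMStagCreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                         4, 3,                        /* global elements in x and y    */
                         PETSC_DECIDE, PETSC_DECIDE,  /* ranks in each direction       */
                         1, 1, 1,                     /* dof per vertex, face, element */
                         DMSTAG_STENCIL_BOX, 1,       /* stencil type and width        */
                         NULL, NULL, &dm));
PetscCall(DMSetFromOptions(dm));
PetscCall(DMSetUp(dm));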
3.3.1 Terminology
Like a DMPLEX object, a DMSTAG represents a cell complex, distributed in parallel over the ranks of an
MPI_Comm. It is, however, a very regular complex, consisting of a structured grid of d-dimensional cells,
with d ∈ {1, 2, 3}, which are referred to as elements, d − 1 dimensional cells defining boundaries between
these elements, and the boundaries of the domain, and in 2 or more dimensions, boundaries of these cells,
all the way down to 0-dimensional cells referred to as vertices. In 2 dimensions, the 1-dimensional element boundaries are referred to as edges or faces. In 3 dimensions, the 2-dimensional element boundaries are referred to as faces, and the 1-dimensional boundaries between faces are referred to as edges. The set of cells of a given dimension is referred to as a stratum (which one can think of as a level in the DAG representation of the mesh); a DMSTAG object of dimension d represents a complete cell complex with d + 1 strata (levels). In the description of DMPLEX (DMPlex: Unstructured Grids), the cells at each level are referred to as points; we adopt that terminology uniformly here, so in this document point will refer to a cell.
Each stratum has a constant number of unknowns (which may be zero) associated with each point (cell) on
that level. The distinct unknowns associated with each point are referred to as components.
The structured grid is, as with DMDA, decomposed via a Cartesian product of decompositions in each dimension, giving a rectangular local subdomain on each rank. This is extended by an element-wise stencil width of ghost elements to create an atlas of overlapping patches.
DMSTAG allows the user to reason almost entirely about a global indexing of elements. Element indices are
simply 1-3 PetscInt values, starting at 0, in the back, bottom, left corner of the domain. For instance,
element (1, 2, 3), in 3D, is the element second from the left, third from the bottom, and fourth from the back
(regardless of how many MPI ranks are used).
To refer to points (elements, faces, edges, and vertices), a value of DMStagStencilLocation is used,
relative to the element index. The element itself is referred to with DMSTAG_ELEMENT, the top right vertex
(in 2D) or the top right edge (in 3D) with DMSTAG_UP_RIGHT, the back bottom left corner in 3D with
DMSTAG_BACK_DOWN_LEFT, and so on.
Fig. 3.3 gives a few examples in 2D.
Fig. 3.3: Locations in DMSTAG are indexed according to global element indices (here, two in 2D) and a
location name. Elements have unique names but other locations can be referred to in more than one way.
Element colors correspond to a parallel decomposition, but locations on the grid have names which are
invariant to this. Note that the face on the top right can be referred to as being to the left of a “dummy”
element (3, 3) outside the physical domain.
Crucially, this global indexing scheme does not include any “ghost” or “padding” unknowns outside the
physical domain. This is useful for higher-level operations such as computing norms or developing physics-
based solvers. However (unlike with DMDA), this implies that the global Vecs do not have a natural block structure, as different strata have different numbers of points (e.g. in 1D there is an “extra” vertex on the right). A regular block structure is, however, very useful for the local representation of the data, so in that case dummy DOF are included, drawn as grey in Fig. 3.4.
Fig. 3.4: Local and global representations for a 2D DMSTAG object, 3 by 4 elements, with one degree of freedom on each of the three strata: elements (squares), faces (triangles), and vertices (circles). The cell complex is parallelized across 4 MPI ranks. In the global representation, the colors correspond to which rank holds the native representation of the unknown. The 4 local representations are shown, with an (elementwise) “box” stencil of width 1. Unknowns are colored by their native rank. Dummy unknowns, which correspond to no global degree of freedom, are colored grey. Note that the local representations have a natural block size of 4, while the global representation has no natural block size.
For working with Vec data, this approach is used to allow direct access to a multi-dimensional, regular-blocked array. To avoid the user having to know about the internal numbering conventions used, helper
functions are used to produce the proper final integer index for a given location and component, referred to
as a “slot”. Similarly to DMDAVecGetArrayDOF(), this uses a d + 1 dimensional array in d dimensions.
The following snippet gives an example of the end of such a loop; a fuller sketch is given below.
}
}
PetscCall(DMStagVecRestoreArray(dm, x, &x_array));
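A fuller, self-contained sketch of the pattern in 2D, assuming one dof per stratum (the vector and slot variable names are illustrative):
Vec            x;
PetscScalar ***arr; /* a (d+1)-dimensional array in 2D: [j][i][slot] */
PetscInt       start_x, start_y, n_x, n_y, nExtra_x, nExtra_y, i, j;
PetscInt       slot_element, slot_left, slot_down;

PetscCall(DMGetLocalVector(dm, &x));
PetscCall(DMStagGetCorners(dm, &start_x, &start_y, NULL, &n_x, &n_y, NULL, &nExtra_x, &nExtra_y, NULL));
PetscCall(DMStagGetLocationSlot(dm, DMSTAG_ELEMENT, 0, &slot_element));
PetscCall(DMStagGetLocationSlot(dm, DMSTAG_LEFT, 0, &slot_left));
PetscCall(DMStagGetLocationSlot(dm, DMSTAG_DOWN, 0, &slot_down));
PetscCall(DMStagVecGetArray(dm, x, &arr));
for (j = start_y; j < start_y + n_y; ++j) {
  for (i = start_x; i < start_x + n_x; ++i) {
    arr[j][i][slot_element] = 1.0; /* element-centered value   */
    arr[j][i][slot_left]    = 2.0; /* value on the left face   */
    arr[j][i][slot_down]    = 3.0; /* value on the bottom face */
  }
}
/* the extra partial row/column (nExtra_x, nExtra_y) holds the faces on the top and right boundaries */
PetscCall(DMStagVecRestoreArray(dm, x, &arr));
PetscCall(DMRestoreLocalVector(dm, &x));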
DMSTAG provides a stencil-based method for getting and setting entries of Mat and Vec objects. The following excerpt from DMSTAG Tutorial ex1 demonstrates the idea. For more, see the manual page for DMStagMatSetValuesStencil().
row.i = e;
row.loc = LEFT;
row.c = 0;
val = 1.0;
PetscCall(DMStagMatSetValuesStencil(dmSol, A, 1, &row, 1, &row, &val, INSERT_VALUES));
The array-based approach for Vec is likely to be more efficient than the stencil-based method just introduced
above.
3.3.3 Coordinates
DMSTAG, unlike DMDA, supports two approaches to defining coordinates. This is captured by which type of
DM is used to represent the coordinates. No default is imposed, so the user must directly or indirectly call
DMStagSetCoordinateDMType().
If a second DMSTAG object is used to represent coordinates in “explicit” form, behavior is much like with
DMDA - the coordinate DM has d DOF on each stratum corresponding to coordinates associated with each
point.
If DMPRODUCT is used instead, coordinates are represented by a DMPRODUCT object referring to a Cartesian
product of 1D DMSTAG objects, each of which features explicit coordinates as just mentioned.
Navigating the DMs nested inside a DMPRODUCT can be tedious, so note the existence of helper functions like DMStagSetUniformCoordinatesProduct() and DMStagGetProductCoordinateArrays().
While DMSTAG aims to hide the details of its internal data layout, for debugging, optimization, and cus-
tomization purposes, it can be important to know how DMSTAG internally numbers unknowns.
Internally, each point is canonically associated with an element (top-level point (cell)). For purposes of local,
regular-blocked storage, an element is grouped with lower-dimensional points left of, below (“down”), and be-
hind (“back”) it. This means that “canonical” values of DMStagStencilLocation are DMSTAG_ELEMENT,
plus all entries consisting only of “LEFT”, “DOWN”, and “BACK”. In general, these are the most efficient
values to use, unless convenience dictates otherwise, as they are the ones used internally.
When creating the decomposition of the domain to local ranks, and extending these local domains to handle
overlapping halo regions and boundary ghost unknowns, this same per-element association is used. This has
the advantage of maintaining a regular blocking, but may not be optimal in some situations in terms of data
movement.
Numberings are, like DMDA, based on a local “x-fastest, z-slowest” or “PETSc” ordering of elements (see
Application Orderings), with ordering of locations canonically associated with each element decided by
considering unknowns on each point to be located at the center of their point, and using a nested ordering of
the same style. Thus, in 3-D, the ordering of the 8 canonical DMStagStencilLocation values associated
with an element is
DMSTAG_BACK_DOWN_LEFT
DMSTAG_BACK_DOWN
DMSTAG_BACK_LEFT
DMSTAG_BACK
DMSTAG_DOWN_LEFT
DMSTAG_DOWN
DMSTAG_LEFT
DMSTAG_ELEMENT
Multiple DOF associated with a given point are stored sequentially (as with DMDA).
For local Vecs, this gives a regular-blocked numbering, with the same number of unknowns associated with each element, including some “dummy” unknowns which do not correspond to any (local or global) unknown in the global representation. See Fig. 3.6 for an example.
In the global representation, only physical unknowns are numbered (using the same “Z” ordering for un-
knowns which are present), giving irregular numbers of unknowns, depending on whether a domain boundary
is present. See Fig. 3.5 for an example.
It should be noted that this is an interlaced (AoS) representation. If a segregated (SoA) representation is
required, one should use DMCOMPOSITE collecting several DMSTAG objects, perhaps using DMStagCreate-
CompatibleDMStag() to quickly create additional DMSTAG objects from an initial one.
3.4 Networks
The DMNETWORK class provides abstractions for representing general unstructured networks such as communication networks, power grids, computer networks, transportation networks, electrical circuits, graphs, and others.
Fig. 3.5: Global numbering scheme for a 2D DMSTAG object with one DOF per stratum. Note that the
numbering depends on the parallel decomposition (over 4 ranks, here).
Fig. 3.6: Local numbering scheme on rank 1 (Cf. Fig. 3.4) for a 2D DMSTAG object with one DOF per
stratum. Note that dummy locations (grey) are used to give a regular block size (here, 4).
The general flow of an application code using DMNETWORK is as follows:
1. Create a network object with DMNetworkCreate(MPI_Comm comm, DM *dm).
2. Create components and register them with the network. A “component” is specific application data at a vertex/edge of the network required for its residual evaluation. For example, components could be resistor/inductor data for circuit applications, edge weights for graph problems, or generator/transmission line data for power grids. Components are registered by calling DMNetworkRegisterComponent(DM dm, const char *name, size_t size, PetscInt *compkey). Here, name is the component name, size is the size of the component data, and compkey is an integer key that can be used for setting/getting the component at a vertex or an edge.
3. A DMNETWORK can consist of one or more physical subnetworks. Each subnetwork has its own mathe-
matical model. When multiple subnetworks are used one can (optionally) provide coupling information
between subnetworks, that is, vertices that are shared between multiple subnetworks; edges can only belong to a single subnetwork. The number of subnetworks is set by calling DMNetworkSetNumSubNetworks(DM dm, PetscInt nsubnet, PetscInt Nsubnet). Here, nsubnet and Nsubnet are the local and global number of subnetworks.
4. A subnetwork is added to the network by calling DMNetworkAddSubnetwork(DM dm, const char *name, PetscInt ne, PetscInt edgelist[], PetscInt *netnum). Here name is the subnetwork name, ne is the number of local edges on the subnetwork, and edgelist is the connectivity for the subnetwork. The output netnum is the global numbering of the subnetwork in the network. Each element of edgelist is an integer array of size 2*ne containing the edge connectivity for the subnetwork.
As an example, consider a network comprised of 2 subnetworks that are coupled. The topological
information for the network is as follows:
subnetwork 0: v0 — v1 — v2 — v3
subnetwork 1: v1 — v2 — v0
The two subnetworks are coupled by merging vertex 0 from subnetwork 0 with vertex 2 from
subnetwork 1.
The edgelist of this network is
edgelist[0] = {0,1,1,2,2,3}
edgelist[1] = {1,2,2,0}
The coupling between the two subnetworks is specified with DMNetworkAddSharedVertices(DM dm, PetscInt anet, PetscInt bnet, PetscInt nsv, PetscInt asv[], PetscInt bsv[]). Here anet and bnet are the first and second subnetwork global numberings returned by DMNetworkAddSubnetwork(), nsv is the number of vertices shared by the two subnetworks, and asv and bsv are the indices of the shared vertices in subnetworks anet and bnet, respectively.
5. The next step is to have DMNETWORK create a bare layout (graph) of the network by calling
DMNetworkLayoutSetUp(DM dm);
6. After completing the previous steps, the network graph is set up, but no physics is associated yet. This
is done by adding the components and setting the number of variables to the vertices and edges.
A component and number of variables are added to a vertex/edge by calling DMNetworkAddComponent(DM dm, PetscInt p, PetscInt compkey, void *compdata, PetscInt nvar),
where p is the network vertex/edge point in the range obtained by either DMNet-
workGetVertexRange()/DMNetworkGetEdgeRange(), DMNetworkGetSubnetwork(), or
DMNetworkGetSharedVertices(); compkey is the component key returned when registering the
component (DMNetworkRegisterComponent()); compdata holds the data for the component;
and nvar is the number of variables associated to the added component at this network point. DM-
NETWORK supports setting multiple components at a vertex/edge. At a shared vertex, DMNETWORK currently requires that the owner process of the vertex add all the components and numbers of variables.
DMNETWORK currently assumes the component data to be stored in a contiguous chunk of memory. As
such, it does not do any packing/unpacking before/after the component data gets distributed. Any
such serialization (packing/unpacking) should be done by the application.
7. Set up network internal data structures.
DMSetUp(DM dm);
8. Distribute the network (this also moves the components attached to vertices/edges) to multiple processes with DMNetworkDistribute().
DMNETWORK provides several utility functions for operations on the network. The most commonly used functions are: obtaining iterators for vertices/edges with DMNetworkGetVertexRange() and DMNetworkGetEdgeRange(), and retrieving the local/global indices of vertex/edge component variables for inserting elements in vectors/matrices with DMNetworkGetLocalVecOffset() and DMNetworkGetGlobalVecOffset().
In network applications, one frequently needs to find the supporting edges for a vertex or the connecting vertices covering an edge. These can be obtained with DMNetworkGetSupportingEdges() and DMNetworkGetConnectedVertices(), respectively.
The components and the corresponding number of variables set at a vertex/edge can be accessed with DMNetworkGetComponent(DM dm, PetscInt p, PetscInt compnum, PetscInt *compkey, void **component, PetscInt *nvar). The input compnum is the component number; the output compkey is the key set by DMNetworkRegisterComponent(). An example of accessing and retrieving the components and number of variables at vertices is:
PetscInt Start,End,numcomps,key,v,compnum;
void *component;
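A sketch of such a loop over the local vertices, completing the declarations above (the loop structure is illustrative):
PetscCall(DMNetworkGetVertexRange(dm, &Start, &End));
for (v = Start; v < End; ++v) {
  PetscCall(DMNetworkGetNumComponents(dm, v, &numcomps));
  for (compnum = 0; compnum < numcomps; ++compnum) {
    PetscInt nvar;

    PetscCall(DMNetworkGetComponent(dm, v, compnum, &key, &component, &nvar));
    /* ... use the component data and its nvar variables ... */
  }
}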
The above example does not explicitly use the component key. It is used when different component types
are set at different vertices. In this case, compkey is used to differentiate the component type.
This chapter discusses the low-level infrastructure which supports higher-level discretizations in PETSc,
which includes things such as quadrature and probability distributions.
3.5.1 Quadrature
3.5.2 Probability Distributions
A probability distribution function (PDF) returns the probability density at a given location, $P(x)$, so that the probability for an event at a location in $[x, x + dx]$ is $P(x)\,dx$. This means that we must have the normalization condition

$$\int_\Omega P(x)\,dx = 1,$$
where $\Omega$ is the domain for $x$. This requires that the PDF must have units which are the inverse of the volume form $dx$, meaning that it is homogeneous of order $d$ under scaling: substituting $x = \lambda s$,

$$\int_\Omega P(x)\,dx = \int_\Omega P(\lambda s)\,\lambda^d\,ds = \int_\Omega P(s)\,\lambda^{-d}\lambda^d\,ds = \int_\Omega P(s)\,ds = 1.$$
The cumulative distribution function (CDF) is the incomplete integral of the PDF,

$$C(x) = \int_{x_-}^{x} P(s)\,ds,$$

where $x_-$ is the lower limit of our domain. We can work out the effect of scaling on the CDF using this definition,

$$C(\lambda x) = \int_{x_-}^{\lambda x} P(s)\,ds = \int_{x_-}^{x} \lambda^d P(\lambda t)\,dt = \int_{x_-}^{x} P(t)\,dt = C(x).$$
This chapter introduces the PetscFE class, and the related classes PetscSpace and PetscDualSpace, which are used to represent finite element discretizations. It details their interaction with the DMPLEX class to assemble functions and operators over computational meshes, and to produce optimal solvers by constructing multilevel iterations, for example using PCPATCH. The idea behind these classes is not to encompass all of computational finite elements, but rather to establish an interface and infrastructure that will allow PETSc to leverage the excellent work done in packages such as Firedrake, FEniCS, LibMesh, and Deal.II.
A finite element problem is presented to PETSc in a series of steps. This is both to facilitate automation, and to allow multiple entry points for user code and external packages, since so much finite element software already exists. First, we tell the DM, usually a DMPLEX or DMFOREST, that we have a set of finite element fields which we intend to solve for in our problem, using DMAddField(). The second argument is a DMLabel object indicating the support of the field on the mesh, with NULL indicating the entire domain. Once we have a set of fields, we call
DMCreateDS(dm);
This divides the computational domain into subdomains, called regions in PETSc, each with a unique set of fields supported on it. These subdomains are identified by labels, and each one has a PetscDS object describing the discrete system on that subdomain. There are query functions to get the set of PetscDS objects for the DM, but it is usually easiest to get the proper PetscDS for a given cell using DMGetCellDS().
Each PetscDS object has a set of fields, each with a PetscFE or PetscFV discretization. This allows it to
calculate the size of the local discrete approximation, as well as allocate scratch space for all the associated
computations. The final thing needed is to specify the actual equations to be enforced on each region. The
PetscDS contains a PetscWeakForm object that holds callback function pointers that define the equations.
A simplified, top-level interface through PetscDS allows users to quickly define problems for a single region.
For example, in SNES Tutorial ex13, we define the Poisson problem using
DMLabel label;
PetscInt f = 0, id = 1;
/* pointwise exact solution: u = sum_d sin(2 pi x_d) */
{
  PetscInt d;
  *u = 0.0;
  for (d = 0; d < dim; ++d) *u += PetscSinReal(2.0*PETSC_PI*x[d]);
  return 0;
}
/* pointwise residual term f1: the flux is grad u */
{
  PetscInt d;
  for (d = 0; d < dim; ++d) f1[d] = u_x[d];
}
/* pointwise Jacobian term g3: the identity tensor */
{
  PetscInt d;
  for (d = 0; d < dim; ++d) g3[d*dim+d] = 1.0;
}
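The pointwise functions above are attached to the PetscDS for field 0; a sketch of the typical calls (the function names f0_trig_inhomogeneous_u, f1_u, g3_uu, and trig_inhomogeneous_u are assumptions consistent with the fragments shown here and below):
PetscDS ds;

PetscCall(DMGetDS(dm, &ds));
PetscCall(PetscDSSetResidual(ds, 0, f0_trig_inhomogeneous_u, f1_u));
PetscCall(PetscDSSetJacobian(ds, 0, 0, NULL, NULL, NULL, g3_uu));
PetscCall(PetscDSSetExactSolution(ds, 0, trig_inhomogeneous_u, NULL));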
Notice that we set boundary conditions using DMAddBoundary, which will be described later in this chapter.
Also we set an exact solution for the field. This can be used to automatically calculate mesh convergence
using the PetscConvEst object described later in this chapter.
For more complex cases with multiple regions, we need to use the PetscWeakForm interface directly. The
weak form object allows you to set any number of functions for a given field, and also allows functions to be
associated with particular subsets of the mesh using labels and label values. We can reproduce the above
problem using the SetIndex variants which only set a single function at the specified index, rather than a
list of functions. We use a NULL label and value, meaning that the entire domain is used.
PetscWeakForm wf;
PetscInt      f = 0, val = 0;

PetscDSGetWeakForm(ds, &wf);
PetscWeakFormSetIndexResidual(wf, NULL, val, f, 0, 0, f0_trig_inhomogeneous_u, 0, f1_u);
In SNES Tutorial ex23, we define the Poisson problem over the entire domain, but in the top half we also
define a pressure. The entire problem can be specified as follows
In the PyLith software we use this capability to combine bulk elasticity with a fault constitutive model
integrated over the embedded manifolds corresponding to earthquake faults.
Once the pointwise functions are set in each PetscDS, mesh traversals can be automatically determined
from the DMLabel and value specifications in the keys. This default traversal strategy can be activated by
attaching the DM and default callbacks to a solver
SNESSetDM(snes, dm);
DMPlexSetSNESLocalFEM(dm, &user, &user, &user);
TSSetDM(ts, dm);
DMTSSetBoundaryLocal(dm, DMPlexTSComputeBoundary, &user);
DMTSSetIFunctionLocal(dm, DMPlexTSComputeIFunctionFEM, &user);
DMTSSetIJacobianLocal(dm, DMPlexTSComputeIJacobianFEM, &user);
FOUR
ADDITIONAL INFORMATION
Most of the functionality of PETSc can be obtained from Fortran programs. Make sure the suffix of all your
Fortran files is .F90, not .f or .f90.
To use PETSc with Fortran you must use both PETSc include files and modules. At the beginning of every
function and module definition you need something like
#include "petsc/finclude/petscXXX.h"
use petscXXX
You can declare PETSc object variables using either of the following:
XXX variablename
type(tXXX) variablename
For example,
#include "petsc/finclude/petscvec.h"
use petscvec
Vec b
type(tVec) x
PETSc types like PetscInt and PetscReal are simply aliases for basic Fortran types and cannot be written as type(tPetscInt). The Fortran include files for PETSc are located in the directory $PETSC_DIR/include/petsc/finclude and the module files are located in $PETSC_DIR/$PETSC_ARCH/include.
Most Fortran routines have the same names as the corresponding C versions, and PETSc command line
options are fully supported. The routine arguments follow the usual Fortran conventions; the user need
not worry about passing pointers or values. The calling sequences for the Fortran version are in most cases
identical to the C version, except for the error checking variable discussed in Error Checking and a few
routines listed in Routines with Different Fortran Interfaces.
When passing floating point numbers into PETSc Fortran subroutines, always make sure you have them
marked as double precision (e.g., pass in 10.d0 instead of 10.0 or declare them as PETSc variables, e.g.
PetscScalar one = 1.0). Otherwise, the compiler interprets the input as a single precision number,
which can cause crashes or other mysterious problems. We highly recommend using implicit none at the beginning of each Fortran subroutine and declaring all variables.
Error Checking
In the Fortran version, each PETSc routine has as its final argument an integer error variable. The error
code is set to be nonzero if an error has been detected; otherwise, it is zero. For example, the Fortran and C
variants of KSPSolve() are given, respectively, below, where ierr denotes the PetscErrorCode error variable:
call KSPSolve(ksp, b, x, ierr)
KSPSolve(ksp, b, x);
For proper error handling one should not use the above syntax; instead, one should use PetscCall(KSPSolve(ksp, b, x, ierr)) (or PetscCallA() from a main program).
Different compilers have different methods of naming Fortran routines called from C (or C routines called
from Fortran). Most Fortran compilers change all the capital letters in Fortran routines to all lowercase.
With some compilers, the Fortran compiler appends an underscore to the end of each Fortran routine name;
for example, the Fortran routine Dabsc() would be called from C with dabsc_(). Other compilers change
all the letters in Fortran routine names to capitals.
PETSc provides two macros (defined in C/C++) to help write portable code that mixes C/C++ and For-
tran. They are PETSC_HAVE_FORTRAN_UNDERSCORE and PETSC_HAVE_FORTRAN_CAPS, which will be defined in the file $PETSC_DIR/$PETSC_ARCH/include/petscconf.h based on the compiler's conventions. The macros are used, for example, as follows:
#if defined(PETSC_HAVE_FORTRAN_CAPS)
#define dabsc_ DABSC
#elif !defined(PETSC_HAVE_FORTRAN_UNDERSCORE)
#define dabsc_ dabsc
#endif
.....
dabsc_( &n,x,y); /* call the Fortran function */
Many PETSc C functions have the option of passing a NULL argument (for example, the fifth argument
of MatCreateSeqAIJ()). From Fortran, users must pass PETSC_NULL_XXX to indicate a null argument
(where XXX is INTEGER, DOUBLE, CHARACTER, SCALAR, VEC, MAT, etc depending on the argument type);
passing a literal 0 from Fortran in this case will crash the code. For example, when no options prefix is
desired in the routine PetscOptionsGetInt(), one must use the following command in Fortran:
PetscCallA(PetscOptionsGetInt(PETSC_NULL_OPTIONS,PETSC_NULL_CHARACTER,'-name',N,flg,ierr))
All matrices, vectors and IS in PETSc use zero-based indexing, regardless of whether C or Fortran is being
used. For example, MatSetValues() and VecSetValues() always use zero indexing. See Basic Matrix
Operations for further details.
Setting Routines
When a function pointer (declared as external in Fortran) is passed as an argument to a PETSc function,
such as the test function in KSPSetConvergenceTest(), it is assumed that this function references a
routine written in the same language as the PETSc interface function that was called. For instance, if
KSPSetConvergenceTest() is called from C, the test function must be a C function. Likewise, if it is
called from Fortran, the test function must be (a subroutine) written in Fortran.
The following Fortran routines differ slightly from their C counterparts; see the manual pages and previous
discussion in this chapter for details:
PetscInitialize()
PetscOptionsGetString()
The Fortran interface to VecDuplicateVecs() differs slightly from the C/C++ variant. To create n
vectors of the same format as an existing vector, the user must declare a vector array, v_new of size n.
Then, after VecDuplicateVecs() has been called, v_new will contain (pointers to) the new PETSc vector
objects. When finished with the vectors, the user should destroy them by calling VecDestroyVecs(). For
example, the following code fragment duplicates v_old to form two new vectors, v_new(1) and v_new(2).
Vec v_old, v_new(2)
PetscErrorCode ierr
PetscScalar alpha
....
PetscCall(VecDuplicateVecs(v_old, 2, v_new, ierr))
alpha = 4.3
PetscCall(VecSet(v_new(1), alpha, ierr))
alpha = 6.0
PetscCall(VecSet(v_new(2), alpha, ierr))
Many PETSc functions that return an array of values in C in an argument (such as ISGetIndices())
return an allocatable array in Fortran. The Fortran function names for these are suffixed with F90 as
indicated below. A few routines, such as VecDuplicateVecs() discussed above, do not return a Fortran
allocatable array; a large enough array must be explicitly declared before use.
C-API Fortran-API
ISGetIndices() ISGetIndicesF90()
ISRestoreIndices() ISRestoreIndicesF90()
ISLocalToGlobalMappingGetIndices() ISLocalToGlobalMappingGetIndicesF90()
ISLocalToGlobalMappingRestoreIndices() ISLocalToGlobalMappingRestoreIndicesF90()
VecGetArray() VecGetArrayF90()
VecRestoreArray() VecRestoreArrayF90()
VecGetArrayRead() VecGetArrayReadF90()
VecRestoreArrayRead() VecRestoreArrayReadF90()
VecDuplicateVecs() VecDuplicateVecsF90()
VecDestroyVecs() VecDestroyVecsF90()
DMDAVecGetArray() DMDAVecGetArrayF90()
DMDAVecRestoreArray() DMDAVecRestoreArrayF90()
DMDAVecGetArrayRead() DMDAVecGetArrayReadF90()
DMDAVecRestoreArrayRead() DMDAVecRestoreArrayReadF90()
DMDAVecGetArrayWrite() DMDAVecGetArrayWriteF90()
DMDAVecRestoreArrayWrite() DMDAVecRestoreArrayWriteF90()
MatGetRowIJ() MatGetRowIJF90()
MatRestoreRowIJ() MatRestoreRowIJF90()
MatSeqAIJGetArray() MatSeqAIJGetArrayF90()
MatSeqAIJRestoreArray() MatSeqAIJRestoreArrayF90()
MatMPIAIJGetSeqAIJ() MatMPIAIJGetSeqAIJF90()
MatDenseGetArray() MatDenseGetArrayF90()
MatDenseRestoreArray() MatDenseRestoreArrayF90()
The array arguments to these Fortran functions should be declared with forms such as
PetscScalar, pointer :: array(:)
See the manual pages for details and pointers to example programs.
Sample programs that illustrate the PETSc interface for Fortran are given below, corresponding to Vec
Test ex19f, Vec Tutorial ex4f, Draw Test ex5f, and SNES Tutorial ex1f, respectively. We also refer Fortran
programmers to the C examples listed throughout the manual, since PETSc usage within the two languages
differs only slightly.
Listing: src/vec/vec/tests/ex19f.F90
!
!
program main
#include <petsc/finclude/petscvec.h>
use petscvec
implicit none
!
! This example demonstrates basic use of the PETSc Fortran interface
! to vectors.
!
PetscInt n
PetscErrorCode ierr
PetscBool flg
PetscScalar one,two,three,dot
PetscReal norm,rdot
Vec x,y,w
PetscOptions options
n = 20
one = 1.0
two = 2.0
three = 3.0
PetscCallA(PetscInitialize(ierr))
PetscCallA(PetscOptionsCreate(options,ierr))
PetscCallA(PetscOptionsGetInt(options,PETSC_NULL_CHARACTER,'-n',n,flg,ierr))
PetscCallA(PetscOptionsDestroy(options,ierr))
PetscCallA(VecSet(x,one,ierr))
PetscCallA(VecSet(y,two,ierr))
PetscCallA(VecDot(x,y,dot,ierr))
rdot = PetscRealPart(dot)
write(6,100) rdot
100 format('Result of inner product ',f10.4)
PetscCallA(VecScale(x,two,ierr))
PetscCallA(VecNorm(x,NORM_2,norm,ierr))
write(6,110) norm
110 format('Result of scaling ',f10.4)
PetscCallA(VecAXPY(y,three,x,ierr))
PetscCallA(VecNorm(y,NORM_2,norm,ierr))
write(6,130) norm
130 format('Result of axpy ',f10.4)
PetscCallA(VecDestroy(x,ierr))
PetscCallA(VecDestroy(y,ierr))
PetscCallA(VecDestroy(w,ierr))
PetscCallA(PetscFinalize(ierr))
end
Listing: src/vec/vec/tutorials/ex4f.F90
!
!
! Description: Illustrates the use of VecSetValues() to set
! multiple values at once; demonstrates VecGetArrayF90().
!
! -----------------------------------------------------------------------
program main
#include <petsc/finclude/petscvec.h>
use petscvec
implicit none
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Beginning of program
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PetscScalar xwork(6)
PetscScalar, pointer :: xx_v(:),yy_v(:)
PetscInt i,n,loc(6),isix
PetscErrorCode ierr
Vec x,y
PetscCallA(PetscInitialize(ierr))
n = 6
isix = 6
PetscCallA(VecCreateSeq(PETSC_COMM_SELF,n,x,ierr))
PetscCallA(VecDuplicate(x,y,ierr))
! Fill work arrays with vector entries and locations. Note that
! the vector indices are 0-based in PETSc (for both Fortran and
! C vectors)
do 10 i=1,n
PetscCallA(VecSetValues(x,isix,loc,xwork,INSERT_VALUES,ierr))
! Assemble vector
PetscCallA(VecAssemblyBegin(x,ierr))
PetscCallA(VecAssemblyEnd(x,ierr))
! View vector
PetscCallA(PetscObjectSetName(x, 'initial vector:',ierr))
PetscCallA(VecView(x,PETSC_VIEWER_STDOUT_SELF,ierr))
PetscCallA(VecCopy(x,y,ierr))
PetscCallA(VecGetArrayF90(x,xx_v,ierr))
PetscCallA(VecGetArrayF90(y,yy_v,ierr))
do 30 i=1,n
xx_v(i) = 100.0*real(i)
yy_v(i) = 1000.0*real(i)
30 continue
! Restore vectors
PetscCallA(VecRestoreArrayF90(x,xx_v,ierr))
PetscCallA(VecRestoreArrayF90(y,yy_v,ierr))
! View vectors
PetscCallA(PetscObjectSetName(x, 'new vector 1:',ierr))
PetscCallA(VecView(x,PETSC_VIEWER_STDOUT_SELF,ierr))
! Free work space. All PETSc objects should be destroyed when they
! are no longer needed.
PetscCallA(VecDestroy(x,ierr))
PetscCallA(VecDestroy(y,ierr))
Listing: src/sys/classes/draw/tests/ex5f.F90
!
!
program main
#include <petsc/finclude/petscsys.h>
#include <petsc/finclude/petscdraw.h>
use petscsys
implicit none
!
! This example demonstrates basic use of the Fortran interface for
! PetscDraw routines.
!
PetscDraw draw
PetscDrawLG lg
PetscDrawAxis axis
PetscErrorCode ierr
PetscBool flg
integer x,y,width,height
PetscScalar xd,yd
PetscReal ten
PetscInt i,n,w,h
PetscInt one
n = 15
x = 0
y = 0
w = 400
h = 300
ten = 10.0
one = 1
PetscCallA(PetscInitialize(ierr))
PetscCallA(PetscOptionsGetInt(PETSC_NULL_OPTIONS,PETSC_NULL_CHARACTER,'-width',w,flg,ierr))
width = int(w)
PetscCallA(PetscOptionsGetInt(PETSC_NULL_OPTIONS,PETSC_NULL_CHARACTER,'-height',h,flg,ierr))
height = int(h)
PetscCallA(PetscOptionsGetInt(PETSC_NULL_OPTIONS,PETSC_NULL_CHARACTER,'-n',n,flg,ierr))
PetscCallA(PetscDrawCreate(PETSC_COMM_WORLD,PETSC_NULL_CHARACTER,PETSC_NULL_CHARACTER,x,y,width,height,draw,ierr))
PetscCallA(PetscDrawSetFromOptions(draw,ierr))
PetscCallA(PetscDrawLGCreate(draw,one,lg,ierr))
PetscCallA(PetscDrawLGGetAxis(lg,axis,ierr))
PetscCallA(PetscDrawAxisSetColors(axis,PETSC_DRAW_BLACK,PETSC_DRAW_RED,PETSC_DRAW_BLUE,ierr))
do 10, i=0,n-1
xd = real(i) - 5.0
yd = xd*xd
PetscCallA(PetscDrawLGAddPoint(lg,xd,yd,ierr))
10 continue
PetscCallA(PetscDrawLGSetUseMarkers(lg,PETSC_TRUE,ierr))
PetscCallA(PetscDrawLGDraw(lg,ierr))
PetscCallA(PetscSleep(ten,ierr))
PetscCallA(PetscDrawLGDestroy(lg,ierr))
PetscCallA(PetscDrawDestroy(draw,ierr))
PetscCallA(PetscFinalize(ierr))
end
Listing: src/snes/tutorials/ex1f.F90
!
!
! Description: Uses the Newton method to solve a two-variable system.
!
program main
#include <petsc/finclude/petsc.h>
use petsc
implicit none
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Variable declarations
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
!
! Variables:
! snes - nonlinear solver
! ksp - linear solver
! pc - preconditioner context
! ksp - Krylov subspace method context
! x, r - solution, residual vectors
! J - Jacobian matrix
! its - iterations for convergence
!
SNES snes
PC pc
KSP ksp
Vec x,r
Mat J
SNESLineSearch linesearch
PetscErrorCode ierr
PetscInt its,i2,i20
PetscMPIInt size,rank
PetscScalar pfive
PetscReal tol
PetscLogDouble threshold,oldthreshold
PetscBool setls
PetscViewer viewer
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Beginning of program
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PetscCallA(PetscInitialize(ierr))
PetscCallA(PetscLogNestedBegin(ierr))
threshold = 1.0
PetscCallA(PetscLogSetThreshold(threshold,oldthreshold,ierr))
! dummy test of logging a reduction
#if defined(PETSC_USE_LOG)
ierr = PetscAReduce()
#endif
PetscCallMPIA(MPI_Comm_size(PETSC_COMM_WORLD,size,ierr))
PetscCallMPIA(MPI_Comm_rank(PETSC_COMM_WORLD,rank,ierr))
if (size .ne. 1) then; SETERRA(PETSC_COMM_SELF,PETSC_ERR_WRONG_MPI_SIZE,'Uniprocessor example'); endif
i2 = 2
i20 = 20
! - - - - - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - -
! Create nonlinear solver context
! - - - - - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - -
PetscCallA(SNESCreate(PETSC_COMM_WORLD,snes,ierr))
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Create matrix and vector data structures; set corresponding routines
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PetscCallA(VecCreateSeq(PETSC_COMM_SELF,i2,x,ierr))
PetscCallA(VecDuplicate(x,r,ierr))
PetscCallA(MatCreate(PETSC_COMM_SELF,J,ierr))
PetscCallA(MatSetSizes(J,PETSC_DECIDE,PETSC_DECIDE,i2,i2,ierr))
PetscCallA(MatSetFromOptions(J,ierr))
PetscCallA(MatSetUp(J,ierr))
PetscCallA(SNESSetFunction(snes,r,FormFunction,0,ierr))
PetscCallA(SNESSetJacobian(snes,J,J,FormJacobian,0,ierr))
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Customize nonlinear solver; set runtime options
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PetscCallA(SNESGetKSP(snes,ksp,ierr))
PetscCallA(KSPGetPC(ksp,pc,ierr))
PetscCallA(PCSetType(pc,PCNONE,ierr))
tol = 1.e-4
PetscCallA(KSPSetTolerances(ksp,tol,PETSC_DEFAULT_REAL,PETSC_DEFAULT_REAL,i20,ierr))
PetscCallA(SNESSetFromOptions(snes,ierr))
PetscCallA(PetscOptionsHasName(PETSC_NULL_OPTIONS,PETSC_NULL_CHARACTER,'-setls',setls,ierr))
if (setls) then
PetscCallA(SNESGetLineSearch(snes, linesearch, ierr))
PetscCallA(SNESLineSearchSetType(linesearch, 'shell', ierr))
PetscCallA(SNESLineSearchShellSetUserFunc(linesearch, MyLineSearch,0,ierr))
endif
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Evaluate initial guess; then solve nonlinear system
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Note: The user should initialize the vector, x, with the initial guess
! for the nonlinear solver prior to calling SNESSolve(). In particular,
! to employ an initial guess of zero, the user should explicitly set
! this vector to zero by calling VecSet().
pfive = 0.5
PetscCallA(VecSet(x,pfive,ierr))
PetscCallA(SNESSolve(snes,PETSC_NULL_VEC,x,ierr))
! View solver converged reason; we could instead use the option -snes_converged_reason
PetscCallA(SNESConvergedReasonView(snes,PETSC_VIEWER_STDOUT_WORLD,ierr))
PetscCallA(SNESGetIterationNumber(snes,its,ierr))
if (rank .eq. 0) then
write(6,100) its
endif
100 format('Number of SNES iterations = ',i5)
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
! Free work space. All PETSc objects should be destroyed when they
! are no longer needed.
! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PetscCallA(VecDestroy(x,ierr))
PetscCallA(VecDestroy(r,ierr))
PetscCallA(MatDestroy(J,ierr))
PetscCallA(SNESDestroy(snes,ierr))
#if defined(PETSC_USE_LOG)
PetscCallA(PetscViewerASCIIOpen(PETSC_COMM_WORLD,'filename.xml',viewer,ierr))
PetscCallA(PetscViewerPushFormat(viewer,PETSC_VIEWER_ASCII_XML,ierr))
PetscCallA(PetscLogView(viewer,ierr))
PetscCallA(PetscViewerDestroy(viewer,ierr))
#endif
PetscCallA(PetscFinalize(ierr))
end
!
! ------------------------------------------------------------------------
!
! FormFunction - Evaluates nonlinear function, F(x).
!
! Input Parameters:
! snes - the SNES context
! x - input vector
! dummy - optional user-defined context (not used here)
!
! Output Parameter:
! f - function vector
!
subroutine FormFunction(snes,x,f,dummy,ierr)
use petscsnes
implicit none
SNES snes
Vec x,f
PetscErrorCode ierr
integer dummy(*)
PetscScalar,pointer :: lx_v(:),lf_v(:)
PetscCall(VecGetArrayReadF90(x,lx_v,ierr))
PetscCall(VecGetArrayF90(f,lf_v,ierr))
! Compute function
lf_v(1) = lx_v(1)*lx_v(1) + lx_v(1)*lx_v(2) - 3.0
lf_v(2) = lx_v(1)*lx_v(2) + lx_v(2)*lx_v(2) - 6.0
! Restore vectors
PetscCall(VecRestoreArrayReadF90(x,lx_v,ierr))
PetscCall(VecRestoreArrayF90(f,lf_v,ierr))
return
end
! ---------------------------------------------------------------------
!
! FormJacobian - Evaluates Jacobian matrix.
!
! Input Parameters:
! snes - the SNES context
! x - input vector
! dummy - optional user-defined context (not used here)
!
! Output Parameters:
! A - Jacobian matrix
! B - optionally different preconditioning matrix
!
subroutine FormJacobian(snes,X,jac,B,dummy,ierr)
use petscsnes
implicit none
SNES snes
Vec X
Mat jac,B
PetscScalar A(4)
PetscErrorCode ierr
PetscInt idx(2),i2
integer dummy(*)
PetscScalar,pointer :: lx_v(:)
i2 = 2
PetscCall(VecGetArrayReadF90(x,lx_v,ierr))
idx(1) = 0
idx(2) = 1
A(1) = 2.0*lx_v(1) + lx_v(2)
A(2) = lx_v(1)
A(3) = lx_v(2)
A(4) = lx_v(1) + 2.0*lx_v(2)
PetscCall(MatSetValues(B,i2,idx,i2,idx,A,INSERT_VALUES,ierr))
! Restore vector
PetscCall(VecRestoreArrayReadF90(x,lx_v,ierr))
! Assemble matrix
PetscCall(MatAssemblyBegin(B,MAT_FINAL_ASSEMBLY,ierr))
PetscCall(MatAssemblyEnd(B,MAT_FINAL_ASSEMBLY,ierr))
if (B .ne. jac) then
PetscCall(MatAssemblyBegin(jac,MAT_FINAL_ASSEMBLY,ierr))
PetscCall(MatAssemblyEnd(jac,MAT_FINAL_ASSEMBLY,ierr))
endif
return
end
! ---------------------------------------------------------------------
!
! MyLineSearch - user-defined line search routine, registered above with
! SNESLineSearchShellSetUserFunc()
!
subroutine MyLineSearch(linesearch,lctx,ierr)
use petscsnes
implicit none
SNESLineSearch linesearch
SNES snes
integer lctx
Vec x, f,g, y, w
PetscReal ynorm,gnorm,xnorm
PetscErrorCode ierr
PetscScalar mone
mone = -1.0
PetscCall(SNESLineSearchGetSNES(linesearch, snes, ierr))
PetscCall(SNESLineSearchGetVecs(linesearch, x, f, y, w, g, ierr))
PetscCall(VecNorm(y,NORM_2,ynorm,ierr))
PetscCall(VecAXPY(x,mone,y,ierr))
PetscCall(SNESComputeFunction(snes,x,f,ierr))
PetscCall(VecNorm(f,NORM_2,gnorm,ierr))
PetscCall(VecNorm(x,NORM_2,xnorm,ierr))
PetscCall(VecNorm(y,NORM_2,ynorm,ierr))
PetscCall(SNESLineSearchSetNorms(linesearch, xnorm, gnorm, ynorm,ierr))
return
end
One can dump PETSc matrices and vectors to the screen in an ASCII format that MATLAB can read
in directly. This is done with the command line options -vec_view ::ascii_matlab or -mat_view
::ascii_matlab. To write a file, use -vec_view :filename.m:ascii_matlab or -mat_view
:filename.m:ascii_matlab.
This causes the PETSc program to print the vectors and matrices every time VecAssemblyEnd() or
MatAssemblyEnd() are called. To provide finer control over when and what vectors and matrices are
dumped one can use the VecView() and MatView() functions with a viewer type of PETSCVIEWERASCII
(see PetscViewerASCIIOpen(), PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_STDOUT_SELF,
or PETSC_VIEWER_STDOUT_(MPI_Comm)). Before calling the viewer, set the output format with, for example,
PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD,PETSC_VIEWER_ASCII_MATLAB);
VecView(A,PETSC_VIEWER_STDOUT_WORLD);
PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD);
The name of each PETSc variable printed for MATLAB may be set with
PetscObjectSetName((PetscObject)A,"name");
One can also read PETSc binary files (see Viewers: Looking at PETSc Objects) directly into MATLAB
via the scripts available in $PETSC_DIR/share/petsc/matlab. This requires less disk space and is
recommended for all but the smallest data sizes. One can also use
PetscViewerPushFormat(viewer,PETSC_VIEWER_BINARY_MATLAB)
to dump both a PETSc binary file and a corresponding .info file which PetscReadBinaryMatlab.m
will use to format the binary file in more complex cases, such as using a DMDA. For an example, see DM
Tutorial ex7. In MATLAB one may then generate a useful structure. For example:
setenv('PETSC_DIR','~/petsc');
setenv('PETSC_ARCH','arch-darwin-double-debug');
addpath('~/petsc/share/petsc/matlab');
gridData=PetscReadBinaryMatlab('output_file');
One can also send PETSc matrices and vectors directly to an interactive MATLAB session through a socket. On the
PETSc side one opens a socket viewer with
PetscViewerSocketOpen(MPI_Comm comm,const char machine[],int port,PetscViewer *v);
(port is usually set to PETSC_DEFAULT; use NULL for the machine if the MATLAB interactive session is
running on the same machine as the PETSc program) and then sends matrices or vectors via
VecView(Vec A,v);
MatView(Mat B,v);
See Viewers: Looking at PETSc Objects for more on PETSc viewers. One may start the MATLAB pro-
gram manually or use the PETSc command PetscStartMatlab(MPI_Comm,char *machine,char
*script,FILE **fp); where machine and script may be NULL. It is also possible to start your
PETSc program from MATLAB via launch().
To receive the objects in MATLAB, make sure that $PETSC_DIR/$PETSC_ARCH/lib/petsc/matlab
and $PETSC_DIR/share/petsc/matlab are in the MATLAB path. Use p = PetscOpenSocket();
(or p = PetscOpenSocket(portnum) if you provided a port number in your call to PetscViewerSocketOpen()),
and then a = PetscBinaryRead(p); returns the object passed from PETSc.
PetscBinaryRead() may be called any number of times. Each call should correspond on the PETSc side
with viewing a single vector or matrix. close() closes the connection from MATLAB. On the PETSc side,
one should destroy the viewer object with PetscViewerDestroy().
For an example, which includes sending data back to PETSc, see Vec Tutorial ex42 and the associated .m
file.
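The PETSc side of such an exchange can be as simple as the sketch below; the vector size and variable names are illustrative, and the socket viewer is opened on the default port as described above.
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec         x;
  PetscViewer v;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, 100));
  PetscCall(VecSetFromOptions(x));
  PetscCall(VecSet(x, 1.0));
  /* NULL machine: MATLAB runs on the same host; PETSC_DEFAULT selects the default port */
  PetscCall(PetscViewerSocketOpen(PETSC_COMM_WORLD, NULL, PETSC_DEFAULT, &v));
  PetscCall(VecView(x, v));          /* received in MATLAB with a = PetscBinaryRead(p); */
  PetscCall(PetscViewerDestroy(&v)); /* closes the connection on the PETSc side */
  PetscCall(VecDestroy(&x));
  PetscCall(PetscFinalize());
  return 0;
}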
PETSc can also use the MATLAB Engine, which runs MATLAB as a computational back end. One creates the engine with
PetscMatlabEngineCreate(MPI_Comm comm,const char machine[],PetscMatlabEngine *e);
where machine is the name of the machine hosting MATLAB (NULL may be used for localhost). One can
send objects to MATLAB via
PetscMatlabEnginePut(PetscMatlabEngine e,PetscObject obj);
and retrieve them with
PetscMatlabEngineGet(PetscMatlabEngine e,PetscObject obj);
One cannot use MATLAB interactively in this mode, but one can send MATLAB commands via
PetscMatlabEngineEvaluate(PetscMatlabEngine,"format",...);
The name of each PETSc variable passed to MATLAB may be set with
PetscObjectSetName((PetscObject)A,"name");
The text output produced by the MATLAB commands can be obtained with
PetscMatlabEngineGetOutput(PetscMatlabEngine,char **);
or
PetscMatlabEnginePrintOutput(PetscMatlabEngine,FILE*).
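A minimal sketch of this workflow, assuming PETSc was configured with MATLAB support and using an illustrative helper name ScaleInMatlab and MATLAB command, might look as follows:
#include <petscvec.h>
#include <petscmatlab.h>

/* Send a vector to the MATLAB Engine, evaluate a command on it, and fetch it back */
static PetscErrorCode ScaleInMatlab(Vec x)
{
  PetscMatlabEngine mengine;

  PetscFunctionBeginUser;
  PetscCall(PetscMatlabEngineCreate(PETSC_COMM_SELF, NULL, &mengine)); /* NULL: MATLAB on localhost */
  PetscCall(PetscObjectSetName((PetscObject)x, "x"));                  /* name of the variable inside MATLAB */
  PetscCall(PetscMatlabEnginePut(mengine, (PetscObject)x));
  PetscCall(PetscMatlabEngineEvaluate(mengine, "x = 2.0*x;"));
  PetscCall(PetscMatlabEngineGet(mengine, (PetscObject)x));
  PetscCall(PetscMatlabEngineDestroy(&mengine));
  PetscFunctionReturn(0);
}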
To activate MATLAB on a head node that does not have access to the internet:
• First ssh into the head node using the command ssh node_name.
• Obtain the host ID using the command ip addr | grep ether. You will see something like link/ether
xx:xx:xx:xx:xx:xx ABC yy:yy:yy:yy:yy:yy; note the value xx:xx:xx:xx:xx:xx.
• Log in to your MathWorks account from a computer that has internet access. You will see the licenses
available to your account; select a license from the list.
• Then select the Install and Activate option, and choose the Activate to Retrieve License File option.
4.3 Profiling
PETSc includes a consistent, lightweight scheme for profiling application programs. The PETSc routines
automatically log performance data if certain options are specified at runtime. The user can also log in-
formation about application codes for a complete picture of performance. In addition, as described in
Interpreting -log_view Output: The Basics, PETSc provides a mechanism for printing informative messages
about computations. Basic Profiling Information introduces the various profiling options in PETSc, while
the remainder of the chapter focuses on details such as monitoring application codes and tips for accurate
profiling.
If an application code and the PETSc libraries have been configured with --with-log=1, the default, then
various kinds of profiling of code between calls to PetscInitialize() and PetscFinalize() can be
activated at runtime. The profiling options include the following:
• -log_view [:filename] - Prints an ASCII version of performance data at program’s conclu-
sion. These statistics are comprehensive and concise and require little overhead; thus, -log_view is
intended as the primary means of monitoring the performance of PETSc codes.
• -info [infofile] - Prints verbose information about code to stdout or an optional file. This
option provides details about algorithms, data structures, etc. Since the overhead of printing such
output slows a code, this option should not be used when evaluating a program’s performance.
• -log_trace [logfile] - Traces the beginning and ending of all PETSc events. This option, which
can be used in conjunction with -info, is useful to see where a program is hanging without running
in the debugger.
As discussed in Using -log_mpe with Jumpshot, additional profiling can be done with MPE.
As shown in the listing in Profiling Programs, the option -log_view [:filename] activates printing of profile
data to standard output or an ASCII file at the conclusion of a program. See PetscLogView() for all the
possible output options.
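The same summary can also be produced programmatically; in the following sketch the output file name profile.txt is arbitrary:
PetscViewer viewer;
PetscCall(PetscLogDefaultBegin()); /* start collecting the default performance log */
/* ... the computation to be profiled ... */
PetscCall(PetscViewerASCIIOpen(PETSC_COMM_WORLD, "profile.txt", &viewer));
PetscCall(PetscLogView(viewer));   /* write the summary in the same format as -log_view */
PetscCall(PetscViewerDestroy(&viewer));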
We print performance data for each routine, organized by PETSc libraries, followed by any user-defined
events (discussed in Profiling Application Codes). For each routine, the output data include the maximum
time and floating point operation (flop) rate over all processes. Information about parallel performance is
also included, as discussed in the following section.
For the purpose of PETSc floating point operation counting, we define one flop as one operation of any of the
following types: multiplication, division, addition, or subtraction. For example, one VecAXPY() operation,
which computes y = αx + y for vectors of length N , requires 2N flop (consisting of N additions and N
multiplications). Bear in mind that flop rates present only a limited view of performance, since memory
loads and stores are the real performance barrier.
For simplicity, the remainder of this discussion focuses on interpreting profile data for the KSP library,
which provides the linear solvers at the heart of the PETSc package. Recall the hierarchical organization
of the PETSc library, as shown in Numerical Libraries in PETSc. Each KSP solver is composed of a PC
(preconditioner) and a KSP (Krylov subspace) part, which are in turn built on top of the Mat (matrix) and
Vec (vector) modules. Thus, operations in the KSP module are composed of lower-level operations in these
packages. Note also that the nonlinear solvers library, SNES, is built on top of the KSP module, and the
timestepping library, TS, is in turn built on top of SNES.
We briefly discuss interpretation of the sample -log_view output, which was generated by solving a linear
system on one process using restarted GMRES and ILU preconditioning. The linear solvers in KSP consist of
two basic phases, KSPSetUp() and KSPSolve(), each of which consists of a variety of actions, depending
on the particular solution technique. For the case of using the PCILU preconditioner and KSPGMRES Krylov
subspace method, the breakdown of PETSc routines is listed below. As indicated by the levels of indentation,
the operations in KSPSetUp() include all of the operations within PCSetUp(), which in turn include
MatILUFactor(), and so on.
• KSPSetUp - Set up linear solver
– PCSetUp - Set up preconditioner
∗ MatILUFactor - Factor preconditioning matrix
· MatILUFactorSymbolic - Symbolic factorization phase
· MatLUFactorNumeric - Numeric factorization phase
• KSPSolve - Solve linear system
– PCApply - Apply preconditioner
∗ MatSolve - Forward/backward triangular solves
– KSPGMRESOrthog - Orthogonalization in GMRES
∗ VecDot or VecMDot - Inner products
∗ VecAXPY or VecMAXPY - vector updates
– MatMult - Matrix-vector product
– MatMultAdd - Matrix-vector product + vector addition
∗ VecScale, VecNorm, VecAXPY, VecCopy, …
The summaries printed via -log_view reflect this routine hierarchy. For example, the performance sum-
maries for a particular high-level routine such as KSPSolve() include all of the operations accumulated in
the lower-level components that make up the routine.
The output produced with -log_view is flat, meaning that the hierarchy of PETSc operations is not
completely clear. For a particular problem, the user should generally have an idea of the basic operations
that are required for its implementation (e.g., which operations are performed when using GMRES and ILU,
as described above), so that interpreting the -log_view data should be relatively straightforward. If this
is problematic then it is also possible to examine the profiling information in a nested format. For more
information see Profiling Nested Events.
We next discuss performance summaries for parallel programs, as shown in the two listings below, which
present the output generated by the -log_view option. The program that generated this data is KSP
Tutorial ex10. The code loads a matrix and right-hand-side vector from a binary file and then solves the
resulting linear system; the program then repeats this process for a second linear system. This particular
case was run on four processors of an Intel x86_64 Linux cluster, using restarted GMRES and the block
Jacobi preconditioner, where each block was solved with ILU. The two input files medium and arco6 can
be downloaded from this FTP link.
The first listing presents an overall performance summary, including times, floating-point operations, com-
putational rates, and message-passing activity (such as the number and size of messages sent and collective
operations). Summaries for various user-defined stages of monitoring (as discussed in Profiling Multiple Sec-
tions of Code) are also given. Information about the various phases of computation then follow (as shown
separately here in the second listing). Finally, a summary of object creation and destruction is presented.
Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------
./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    counts   %Total     Avg        %Total    counts   %Total
 0:      Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.9%
 1:   Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02       0.1%  2.200e+01  10.2%
 2:      KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.9%
 3:      KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02       0.1%  3.900e+01  18.1%
 4:   Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05      62.4%  2.200e+01  10.2%
 5:      KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.9%
 6:      KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04      37.5%  1.200e+02  55.6%
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
...
--- Event Stage 3: KSPSolve 0
We next focus on the summaries for the various phases of the computation, as given in the table within the
following listing. The summary for each phase presents the maximum times and flop rates over all processes,
as well as the ratio of maximum to minimum times and flop rates for all processes. A ratio of approximately
1 indicates that computations within a given phase are well balanced among the processes; as the ratio
increases, the balance becomes increasingly poor. Also, the total computational rate (in units of MFlop/sec)
is given for each phase in the final column of the phase summary table.
Total MFlop/sec = 10^-6 * (sum of flop over all processors) / (max time over all processors)
Note: Total computational rates < 1 MFlop are listed as 0 in this column of the phase summary table.
Additional statistics for each phase include the total number of messages sent, the average message length,
and the number of global reductions.
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
------------------------------------------------------------------------------------------------------------------------------------------------------
...
MatMult            60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32 28100100  0 10731
MatSolve           61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26 26  0  0  0 12491
VecMDot            59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18 21  0  0 49 13189
VecNorm            61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4  1  0  0 51  4332
VecScale           61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  6602
VecCopy             2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet             64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY             3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  8712
VecMAXPY           61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12 23  0  0  0 23393
VecScatterBegin    60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1  0100100  0     0
VecScatterEnd      60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8  0  0  0  0     0
VecNormalize       61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5  2  0  0 51  4632
KSPGMRESOrthog     59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29 42  0  0 49 17185
KSPSolve            1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99100100100100 12673
PCSetUpOnBlocks     1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply            61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28 26  0  0  0 11726
------------------------------------------------------------------------------------------------------------------------------------------------------
As discussed in the preceding section, the performance summaries for higher-level PETSc routines include
the statistics for the lower levels of which they are made up. For example, the communication within
matrix-vector products MatMult() consists of vector scatter operations, as given by the routines Vec-
ScatterBegin() and VecScatterEnd().
The final data presented are the percentages of the various statistics (time (%T), flop/sec (%F), messages (%M),
average message length (%L), and reductions (%R)) for each event relative to the total computation and to any
user-defined stages (discussed in Profiling Multiple Sections of Code). These statistics can aid in optimizing
performance, since they indicate the sections of code that could benefit from various kinds of tuning. Hints
for Performance Tuning gives suggestions about achieving good performance with PETSc codes.
The additional option -log_view_memory causes the display of additional columns of information about how
much memory was allocated and freed during each logged event. This is useful to understand what phases
of a computation require the most memory.
It is also possible to use the Jumpshot package [HL91] to visualize PETSc events. This package comes with
the MPE software, which is part of the MPICH [Getal] implementation of MPI. The option
-log_mpe [logfile]
creates a logfile of events appropriate for viewing with Jumpshot. The user can either use the default logging
file or specify a name via logfile. Events can be deactivated as described in Restricting Event Logging.
The user can also log MPI events. To do this, simply consider the PETSc application as any MPI application,
and follow the MPI implementation’s instructions for logging MPI calls. For example, when using MPICH,
this merely requires adding -llmpich to the library list before -lmpich.
It is possible to output the PETSc logging information in a nested format where the hierarchy of events is
explicit. This output can be generated either as an XML file or as a text file in a format suitable for viewing
as a flame graph.
One can generate the XML output by passing the option -log_view :[logfilename]:ascii_xml. It
can be viewed by copying ${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl into the
current directory, then opening the logfile in your browser.
The flame graph output can be generated with the option -log_view :[logfile]:ascii_flamegraph.
It can then be visualised with either FlameGraph or speedscope.
Note that user-defined stages (see Profiling Multiple Sections of Code) will be ignored when using this nested
format.
PETSc automatically logs object creation, times, and floating-point counts for the library routines. Users
can easily supplement this information by profiling their application codes as well. The basic steps involved
in logging a user-defined portion of code, called an event, are shown in the code fragment below:
PetscLogEvent USER_EVENT;
PetscClassId classid;
PetscLogDouble user_event_flops;
PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);
One must register the event by calling PetscLogEventRegister(), which assigns a unique integer to
identify the event for profiling purposes:
PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);
Here string is a user-defined event name and classid is a class identifier, typically obtained with
PetscClassIdRegister() as in the fragment above; see the manual page for details. The argument
returned in e should then be passed to the PetscLogEventBegin() and PetscLogEventEnd()
routines.
Events are logged by using the pair
PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
The four objects are the PETSc objects that are most closely associated with the event. For instance, in
a matrix-vector product they would be the matrix and the two vectors. These objects can be omitted by
specifying 0 for o1 - o4. The code between these two routine calls will be automatically timed and logged
as part of the specified event.
The user can log the number of floating-point operations for this segment of code by calling
PetscLogFlops(number of flop for this code segment);
between the calls to PetscLogEventBegin() and PetscLogEventEnd(). This value will automatically
be added to the global flop counter for the entire program.
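As a concrete illustration (the arrays x and y, the scalar alpha, the length n, and the event USER_EVENT are assumed to have been set up as in the fragment above), a hand-written axpy-like loop of length n performs n additions and n multiplications and would therefore be logged as 2n flop:
PetscCall(PetscLogEventBegin(USER_EVENT, 0, 0, 0, 0));
for (PetscInt i = 0; i < n; i++) y[i] += alpha * x[i]; /* n multiplications + n additions */
PetscCall(PetscLogFlops(2.0 * n));                     /* record the 2n flop of the loop */
PetscCall(PetscLogEventEnd(USER_EVENT, 0, 0, 0, 0));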
By default, the profiling produces a single set of statistics for all code between the PetscInitialize()
and PetscFinalize() calls within a program. One can independently monitor several “stages” of code
by switching among the various stages with the commands
PetscLogStagePush(PetscLogStage stage);
PetscLogStagePop();
The routine PetscLogStageRegister(const char name[],PetscLogStage *stage)
allows one to associate a name with a stage; these names are printed whenever summaries are generated
with -log_view. The following code fragment uses several profiling stages within a program.
PetscInitialize(int *argc,char ***args,0,0);
/* stage 0 of code here */
PetscLogStageRegister("Stage 0 of Code", &stagenum0);
for (i=0; i<ntimes; i++) {
PetscLogStageRegister("Stage 1 of Code", &stagenum1);
PetscLogStagePush(stagenum1);
/* stage 1 of code here */
PetscLogStagePop();
}
PetscFinalize();
The listings above show output generated by -log_view for a program that employs several profiling
stages. In particular, this program is subdivided into six stages: loading a matrix and right-hand-side
vector from a binary file, setting up the preconditioner, and solving the linear system; this sequence is then
repeated for a second linear system. For simplicity, the second listing contains output only for stages 5 and
6 (linear solve of the second system), which comprise the part of this computation of most interest to us
in terms of performance monitoring. This code organization (solving a small linear system followed by a
larger system) enables generation of more accurate profiling statistics for the second system by overcoming
the often considerable overhead of paging, as discussed in Accurate Profiling and Paging Overheads.
By default, all PETSc operations are logged. To enable or disable the PETSc logging of individual events,
one uses the commands
PetscLogEventActivate(PetscLogEvent event);
PetscLogEventDeactivate(PetscLogEvent event);
The event may be either a predefined PETSc event (as listed in the file $PETSC_DIR/include/
petsclog.h) or one obtained with PetscLogEventRegister() (as described in Profiling Application
Codes).
PETSc also provides routines that deactivate (or activate) logging for entire components of the library.
Currently, the components that support such logging (de)activation are Mat (matrices), Vec (vectors), KSP
(linear solvers, including KSP and PC), and SNES (nonlinear solvers):
PetscLogEventDeactivateClass(MAT_CLASSID);
PetscLogEventDeactivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventDeactivateClass(VEC_CLASSID);
PetscLogEventDeactivateClass(SNES_CLASSID);
and
PetscLogEventActivateClass(MAT_CLASSID);
PetscLogEventActivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventActivateClass(VEC_CLASSID);
PetscLogEventActivateClass(SNES_CLASSID);
Users can activate the printing of verbose information about algorithms, data structures, etc. to the screen
by using the option -info or by calling PetscInfoAllow(PETSC_TRUE). Such logging, which is used
throughout the PETSc libraries, can aid the user in understanding algorithms and tuning program perfor-
mance. For example, as discussed in Sparse Matrices, -info activates the printing of information about
memory allocation during matrix assembly.
Application programmers can employ this logging as well, by using the routine
PetscInfo(PetscObject obj,const char message[],...);
where obj is the PETSc object associated most closely with the logging statement, message. For example,
the line search Newton methods call PetscInfo() on the SNES object to report details of each step of the line search.
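An illustrative call from application code might look as follows; the matrix A, the block size bs, and the message text are assumptions made for the example.
Mat      A;   /* an assembled application matrix (assumed) */
PetscInt bs;  /* e.g. obtained from MatGetBlockSize() */
/* ... */
PetscCall(PetscInfo(A, "Assembled application matrix with block size %" PetscInt_FMT "\n", bs));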
One can selectively turn off informative messages about any of the basic PETSc objects (e.g., Mat, SNES)
with the command
PetscInfoDeactivateClass(int object_classid)
where object_classid is one of MAT_CLASSID, SNES_CLASSID, etc. Messages can be reactivated with
the command
PetscInfoActivateClass(int object_classid)
Such deactivation can be useful when one wishes to view information about higher-level PETSc libraries
(e.g., TS and SNES) without seeing all lower level data as well (e.g., Mat). One can deactivate events at
runtime for matrix and linear solver libraries via -info [no_mat, no_ksp].
4.3.6 Time
PETSc application programmers can access the wall clock time directly with the command
PetscLogDouble time;
PetscCall(PetscTime(&time));
which returns the current time in seconds since the epoch, and is commonly implemented with MPI_Wtime.
A floating point number is returned in order to express fractions of a second. In addition, as discussed in
Profiling Application Codes, PETSc can automatically profile user-defined segments of code.
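For instance, a code segment can be timed by differencing two readings; this sketch assumes the segment of interest sits between the two calls:
PetscLogDouble t_start, t_end;
PetscCall(PetscTime(&t_start));
/* ... code segment to be timed ... */
PetscCall(PetscTime(&t_end));
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Segment took %g seconds\n", (double)(t_end - t_start)));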
All output from PETSc programs (including informative messages, profiling information, and convergence
data) can be saved to a file by using the command line option -history [filename]. If no file name
is specified, the output is stored in the file ${HOME}/.petschistory. Note that this option only saves
output printed with the PetscPrintf() and PetscFPrintf() commands, not the standard printf()
and fprintf() statements.
One factor that often plays a significant role in profiling a code is paging by the operating system. Generally,
when running a program, only a few pages required to start it are loaded into memory rather than the entire
executable. When the execution proceeds to code segments that are not in memory, a pagefault occurs,
prompting the required pages to be loaded from the disk (a very slow process). This activity distorts the
results significantly. (The paging effects are noticeable in the log files generated by -log_mpe, which is
described in Using -log_mpe with Jumpshot.)
To eliminate the effects of paging when profiling the performance of a program, we have found an effective
procedure is to run the exact same code on a small dummy problem before running it on the actual problem
of interest. We thus ensure that all code required by a solver is loaded into memory during solution of the
small problem. When the code proceeds to the actual (larger) problem of interest, all required pages have
already been loaded into main memory, so that the performance numbers are not distorted.
When this procedure is used in conjunction with the user-defined stages of profiling described in Profiling
Multiple Sections of Code, we can focus easily on the problem of interest. For example, we used this
technique in the program KSP Tutorial ex10 to generate the timings within listing and listing. In this case,
the profiled code of interest (solving the linear system for the larger problem) occurs within event stages 5 and
6. Interpreting -log_view Output: Parallel Performance provides details about interpreting such profiling
data.
In particular, the macros
PetscPreLoadBegin(PetscBool flag,const char name[]),
PetscPreLoadStage(const char name[]),
and
PetscPreLoadEnd()
can be used to easily convert a regular PETSc program to one that uses preloading. The command line
options -preload true and -preload false may be used to turn on and off preloading at run time for
PETSc programs that use these macros.
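A rough sketch of their use follows; the stage names "Load" and "Solve" are arbitrary.
PetscPreLoadBegin(PETSC_TRUE, "Load");
  /* load or build the problem; PetscPreLoadIt is 0 on the preload pass */
PetscPreLoadStage("Solve");
  /* solve the system; the first pass serves only to bring the needed code pages into memory */
PetscPreLoadEnd();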
When a CUDA executable is preceded by nsys profile -f true -o filename, the default event
profiling will add annotations to the Nsight Systems data that can help navigation of filename with
the Nsight Systems GUI (https://developer.nvidia.com/nsight-systems). The Nsight Systems GUI lets you
see a timeline of code performance information such as kernels, mallocs, and CPU-GPU communication, as well as higher-level
data such as register usage in a kernel, shown in a popup window when the mouse hovers over
the section. The PETSc events are also automatically displayed in Nsight if the option -log_view is
used.
For an MPI parallel job, only one process should invoke nsys; a bash launcher script can, for example, prepend the
nsys profile command to the command line of rank 0 only and run the plain executable on the other ranks.
TAU profiles can be generated without the need for instrumentation through the use of the perfstubs package.
PETSc by default is configured with --with-tau-perfstubs. To generate profiles with TAU, first setup
TAU:
wget http://tau.uoregon.edu/tau.tgz
./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
export PATH=<tau dir>/x86_64/bin:$PATH
For more information on configuring TAU, see http://tau.uoregon.edu. Next, run your program with TAU.
For instance, to profile ex56,
cd $PETSC_DIR/src/snes/tutorials
make ex56
mpirun -n 4 tau_exec -T mpi ./ex56 <args>
This should produce four profile.* files with profile data that can be viewed with paraprof/pprof.
4.4 Hints for Performance Tuning
This chapter provides hints on how to achieve the best performance with PETSc, particularly on
distributed-memory machines with multiple CPU sockets per node. We focus on machine-related perfor-
mance optimization here; algorithmic aspects like preconditioner selection are not the focus of this section.
Most operations in PETSc deal with large datasets (typically vectors and sparse matrices) and perform
relatively few arithmetic operations for each byte loaded or stored from global memory. Therefore, the
arithmetic intensity expressed as the ratio of floating point operations to the number of bytes loaded and
stored is usually well below unity for typical PETSc operations. On the other hand, modern CPUs are able to
execute on the order of 10 floating point operations for each byte loaded or stored. As a consequence, almost
all PETSc operations are limited by the rate at which data can be loaded or stored (memory bandwidth
limited) rather than by the rate of floating point operations.
This section discusses ways to maximize the memory bandwidth achieved by applications based on PETSc.
Where appropriate, we include benchmark results in order to provide quantitative results on typical perfor-
mance gains one can achieve through parallelization, both on a single compute node and across nodes. In
particular, we start with the answer to the common question of why performance generally does not increase
20-fold with a 20-core CPU.
Consider the addition of two large vectors, with the result written to a third vector. Because there are no
dependencies across the different entries of each vector, the operation is embarrassingly parallel.
[Figure: plot (logarithmic axes) of memory bandwidth versus the number of processes/threads for E5-2670 v3 (Haswell), E5-2650 v2 (Ivy Bridge), E5-2620 (Sandy Bridge), Xeon Phi 7120 (Knights Corner), and Xeon Phi 7250 (Knights Landing) with DDR4 and with MCDRAM.]
Fig. 4.1: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the number of
processes used. One can get close to peak memory bandwidth with only a few processes.
As Fig. 4.1 shows, the performance gains due to parallelization on different multi- and many-core CPUs
quickly saturates. The reason is that only a fraction of the total number of CPU cores is required to
saturate the memory channels. For example, a dual-socket system equipped with Haswell 12-core Xeon
CPUs achieves more than 80 percent of achievable peak memory bandwidth with only four processes per
socket (8 total), cf. Fig. 4.1. Consequently, running with more than 8 MPI ranks on such a system will not
increase performance substantially. For the same reason, PETSc-based applications usually do not benefit
from hyper-threading.
PETSc provides a simple way to measure memory bandwidth for different numbers of processes via the
target make streams executed from $PETSC_DIR. The output provides an overview of the possible
speedup one can obtain on the given machine (not necessarily a shared memory system). For example, the
following is the most relevant output obtained on a dual-socket system equipped with two six-core-CPUs
with hyperthreading:
np speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
On this machine, one should expect a speed-up of typical memory bandwidth-bound PETSc applications of
at most 4x when running multiple MPI ranks on the node. Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks usually implies better preconditioners
and better performance for smaller problems, the best performance for PETSc applications may be obtained
with fewer ranks than there are physical CPU cores available.
Following the results from the above run of make streams, we recommend using additional nodes instead
of placing additional MPI ranks on the nodes. In particular, weak scaling (i.e. constant load per process,
increasing the number of processes) and strong scaling (i.e. constant total work, increasing the number of
processes) studies should keep the number of processes per node constant.
CPUs in nodes with more than one CPU socket are internally connected via a high-speed fabric, cf. Fig. 4.2,
to enable data exchange as well as cache coherency. Because main memory on modern systems is connected
via the integrated memory controllers on each CPU, memory is accessed in a non-uniform way: A process
running on one socket has direct access to the memory channels of the respective CPU, whereas requests for
memory attached to a different CPU socket need to go through the high-speed fabric. Consequently, best
aggregate memory bandwidth on the node is obtained when the memory controllers on each CPU are fully
saturated. However, full saturation of memory channels is only possible if the data is distributed across the
different memory channels.
[Figure: two CPU sockets connected by an interconnect, each socket attached to its own main memory.]
Fig. 4.2: Schematic of a two-socket NUMA system. Processes should be spread across both CPUs to obtain
full bandwidth.
Data in memory on modern machines is allocated by the operating system based on a first-touch policy. That
is, memory is not allocated at the point of issuing malloc(), but at the point when the respective memory
segment is actually touched (read or write). Upon first-touch, memory is allocated on the memory channel
associated with the respective CPU the process is running on. Only if all memory on the respective CPU is
already in use (either allocated or as IO cache), memory available through other sockets is considered.
Maximum memory bandwidth can be achieved by ensuring that processes are spread over all sockets in the
respective node. For example, the recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to know the enumeration of cores and pass
the requested information to mpirun. Consider the hardware topology information returned by lstopo
(part of the hwloc package) for the following two-socket machine, in which each CPU consists of six cores
and supports hyperthreading:
Machine (126GB total)
NUMANode L#0 (P#0 63GB)
Package L#0 + L3 L#0 (15MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#12)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#13)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#14)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#15)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#16)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#17)
NUMANode L#1 (P#1 63GB)
Package L#1 + L3 L#1 (15MB)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#18)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#19)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#20)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#21)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#22)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#23)
The relevant physical processor IDs are shown in parentheses prefixed by P#. Here, IDs 0 and 12 share the
same physical core and have a common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the same
socket and have a common L3 cache.
A good placement for a run with six processes is to locate three processes on the first socket and three
processes on the second socket. Unfortunately, mechanisms for process placement vary across MPI imple-
mentations, so make sure to consult the manual of your MPI implementation. The following discussion is
based on how processor placement is done with MPICH and OpenMPI, where one needs to pass --bind-to
core --map-by socket to mpirun.
In this configuration, process 0 is bound to the first physical core on the first socket (with IDs 0 and 12),
process 1 is bound to the first core on the second socket (IDs 6 and 18), and similarly for the remaining
processes. The achieved bandwidth of 45 GB/sec is close to the practical peak of about 50 GB/sec available
on the machine. If, however, all MPI processes are located on the same socket, memory bandwidth drops
significantly.
All processes are now mapped to cores on the same socket. As a result, only the first memory channel is
fully saturated at 25.5 GB/sec.
One must not assume that mpirun uses good defaults. To demonstrate, compare the full output of make
streams from Memory Bandwidth vs. Processes, reproduced below, with the results obtained by
passing --bind-to core --map-by socket:
$ make streams
np speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
For the non-optimized run shown above, the speedup obtained when using any number of processes
between 3 and 13 is essentially constant up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the speedup increase again. In
contrast, the results of make streams with proper processor placement show a slightly higher overall
parallel speedup (identical baselines), smaller performance fluctuations, and more than 90 percent of peak
bandwidth with only six processes.
Machines with job submission systems such as SLURM usually provide similar mechanisms for processor
placements through options specified in job submission scripts. Please consult the respective manuals.
For a typical, memory bandwidth-limited PETSc application, the primary consideration in placing MPI pro-
cesses is ensuring that processes are evenly distributed among sockets, and hence using all available memory
channels. Increasingly complex processor designs and cache hierarchies, however, mean that performance
may also be sensitive to how processes are bound to the resources within each socket. Performance on the
two processor machine in the preceding example may be relatively insensitive to such placement decisions,
because one L3 cache is shared by all cores within a NUMA domain, and each core has its own L2 and L1
caches. However, processors that are less “flat”, with more complex hierarchies, may be more sensitive. In
many AMD Opterons or the second-generation “Knights Landing” Intel Xeon Phi, for instance, L2 caches
are shared between two cores. On these processors, placing consecutive MPI ranks on cores that share the
same L2 cache may benefit performance if the two ranks communicate frequently with each other, because
the latency between cores sharing an L2 cache may be roughly half that of two cores not sharing one. There
may be benefit, however, in placing consecutive ranks on cores that do not share an L2 cache, because (if
there are fewer MPI ranks than cores) this increases the total L2 cache capacity and bandwidth available to
the application. There is a trade-off to be considered between placing processes close together (in terms of
shared resources) to optimize for efficient communication and synchronization vs. farther apart to maximize
available resources (memory channels, caches, I/O channels, etc.), and the best strategy will depend on the
application and the software and hardware stack.
Different process placement strategies can affect performance at least as much as some commonly explored
settings, such as compiler optimization levels. Unfortunately, exploration of this space is complicated by
two factors: First, processor and core numberings may be completely arbitrary, changing with BIOS version,
etc., and second—as already noted—there is no standard mechanism used by MPI implementations (or job
schedulers) to specify process affinity. To overcome the first issue, we recommend using the lstopo utility
of the Portable Hardware Locality (hwloc) software package (which can be installed by configuring PETSc
with –download-hwloc) to understand the processor topology of your machine. We cannot fully address
the second issue (consult the documentation for your MPI implementation and/or job scheduler), but we
offer some general observations on understanding placement options:
• An MPI implementation may support a notion of domains in which a process may be pinned. A domain
may simply correspond to a single core; however, the MPI implementation may allow a great deal of flexibility
in specifying domains that encompass multiple cores, span sockets, etc. Some implementations, such
as Intel MPI, provide means to specify whether domains should be “compact”—composed of cores
sharing resources such as caches—or “scatter”-ed, with little resource sharing (possibly even spanning
sockets).
• Separate from the specification of domains, MPI implementations often support different orderings in
which MPI ranks should be bound to these domains. Intel MPI, for instance, supports “compact”
ordering to place consecutive ranks close in terms of shared resources, “scatter” to place them far
apart, and “bunch” to map proportionally to sockets while placing ranks as close together as possible
within the sockets.
• An MPI implementation that supports process pinning should offer some way to view the rank assign-
ments. Use this output in conjunction with the topology obtained via lstopo or a similar tool to
determine if the placements correspond to something you believe is reasonable for your application.
Do not assume that the MPI implementation is doing something sensible by default!
This section looks into a potpourri of performance pitfalls encountered by users in the past. Many of these
pitfalls require a deeper understanding of the system and experience to detect. The purpose of this section
is to summarize and share our experience so that these pitfalls can be avoided in the future.
PETSc’s configure defaults to building PETSc with debug mode enabled. Any code development should
be done in this mode, because it provides handy debugging facilities such as accurate stack traces, memory
leak checks, and memory corruption checks. Note that PETSc has no reliable way of knowing whether
a particular run is a production or debug run. In the case that a user requests profiling information via
-log_view, a debug build of PETSc issues the following warning:
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option, #
# To get timing results run configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Conversely, one way of checking whether a particular build of PETSc has debugging enabled is to inspect
the output of -log_view.
Debug mode will generally be most useful for code development if appropriate compiler options are set to
facilitate debugging. The compiler should be instructed to generate binaries with debug symbols (command
line option -g for most compilers), and the optimization level chosen should either completely disable
optimizations (-O0 for most compilers) or enable only optimizations that do not interfere with debugging
(GCC, for instance, supports a -Og optimization level that does this).
Only once the new code is thoroughly tested and ready for production should one disable debugging facilities
by passing --with-debugging=no to configure.
One should also ensure that an appropriate compiler optimization level is set. Note that some
compilers (e.g., Intel) default to fairly comprehensive optimization levels, while others (e.g., GCC) default
to no optimization at all. The best optimization flags will depend on your code, the compiler, and the target
architecture, but we offer a few guidelines for finding those that will offer the best performance:
• Most compilers have a number of optimization levels (with level n usually specified via -On) that pro-
vide a quick way to enable sets of several optimization flags. We suggest trying the higher optimization
levels (the highest level is not guaranteed to produce the fastest executable, so some experimentation
may be merited). With most recent processors now supporting some form of SIMD or vector instruc-
tions, it is important to choose a level that enables the compiler’s auto-vectorizer; many compilers do
not enable auto-vectorization at lower optimization levels (e.g., GCC does not enable it below -O3 and
the Intel compiler does not enable it below -O2).
• For processors supporting newer vector instruction sets, such as Intel AVX2 and AVX-512, it is also im-
portant to direct the compiler to generate code that targets these processors (e.g., -march=native);
otherwise, the executables built will not utilize the newer instructions sets and will not take advantage
of the vector processing units.
• Beyond choosing the optimization levels, some value-unsafe optimizations (such as using reciprocals
of values instead of dividing by those values, or allowing re-association of operands in a series of
calculations) for floating point calculations may yield significant performance gains. Compilers often
provide flags (e.g., -ffast-math in GCC) to enable a set of these optimizations, and they may be
turned on when using options for very aggressive optimization (-fast or -Ofast in many compilers).
These are worth exploring to maximize performance, but, if employed, it is important to verify that these
do not cause erroneous results with your code, since calculations may violate the IEEE standard for
floating-point arithmetic.
Profiling
Users should not spend time optimizing a code until after having determined where it spends the bulk of its
time on realistically sized problems. As discussed in detail in Profiling, the PETSc routines automatically
log performance data if certain runtime options are specified.
To obtain a summary of where and how much time is spent in different sections of the code, use one of the
following options:
• Run the code with the option -log_view to print a performance summary for various phases of the
code.
• Run the code with the option -log_mpe [logfilename], which creates a logfile of events suitable
for viewing with Jumpshot (part of MPICH).
Then, focus on the sections where most of the time is spent. If you provided your own callback routines,
e.g. for residual evaluations, search the profiling output for routines such as SNESFunctionEval or
SNESJacobianEval. If their relative time is significant (say, more than 30 percent), consider optimizing these
routines first. Generic instructions on how to optimize your callback functions are difficult; you may start
by reading performance optimization guides for your system’s hardware.
Aggregation
Performing operations on chunks of data rather than a single element at a time can significantly enhance
performance because of cache reuse or lower data motion. Typical examples are:
• Insert several (many) elements of a matrix or vector at once, rather than looping and inserting a single
value at a time (a sketch of inserting a block of values in one call appears below). In order to access elements of a vector repeatedly, employ VecGetArray() to allow
direct manipulation of the vector elements.
• When possible, use VecMDot() rather than a series of calls to VecDot().
• If you require a sequence of matrix-vector products with the same matrix, consider packing your vectors
into a single matrix and using matrix-matrix products.
• Users should employ a reasonable number of PetscMalloc() calls in their codes. Hundreds or
thousands of memory allocations may be appropriate; however, if tens of thousands are being used,
then reducing the number of PetscMalloc() calls may be warranted. For example, reusing space
or allocating large chunks and dividing it into pieces can produce a significant savings in allocation
overhead. Data Structure Reuse gives details.
Aggressive aggregation of data may result in inflexible data structures and code that is hard to maintain. We
advise users to keep these competing goals in mind and not blindly optimize for performance only.
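As an illustration of the first point in the list above, the following sketch inserts a 3x3 block of values into an (assumed, already preallocated) matrix A with a single call rather than nine separate ones; the index and value choices are purely illustrative:
PetscInt    rows[3] = {10, 11, 12};
PetscInt    cols[3] = {10, 11, 12};
PetscScalar vals[9] = {4, -1, 0, -1, 4, -1, 0, -1, 4}; /* 3x3 block, stored by row */
PetscCall(MatSetValues(A, 3, rows, 3, cols, vals, INSERT_VALUES));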
When symbolically factoring an AIJ matrix, PETSc has to guess how much fill there will be. Careful
use of the fill parameter in the MatFactorInfo structure when calling MatLUFactorSymbolic() or
MatILUFactorSymbolic() can reduce greatly the number of mallocs and copies required, and thus greatly
improve the performance of the factorization. One way to determine a good value for the fill parameter is to
run a program with the option -info. The symbolic factorization phase will then print information about the
actual fill and the number of mallocs and copies performed; for example, such output may indicate that the
user should have used a fill estimate factor of about 2.17 (instead of 1) to prevent
the 12 required mallocs and copies. The command line option
-pc_factor_fill 2.17
will cause PETSc to preallocate the correct amount of space for the factorization.
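The same fill estimate can be supplied from source code by setting the fill field of MatFactorInfo before the
symbolic factorization; a minimal sketch, assuming the factor matrix F, the original matrix A, and the
permutations rowperm and colperm already exist:

MatFactorInfo info;
PetscCall(MatFactorInfoInitialize(&info));
info.fill = 2.17; /* expected ratio of nonzeros in the factor to nonzeros in A */
PetscCall(MatLUFactorSymbolic(F, A, rowperm, colperm, &info));
PetscCall(MatLUFactorNumeric(F, A, &info));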
PETSc provides a number of tools to aid in detection of problems with memory allocation, including leaks
and use of uninitialized space. We briefly describe these below.
• The PETSc memory allocation (which collects statistics and performs error checking), is employed
by default for codes compiled in a debug-mode (configured with --with-debugging=1). PETSc
memory allocation can be activated for optimized-mode (configured with --with-debugging=0)
using the option -malloc_debug. The option -malloc_debug 0 forces the use of conventional
memory allocation when debugging is enabled. When running timing tests, one should build libraries
in optimized mode.
• When the PETSc memory allocation routines are used, the option -malloc_dump will print a list of
unfreed memory at the conclusion of a program. If all memory has been freed, only a message stating the
maximum allocated space will be printed. However, if some memory remains unfreed, this information
will be printed. Note that the option -malloc_dump activates a call to PetscMallocDump() during
PetscFinalize(); the user can also call PetscMallocDump() elsewhere in a program.
• Another useful option for use with PETSc memory allocation routines is -malloc_view, which activates
logging of all calls to malloc and reports memory usage, including all Fortran arrays. This option
provides a more complete picture than -malloc_dump for codes that employ Fortran with hardwired
arrays. The option -malloc_view activates logging by calling PetscMallocViewSet() in
PetscInitialize() and then prints the log by calling PetscMallocView() in PetscFinalize().
The user can also call these routines elsewhere in a program. When finer granularity is desired,
the user should call PetscMallocGetCurrentUsage() and PetscMallocGetMaximumUsage()
for memory allocated by PETSc, or PetscMemoryGetCurrentUsage() and
PetscMemoryGetMaximumUsage() for the total memory used by the program (a small sketch follows
this list). Note that PetscMemorySetGetMaximumUsage() must be called before
PetscMemoryGetMaximumUsage() (typically at the beginning of the program).
• When running with -log_view the additional option -log_view_memory causes the display of additional
columns of information about how much memory was allocated and freed during each logged event.
This is useful to understand what phases of a computation require the most memory.
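A minimal sketch of querying memory usage at a chosen point in a program (the print statement is only
illustrative):

PetscLogDouble mal, mem;
PetscCall(PetscMallocGetCurrentUsage(&mal)); /* bytes currently allocated through PetscMalloc() */
PetscCall(PetscMemoryGetCurrentUsage(&mem)); /* total bytes currently used by the process */
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "PetscMalloc: %g bytes, process: %g bytes\n", (double)mal, (double)mem));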
Data Structure Reuse
Data structures should be reused whenever possible. For example, if a code often creates new matrices or
vectors, there often may be a way to reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero pattern. If a code creates thousands of
matrix or vector objects, performance will be degraded. For example, when solving a nonlinear problem or
timestepping, reusing the matrices and their nonzero structure for many steps when appropriate can make
the code run significantly faster.
A simple technique for saving work vectors, matrices, etc. is employing a user-defined context. In C and
C++ such a context is merely a structure in which various objects can be stashed; in Fortran a user context
can be an integer array that contains both parameters and pointers to PETSc objects. See SNES Tutorial ex5
and SNES Tutorial ex5f90 for examples of user-defined application contexts in C and Fortran, respectively.
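A minimal sketch of such a context in C (the field names and the surrounding objects snes, r, and the callback
FormFunction() are illustrative, not part of the PETSc API):

typedef struct {
  PetscReal param;     /* a problem parameter */
  Vec       localwork; /* a reusable work vector */
  Mat       Jprec;     /* a reusable preconditioning matrix */
} AppCtx;

AppCtx user;
/* pass the context to the callbacks, e.g. a SNES residual routine */
PetscCall(SNESSetFunction(snes, r, FormFunction, &user));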
Numerical Experiments
PETSc users should run a variety of tests. For example, there are a large number of options for the linear and
nonlinear equation solvers in PETSc, and different choices can make a very big difference in convergence rates
and execution times. PETSc employs defaults that are generally reasonable for a wide range of problems,
but clearly these defaults cannot be best for all cases. Users should experiment with many combinations to
determine what is best for a given problem and customize the solvers accordingly.
• Use the options -snes_view, -ksp_view, etc. (or the routines KSPView(), SNESView(), etc.)
to view the options that have been used for a particular solver.
• Run the code with the option -help for a list of the available runtime commands.
• Use the option -info to print details about the solvers’ operation.
• Use the PETSc monitoring discussed in Profiling to evaluate the performance of various numerical
methods.
As discussed in KSP: Linear System Solvers, the default linear solvers are
• uniprocess: GMRES(30) with ILU(0) preconditioning
• multiprocess: GMRES(30) with block Jacobi preconditioning, where there is 1 block per process, and
each block is solved with ILU(0)
One should experiment to determine alternatives that may be better for various applications. Recall that
one can specify the KSP methods and preconditioners at runtime via the options -ksp_type <method> and
-pc_type <preconditioner>.
One can also specify a variety of runtime customizations for the solvers, as discussed throughout the manual.
In particular, note that the default restart parameter for GMRES is 30, which may be too small for some
large-scale problems. One can alter this parameter with the option -ksp_gmres_restart <restart>
or by calling KSPGMRESSetRestart(). Krylov Methods gives information on setting alternative GMRES
orthogonalization routines, which may provide much better parallel performance.
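For example, a minimal sketch of increasing the restart length from source code (ksp is assumed to be an
existing KSP object):

PetscCall(KSPSetType(ksp, KSPGMRES));
PetscCall(KSPGMRESSetRestart(ksp, 100)); /* GMRES(100) instead of the default GMRES(30) */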
For elliptic problems one often obtains good performance and scalability with multigrid solvers. Consult
Algebraic Multigrid (AMG) Preconditioners for available options. Our experience is that GAMG works
particularly well for elasticity problems, whereas hypre does well for scalar problems.
System-Related Problems
The performance of a code can be affected by a variety of factors, including the cache behavior, other users
on the machine, etc. Below we briefly describe some common problems and possibilities for overcoming
them.
• Problem too large for physical memory size: When timing a program, one should always leave
at least a ten percent margin between the total memory a process is using and the physical size of the
machine’s memory. One way to estimate the amount of memory used by given process is with the Unix
getrusage system routine. The PETSc option -malloc_view reports all memory usage, including
any Fortran arrays in an application code.
• Effects of other users: If other users are running jobs on the same physical processor nodes on which
a program is being profiled, the timing results are essentially meaningless.
• Overhead of timing routines on certain machines: On certain machines, even calling the system
clock in order to time routines is slow; this skews all of the flop rates and timing results. The file
$PETSC_DIR/src/benchmarks/PetscTime.c (source) contains a simple test problem that will
approximate the amount of time required to get the current time in a running program. On good
systems it will be on the order of 10^-6 seconds or less.
• Problem too large for good cache performance: Certain machines with lower memory band-
widths (slow memory access) attempt to compensate by having a very large cache. Thus, if a significant
portion of an application fits within the cache, the program will achieve very good performance; if the
code is too large, the performance can degrade markedly. To analyze whether this situation affects a
particular code, one can try plotting the total flop rate as a function of problem size. If the flop rate
decreases rapidly at some point, then the problem may likely be too large for the cache size.
• Inconsistent timings: Inconsistent timings are likely due to other users on the machine, thrashing
(using more virtual memory than available physical memory), or paging in of the initial executable.
Accurate Profiling and Paging Overheads provides information on overcoming paging overhead when
profiling a code. We have found on all systems that if you follow all the advice above, your timings will
be consistent to within a variation of less than five percent.
The Use of BLAS and LAPACK in PETSc and external libraries
PETSc and the external libraries it uses rely on BLAS and LAPACK for several classes of operations:
1. BLAS 1 operations (and GPU equivalents) - vector operations such as VecNorm(), VecAXPY(), and
VecScale() are used extensively in PETSc. Depending on the simulation the size of the vectors may
be from hundreds of entries to many millions.
2. BLAS 2 operations - dense matrix-vector operations; generally the dense matrices are very small.
3. Eigenvalue and SVD computations, generally for very small matrices
4. External packages such as MUMPS and SuperLU_DIST use BLAS 3 operations (and possibly BLAS
1 and 2). The dense matrices may be of modest size, going up to thousands of rows and columns.
For most PETSc simulations (that is not using certain external packages) using an optimized set of
BLAS/LAPACK routines only provides a modest improvement in performance. For some external pack-
ages using optimized BLAS/LAPACK can make a dramatic improvement in performance.
BLAS/LAPACK libraries may use 32 or 64-bit integers. PETSc's configure and compile handle this automatically
so long as the arguments to the BLAS/LAPACK routines are set to the type PetscBLASInt. The
routine PetscBLASIntCast(PetscInt, PetscBLASInt *) casts a PetscInt to the BLAS/LAPACK
size. If the BLAS/LAPACK size is not large enough, it generates an error. For the vast majority of
simulations, even very large ones, 64-bit BLAS/LAPACK integers are not needed, even when 64-bit PETSc
integers are used. The configure option --with-64-bit-blas-indices attempts to locate and use a
64-bit integer version of BLAS/LAPACK library. Except for MKL CPardiso, the external packages do
not support using 64-bit BLAS/LAPACK integers so if you are using such packages you cannot use 64-bit
BLAS/LAPACK integers.
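A minimal sketch of the cast (n is assumed to be an existing PetscInt, for example a vector length being
passed to a BLAS routine):

PetscBLASInt bn;
PetscCall(PetscBLASIntCast(n, &bn)); /* generates an error if n does not fit in a PetscBLASInt */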
Some BLAS/LAPACK libraries can make use of shared memory parallelism within the function calls, gen-
erally using OpenMP, or possibly PThreads. If this feature is turned on, it is in addition to the MPI based
parallelism that PETSc is using. Thus it can result in over-subscription of hardware resources. For example,
if a system has 16 cores and PETSc is run with an MPI size of 16 then each core is assigned an MPI process.
But if the BLAS/LAPACK is running with OpenMP and 4 threads per process this results in 64 threads
competing to use 16 cores which will perform poorly.
If one elects to use both MPI parallelism and shared memory BLAS/LAPACK parallelism one should ensure
they do not over subscribe the hardware resources. Since PETSc does not natively use OpenMP this means
that phases of the computation that do not use BLAS/LAPACK will be under-subscribed, thus under-
utilizing the system. For PETSc simulations which do not use external packages there is generally no benefit
to using parallel BLAS/LAPACK. The environmental variable OMP_NUM_THREADS can be used to set the
number of threads used by each MPI process for its shared memory parallel BLAS/LAPACK. The additional
environmental variables OMP_PROC_BIND and OMP_PLACES may also need to be set appropriately for the
system to obtain good parallel performance with BLAS/LAPACK. The configure option --with-openmp
will trigger PETSc to try to locate and use a parallel BLAS/LAPACK library.
Certain external packages such as MUMPS may benefit from using parallel BLAS/LAPACK operations. See
the manual page MATSOLVERMUMPS for details on how one can restrict the number of MPI processes while
running MUMPS to utilize parallel BLAS/LAPACK.
Most systems (besides Microsoft Windows) come with pre-installed BLAS/LAPACK which are satisfactory
for many PETSc simulations.
The freely available Intel MKL mathematics libraries provide BLAS/LAPACK that are generally better
performing than the system provided libraries and are generally fine for most users.
For systems that do not provide BLAS/LAPACK, such as Microsoft Windows, PETSc pro-
vides the Fortran reference version --download-fblaslapack and an f2c-generated C version
--download-f2cblaslapack (which also supports 128 bit real number computations). These libraries
are less optimized but useful to get started with PETSc easily.
PETSc also provides access to OpenBLAS via the --download-openblas configure option. OpenBLAS
uses some highly optimized operations but falls back on reference routines for many other operations. See
the OpenBLAS manual for more information. The configure option --download-openblas provides a
full BLAS/LAPACK implementation.
BLIS does not bundle LAPACK with it, so PETSc's configure attempts to locate a compatible system
LAPACK library to use if --download-blis is selected. One can also use --download-f2cblaslapack
together with --download-blis to supply the LAPACK routines.
Users who wish to employ PETSc on only a subset of MPI processes within a larger parallel job, or who wish
to use a “manager” process to coordinate the work of “worker” PETSc processes, should specify an alternative
communicator for PETSC_COMM_WORLD by directly setting its value, for example to an existing MPI
communicator comm; this must be done after the user initializes MPI and before calling PetscInitialize().
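A minimal sketch of this pattern; the use of MPI_Comm_split() to build comm is only illustrative:

#include <petscsys.h>
int main(int argc, char **argv)
{
  MPI_Comm comm;
  MPI_Init(&argc, &argv);
  /* build the communicator containing the ranks that will run PETSc */
  MPI_Comm_split(MPI_COMM_WORLD, 0, 0, &comm);
  PETSC_COMM_WORLD = comm; /* must be assigned before PetscInitialize() */
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* ... use PETSc as usual ... */
  PetscCall(PetscFinalize());
  MPI_Comm_free(&comm);
  MPI_Finalize();
  return 0;
}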
Allowing the user to modify parameters and options easily at runtime is very desirable for many applications.
PETSc provides a simple mechanism to enable such customization. To print a list of available options for a
given program, simply specify the option -help at runtime, e.g.,
Note that all runtime options correspond to particular PETSc routines that can be explicitly called from
within a program to set compile-time defaults. For many applications it is natural to use a combination of
compile-time and runtime choices. For example, when solving a linear system, one could explicitly specify
use of the Krylov subspace solver BiCGStab by calling
KSPSetType(ksp, KSPBCGS);
One could then override this choice at runtime with the option
-ksp_type tfqmr
to select the Transpose-Free QMR algorithm. (See KSP: Linear System Solvers for details.)
The remainder of this section discusses details of runtime options.
Each PETSc process maintains a database of option names and values (stored as text strings). This database
is generated with the command PetscInitialize(), whose C/C++ form is listed below (the Fortran
variant takes an optional options-file name and an error code argument):
PetscInitialize(int *argc, char ***args, const char *file, const char *help); // C
The arguments argc and args (in the C/C++ version only) are the addresses of the usual command line
arguments, while file is the name of an optional file that can contain additional options. By default this
file is called .petscrc in the user’s home directory. The user can also specify options via the environmental
variable PETSC_OPTIONS. The options are processed in the following order:
1. file
2. environmental variable
3. command line
Thus, the command line options supersede the environmental variable options, which in turn supersede the
options file.
The file format for specifying options is
-optionname possible_value
-anotheroptionname possible_value
...
All of the option names must begin with a dash (-) and have no intervening spaces. Note that the option
values cannot have intervening spaces either, and tab characters cannot be used between the option names
and values. For uniformity throughout PETSc, we employ the format -[prefix_]package_option (for
instance, -ksp_type, -mat_view ::info, or -mg_levels_ksp_type).
Users can specify an alias for any option name (to avoid typing the sometimes lengthy default name) by
adding an alias to the .petscrc file in the format
alias -newname -oldname
For example,
alias -kspt -ksp_type
alias -sd -start_in_debugger
Comments can be placed in the .petscrc file by using # in the first column of a line.
Options Prefixes
Options prefixes allow specific objects to be controlled from the options database. For instance, PCMG
gives prefixes to its nested KSP objects; one may control the coarse grid solver by adding the mg_coarse
prefix, for example -mg_coarse_ksp_type preonly. One may also use KSPSetOptionsPrefix(),
DMSetOptionsPrefix(), SNESSetOptionsPrefix(), TSSetOptionsPrefix(), and similar functions
to assign custom prefixes, useful for applications with multiple or nested solvers.
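For example, a minimal sketch of assigning a custom prefix to a solver (ksp is assumed to be an existing KSP
object, and the prefix sub1_ is purely illustrative):

PetscCall(KSPSetOptionsPrefix(ksp, "sub1_"));
PetscCall(KSPSetFromOptions(ksp));
/* this solver can now be controlled with, e.g., -sub1_ksp_type gmres -sub1_ksp_rtol 1e-8 */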
PETSc can load additional options from a file using PetscOptionsInsertFile(), which can also be
used from the command line, e.g. -options_file my_options.opts.
One can also use YAML files with PetscOptionsInsertFileYAML(). For example, the following file:
$$: ignored
$$tail: ignored
$$ans: &ans 42
$$eu: &eu 2.72
$$pi: &pi 3.14
opt:
  bool: true
  int: *ans
  seq1: {
    bool: *seq-bool,
    int: *seq-int,
    real: *seq-real,
    str: *seq-str,
  }
  seq2:
    bool:
      - true
      - false
    int:
      - 123
      - 456
      - 789
    real:
      - *pi
      - *eu
    str:
      - rst
      - uvw
      - xyz
    map:
      - key0: 0
      - key1: 1
      - key2: 2
      - $$: ignored
      - $$tail: ignored
With -options_file, PETSc will parse the file as YAML if it ends in a standard YAML or JSON4
extension or if one uses a :yaml postfix, e.g. -options_file my_options.yaml or -options_file
my_options.txt:yaml
PETSc will also check the first line of the options file itself and parse the file as YAML if it matches certain
criteria, for example:
%YAML 1.2
---
name: value
and
---
name: value
User-Defined PetscOptions
Any subroutine in a PETSc program can add entries to the database with the command
PetscOptionsSetValue(PetscOptions options, char *name, char *value);
though this is rarely done. To locate options in the database, one should use the commands
PetscOptionsHasName(PetscOptions options, char *pre, char *name, PetscBool *flg);
PetscOptionsGetInt(PetscOptions options, char *pre, char *name, PetscInt *value, PetscBool *flg);
All of these routines set flg=PETSC_TRUE if the corresponding option was found, flg=PETSC_FALSE
if it was not found. The optional argument pre indicates that the true name of the option is the given
name (with the dash “-” removed) prepended by the prefix pre. Usually pre should be set to NULL
(or PETSC_NULL_CHARACTER for Fortran); its purpose is to allow someone to rename all the options in
a package without knowing the names of the individual options. For example, when using block Jacobi
preconditioning, the KSP and PC methods used on the individual blocks can be controlled via the options
-sub_ksp_type and -sub_pc_type.
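A minimal sketch of querying the database from application code (the option name -my_levels and its default
value are illustrative):

PetscInt  levels = 3; /* value used if the option is not given */
PetscBool flg;
PetscCall(PetscOptionsGetInt(NULL, NULL, "-my_levels", &levels, &flg));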
4 JSON is a subset of YAML
One useful means of keeping track of user-specified runtime options is use of -options_view, which prints
to stdout during PetscFinalize() a table of all runtime options that the user has specified. A related
option is -options_left, which prints the options table and indicates any options that have not been
requested upon a call to PetscFinalize(). This feature is useful to check whether an option has been
activated for a particular PETSc object (such as a solver or matrix format), or whether an option name may
have been accidentally misspelled.
PETSc employs a consistent scheme for examining, printing, and saving objects through commands of the
form
XXXView(XXX obj, PetscViewer viewer);
Here obj is a PETSc object of type XXX, where XXX is Mat, Vec, SNES, etc. There are several predefined
viewers.
• Passing in a zero (0) for the viewer causes the object to be printed to the screen; this is useful when
viewing an object in a debugger but should be avoided in source code.
• PETSC_VIEWER_STDOUT_SELF and PETSC_VIEWER_STDOUT_WORLD causes the object to be
printed to the screen.
• PETSC_VIEWER_DRAW_SELF PETSC_VIEWER_DRAW_WORLD causes the object to be drawn in a de-
fault X window.
• Passing in a viewer obtained by PetscViewerDrawOpen() causes the object to be displayed graph-
ically. See Graphics for more on PETSc’s graphics support.
• To save an object to a file in ASCII format, the user creates the viewer object with the command
PetscViewerASCIIOpen(MPI_Comm comm, char* file, PetscViewer *viewer). This
object is analogous to PETSC_VIEWER_STDOUT_SELF (for a communicator of MPI_COMM_SELF) and
PETSC_VIEWER_STDOUT_WORLD (for a parallel communicator).
• To save an object to a file in binary format, the user creates the viewer object with the command
PetscViewerBinaryOpen(MPI_Comm comm, char *file, PetscFileMode mode, PetscViewer
*viewer). Details of binary I/O are discussed below.
• Vector and matrix objects can be passed to a running MATLAB process with a viewer created by
PetscViewerSocketOpen(MPI_Comm comm, char *machine, int port, PetscViewer *viewer).
See Sending Data to an Interactive MATLAB Session.
The user can control the format of ASCII printed objects with viewers created by
PetscViewerASCIIOpen() by calling PetscViewerPushFormat() (and restoring the previous format with
PetscViewerPopFormat()).
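For example, a minimal sketch that writes an assembled matrix A in MATLAB-readable ASCII form:

PetscViewer viewer;
PetscCall(PetscViewerASCIIOpen(PETSC_COMM_WORLD, "matrix.m", &viewer));
PetscCall(PetscViewerPushFormat(viewer, PETSC_VIEWER_ASCII_MATLAB));
PetscCall(MatView(A, viewer));
PetscCall(PetscViewerPopFormat(viewer));
PetscCall(PetscViewerDestroy(&viewer));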
Loading routines such as MatLoad() and VecLoad() generate parallel matrices and vectors if the viewer's
communicator has more than one process.
The particular matrix and vector formats are determined from the options database; see the manual pages
for details.
One can provide additional information about matrix data for matrices stored on disk by providing an
optional file matrixfilename.info, where matrixfilename is the name of the file containing the
matrix. The format of the optional file is the same as the .petscrc file and can (currently) contain the
following:
-matload_block_size <bs>
The block size indicates the size of blocks to use if the matrix is read into a block oriented data structure
(for example, MATMPIBAIJ). The diagonal information s1,s2,s3,... indicates which (block) diagonals
in the matrix have nonzero values. The info file is automatically created when VecView() or MatView()
is used with a binary viewer; hence if you save a matrix with a given block size with MatView(), then a
MatLoad() on that file will automatically use the saved block size.
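A minimal sketch of the save/load cycle with a binary viewer (A is an assembled matrix; the file name is
illustrative):

PetscViewer v;
PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat", FILE_MODE_WRITE, &v));
PetscCall(MatView(A, v));
PetscCall(PetscViewerDestroy(&v));

Mat B;
PetscCall(MatCreate(PETSC_COMM_WORLD, &B));
PetscCall(MatSetType(B, MATAIJ));
PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat", FILE_MODE_READ, &v));
PetscCall(MatLoad(B, v));
PetscCall(PetscViewerDestroy(&v));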
Command-line options provide a particularly convenient way to view PETSc objects. All options of the form
-xxx_view accept colon(:)-separated compound arguments which specify a viewer type, format, and/or
destination (e.g. file name or socket) if appropriate. For example, to quickly export a binary file contain-
ing a matrix, one may use -mat_view binary:matrix.out, or to output to a MATLAB-compatible
ASCII file, one may use -mat_view ascii:matrix.m:ascii_matlab. See the
PetscOptionsGetViewer() man page for full details, as well as the XXXViewFromOptions() man pages (for instance,
PetscDrawSetFromOptions()) for many other convenient command-line options.
The PetscViewer format PETSC_VIEWER_LOAD_BALANCE will cause certain objects to display simple
measures of their imbalance. For example, viewing a matrix with this format may display output indicating
that one process has 162 nonzero entries in the matrix, the average number of nonzeros per process
is 168, and the maximum number of nonzeros is 174. Similarly, for vectors one can see the load balancing
by viewing them with the same format.
The measurements of load balancing can also be done within the program with calls to the appropriate
object viewer with the viewer format PETSC_VIEWER_LOAD_BALANCE.
The Scientific Application Web server, SAWs1 , allows one to monitor running PETSc applications from a
browser. To use SAWs you must configure PETSc with the option --download-saws. Options to use
SAWs include
• -saws_options - allows setting values in the PETSc options database via the browser (works only
on one process).
• -stack_view saws - allows monitoring the current stack frame that PETSc is in; refresh to see the
new location.
• -snes_monitor_saws, -ksp_monitor_saws - monitor the solvers’ iterations from the web
browser.
For each of these you need to point your browser to http://hostname:8080, for example
http://localhost:8080. Options that control the behavior of SAWs include
• -saws_log filename - log all SAWs actions in a file.
• -saws_https certfile - use HTTPS instead of HTTP with a certificate.
• -saws_port_auto_select - have SAWs pick a port number instead of using 8080.
• -saws_port port - use port instead of 8080.
• -saws_root rootdirectory - local directory to which the SAWs browser will have read access.
• -saws_local - use the local file system to obtain the SAWs javascript files (they must be in
rootdirectory/js).
Also see the manual pages for PetscSAWsBlock(), PetscObjectSAWsTakeAccess(),
PetscObjectSAWsGrantAccess(), PetscObjectSAWsSetBlock(), PetscStackSAWsGrantAccess(),
PetscStackSAWsTakeAccess(), KSPMonitorSAWs(), and SNESMonitorSAWs().
4.6.5 Debugging
PETSc programs may be debugged using one of the two options below.
• -start_in_debugger [noxterm,dbx,xxgdb,xdb,xldb,lldb] [-display name] - start all
processes in debugger
• -on_error_attach_debugger [noxterm,dbx,xxgdb,xdb,xldb,lldb] [-display name]
- start debugger only on encountering an error
Note that, in general, debugging MPI programs cannot be done in the usual manner of starting the program
in the debugger (because then it cannot set up the MPI communication and remote processes).
By default, on Linux systems the GNU debugger gdb is used; on macOS systems lldb is used.
By default, the debugger will be started in a new xterm (Apple Terminal on macOS), to enable running
separate debuggers on each process, unless the option noxterm is used. In order to handle the MPI
startup phase, the debugger command cont should be used to continue execution of the program within the
debugger. Rerunning the program through the debugger requires terminating the first job and restarting the
processor(s); the usual run option in the debugger will not correctly handle the MPI startup and should not
1 Saws wiki on Bitbucket
be used. Not all debuggers work on all machines, so the user may have to experiment to find one that works
correctly.
You can select a subset of the processes to be debugged (the rest just run without the debugger) with the
option
-debugger_ranks rank1,rank2,...
where you simply list the ranks you want the debugger to run with.
Errors are handled through the routine PetscError(). This routine checks a stack of error handlers and
calls the one on the top. If the stack is empty, it selects PetscTraceBackErrorHandler(), which tries
to print a traceback. A new error handler can be put on the stack with PetscPushErrorHandler().
The arguments to HandlerFunction() are the line number where the error occurred, the file in
which the error was detected, the corresponding directory, the error message, the error integer, and the
HandlerContext. The routine PetscPopErrorHandler() removes the most recently pushed error handler
from the stack. Two alternative error handlers provided by PETSc are PetscAbortErrorHandler(),
which aborts immediately when an error is detected, and PetscAttachDebuggerErrorHandler(),
which attaches a debugger when an error is detected. The sequence of PETSc routines called during
execution can be traced with the option
-log_trace [filename]
where filename is optional. By default the traces are printed to the screen. This can also be set with the
command PetscLogTraceBegin(FILE*).
It is also possible to control the trapping of floating-point exceptions: the command
PetscFPTrapPush(PetscFPTrap flag);
can be used within a program. A flag of PETSC_FP_TRAP_ON indicates that floating-point exceptions
should be trapped, while a value of PETSC_FP_TRAP_OFF (the default) indicates that they should be
ignored.
PetscFPTrapPop(void);
should be used to revert to the previous handling of floating point exceptions before the call to PetscFP-
TrapPush().
A small set of macros is used to make the error handling lightweight. These macros are used throughout
the PETSc libraries and can be employed by the application programmer as well. When an error is first
detected, one should set it by calling SETERRQ().
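A minimal sketch of raising an error this way (the routine and the validated argument n are illustrative):

PetscErrorCode CheckSize(PetscInt n)
{
  PetscFunctionBeginUser;
  if (n < 0) SETERRQ(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE, "n must be nonnegative, got %" PetscInt_FMT, n);
  PetscFunctionReturn(PETSC_SUCCESS);
}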
The user should check the return codes for all PETSc routines (and possibly user-defined routines as well)
with
PetscCall(PetscRoutine(...));
PetscCall(PetscMalloc1(n, &ptr));
If this procedure is followed throughout all of the user’s libraries and codes, any error will by default generate
a clean traceback of the location of the error.
Note that the macro PETSC_FUNCTION_NAME is used to keep track of routine names during error tracebacks.
Users need not worry about this macro in their application codes; however, users can take advantage of
this feature if desired by setting this macro before each user-defined routine that may call SETERRQ(),
PetscCall(). A simple example of usage is given below.
PetscErrorCode MyRoutine1()
{
/* Declarations Here */
PetscFunctionBeginUser;
/* code here */
PetscFunctionReturn(PETSC_SUCCESS);
}
4.6.7 Numbers
PETSc supports the use of complex numbers in application programs written in C, C++, and Fortran. To do
so, we employ either the C99 complex type or the C++ versions of the PETSc libraries in which the basic
“scalar” datatype, given in PETSc codes by PetscScalar, is defined as complex (or complex<double>
for machines using templated complex class libraries). To work with complex numbers, the user should run
configure with the additional option --with-scalar-type=complex. The installation instructions
provide detailed instructions for installing PETSc. You can use --with-clanguage=c (the default) to
use the C99 complex numbers or --with-clanguage=c++ to use the C++ complex type2 .
Recall that each configuration of the PETSc libraries is stored in a different directory, given by $PETSC_DIR/
$PETSC_ARCH according to the architecture. Thus, the libraries for complex numbers are maintained
2 Note that this option is not required to use PETSc with C++
separately from those for real numbers. When using any of the complex numbers versions of PETSc, all
vector and matrix elements are treated as complex, even if their imaginary components are zero. Of course,
one can elect to use only the real parts of the complex numbers when using the complex versions of the
PETSc libraries; however, when working only with real numbers in a code, one should use a version of
PETSc for real numbers for best efficiency.
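For example, a minimal sketch of manipulating complex scalars in a complex build of PETSc:

PetscScalar a  = 1.0 + 2.0 * PETSC_i;      /* PETSC_i is the imaginary unit */
PetscReal   re = PetscRealPart(a);
PetscReal   im = PetscImaginaryPart(a);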
The program KSP Tutorial ex11 solves a linear system with a complex coefficient matrix. Its Fortran
counterpart is KSP Tutorial ex11f.
When used in a message-passing environment, all communication within PETSc is done through MPI, the
message-passing interface standard [For94]. Any file that includes petscsys.h (or any other PETSc include
file) can freely use any MPI routine.
4.6.9 Graphics
The PETSc graphics library is not intended to compete with high-quality graphics packages. Instead, it is
intended to be easy to use interactively with PETSc programs. We urge users to generate their publication-
quality graphics using a professional graphics package. If a user wants to hook certain packages into PETSc,
he or she should send a message to [email protected]; we will see whether it is reasonable to try to
provide direct interfaces.
Windows as PetscViewers
For drawing predefined PETSc objects such as matrices and vectors, one may first create a viewer using the
command
PetscViewerDrawOpen(MPI_Comm comm, char *display, char *title, int x, int y, int w, int h, PetscViewer *viewer);
This viewer may be passed to any of the XXXView() routines. Alternately, one may use command-line
options to quickly specify viewer formats, including PetscDraw-based ones; see Viewing From Options.
To draw directly into the viewer, one must obtain the PetscDraw object with the command
PetscViewerDrawGetDraw().
Then one can call any of the PetscDrawXXX() commands on the draw object. If one obtains the draw
object in this manner, one does not call the PetscDrawOpenX() command discussed below.
Predefined viewers, PETSC_VIEWER_DRAW_WORLD and PETSC_VIEWER_DRAW_SELF, may be used at any
time. Their initial use will cause the appropriate window to be created.
Implementations using OpenGL, TikZ, and other formats may be selected with PetscDrawSetType().
PETSc can also produce movies; see PetscDrawSetSaveMovie(), and note that command-line options
can also be convenient; see the PetscDrawSetFromOptions() man page.
By default, PETSc drawing tools employ a private colormap, which remedies the problem of poor color
choices for contour plots due to an external program’s mangling of the colormap. Unfortunately, this may
cause flashing of colors as the mouse is moved between the PETSc windows and other windows. Alternatively,
a shared colormap can be used via the option -draw_x_shared_colormap.
Simple PetscDrawing
With the default format, one can open a window that is not associated with a viewer directly under the X11
Window System with the command
PetscDrawCreate(MPI_Comm comm, char *display, char *title, int x, int y, int w, int h, PetscDraw *win);
PetscDrawSetFromOptions(win);
All drawing routines are performed relative to the window’s coordinate system and viewport. By default, the
drawing coordinates are from (0,0) to (1,1), where (0,0) indicates the lower left corner of the window.
The application program can change the window coordinates with the command
PetscDrawSetCoordinates(PetscDraw win, PetscReal xl, PetscReal yl, PetscReal xr, PetscReal yr).
By default, graphics will be drawn in the entire window. To restrict the drawing to a portion of the window,
one may use the command PetscDrawSetViewPort(PetscDraw win, PetscReal xl, PetscReal yl, PetscReal xr, PetscReal yr).
These arguments, which indicate the fraction of the window in which the drawing should be done, must
satisfy 0 ≤ xl ≤ xr ≤ 1 and 0 ≤ yl ≤ yr ≤ 1.
To draw a line, one uses the command
PetscDrawLine(PetscDraw win, PetscReal xl, PetscReal yl, PetscReal xr, PetscReal yr, int cl);
The argument cl indicates the color (which is an integer between 0 and 255) of the line. A list of predefined
colors may be found in include/petscdraw.h and includes PETSC_DRAW_BLACK, PETSC_DRAW_RED,
PETSC_DRAW_BLUE etc.
To ensure that all graphics actually have been displayed, one should use the command
PetscDrawFlush(PetscDraw win);
When displaying by using double buffering, which is set with the command
PetscDrawSetDoubleBuffer(PetscDraw win);
all processes must call
PetscDrawFlush(PetscDraw win);
in order to swap the buffers. From the options database one may use -draw_pause n, which causes the
PETSc application to pause n seconds at each PetscDrawPause(). A time of -1 indicates that the
application should pause until receiving mouse input from the user.
Text can be drawn with the commands PetscDrawString(), PetscDrawStringCentered(),
PetscDrawStringBoxed(), and PetscDrawStringVertical(). The user can set the text font size or
determine it with the commands PetscDrawStringSetSize() and PetscDrawStringGetSize().
Line Graphs
PETSc includes a set of routines for manipulating simple two-dimensional graphs. These routines, which
begin with PetscDrawAxisDraw(), are usually not used directly by the application programmer. Instead,
the programmer employs the line graph routines to draw simple line graphs. As shown in the listing below,
line graphs are created with the command PetscDrawLGCreate(PetscDraw draw, PetscInt ncurves, PetscDrawLG *ctx).
The argument ncurves indicates how many curves are to be drawn. Points can be added to each of the
curves with the command PetscDrawLGAddPoint(PetscDrawLG ctx, PetscReal *x, PetscReal *y).
The arguments x and y are arrays containing the next point value for each curve. Several points for each
curve may be added at once with PetscDrawLGAddPoints(). The curves can be drawn (or redrawn after
points are added) with
PetscDrawLGDraw(PetscDrawLG ctx);
A line graph that is no longer needed can be destroyed with the command
PetscDrawLGDestroy(PetscDrawLG *ctx);
To plot new curves, one can reset a linegraph with the command
PetscDrawLGReset(PetscDrawLG ctx);
The line graph automatically determines the range of values to display on the two axes. The user can change
these defaults with the command PetscDrawLGSetLimits(PetscDrawLG ctx, PetscReal xmin, PetscReal xmax, PetscReal ymin, PetscReal ymax).
It is also possible to change the display of the axes and to label them. This procedure is done by first
obtaining the axes context with the command PetscDrawLGGetAxis(PetscDrawLG ctx, PetscDrawAxis *axis).
One can then set the axes' colors and labels, respectively, by using the commands PetscDrawAxisSetColors()
and PetscDrawAxisSetLabels().
It is possible to turn off all graphics with the option -nox. This will prevent any windows from being opened
or any drawing actions to be done. This is useful for running large jobs when the graphics overhead is too
large, or for timing.
The full example, Draw Test ex3, follows.
Listing: src/classes/draw/tests/ex3.c
#if defined(PETSC_APPLE_FRAMEWORK)
#import <PETSc/petscsys.h>
#import <PETSc/petscdraw.h>
#else
#include <petscsys.h>
#include <petscdraw.h>
#endif
PetscFunctionBeginUser;
PetscCall(PetscInitialize(&argc, &argv, NULL, help));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-x", &x, NULL));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-y", &y, NULL));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-width", &width, NULL));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-height", &height, NULL));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));
PetscCall(PetscOptionsGetInt(NULL, NULL, "-nports", &nports, &useports));
PetscCall(PetscOptionsHasName(NULL, NULL, "-nolegend", &flg));
if (flg) legend = NULL;
PetscCall(PetscOptionsHasName(NULL, NULL, "-notoplabel", &flg));
if (flg) toplabel = NULL;
PetscCall(PetscOptionsHasName(NULL, NULL, "-noxlabel", &flg));
if (flg) xlabel = NULL;
PetscCall(PetscOptionsHasName(NULL, NULL, "-noylabel", &flg));
if (flg) ylabel = NULL;
PetscCall(PetscOptionsHasName(NULL, NULL, "-nolabels", &flg));
if (flg) {
toplabel = NULL;
xlabel = NULL;
ylabel = NULL;
PetscCall(PetscDrawViewPortsDestroy(ports));
PetscCall(PetscDrawLGDestroy(&lg));
PetscCall(PetscDrawDestroy(&draw));
PetscCall(PetscFinalize());
return 0;
}
For both the linear and nonlinear solvers default routines allow one to graphically monitor convergence of
the iterative method. These are accessed via the command line with -ksp_monitor draw::draw_lg
and -snes_monitor draw::draw_lg. See also Convergence Monitoring for the linear solvers and
Convergence Monitoring for the nonlinear solvers.
To disable all X-window-based graphics, run configure with the additional option --with-x=0.
Many PETSc developers use Emacs, which can be used as a “simple” text editor or a comprehensive devel-
opment environment. For a more integrated development environment, we recommend using lsp-mode (or
eglot) with clangd. The most convenient way to teach clangd what compilation flags to use is to install Bear
(“build ear”) and run:
bear make -B
which will do a complete rebuild (-B) of PETSc and capture the compilation commands in a file named
compile_commands.json, which will be automatically picked up by clangd. You can use the same
procedure when building examples or your own project. It can also be used with any other editor that
supports clangd, including VS Code and Vim. When lsp-mode is accompanied by flycheck, Emacs will
provide real-time feedback and syntax checking, along with refactoring tools provided by clangd.
The easiest way to install packages in recent Emacs is to use the “Options” menu to select “Manage Emacs
Packages”.
Tags
It is sometimes useful to cross-reference tags across projects. Regardless of whether you use lsp-mode, it can
be useful to use GNU Global (install gtags) to provide reverse lookups (e.g. find all call sites for a given
function) across all projects you might work on/browse. Tags for PETSc can be generated by running make
allgtags from $PETSC_DIR, or one can generate tags for all projects by running a suitable gtags command
from your home directory or wherever you keep source code. If you are making large changes, it is useful to
either set this up to run as a cron job or to make a convenient alias so that refreshing is easy. Then add the
appropriate configuration to ~/.emacs to enable gtags and specify key bindings.
A more basic alternative to the GNU Global (gtags) approach that does not require adding packages is to
use the builtin etags feature. First, run make alletags from the PETSc home directory to generate the
file $PETSC_DIR/TAGS, and then from within Emacs, run
M-x visit-tags-table
where M denotes the Emacs Meta key, and enter the name of the TAGS file. Then the command M-. will
cause Emacs to find the file and line number where a desired PETSc function is defined. Any string in any
of the PETSc files can be found with the command M-x tags-search. To find repeated occurrences, one
can simply use M-, to find the next occurrence.
VS Code (unlike Visual Studio, described below) is an open source editor with a rich extension
ecosystem. It has excellent integration with clangd and will automatically pick up compile_commands.
json as produced by a command such as bear make -B (see Developer Environments). If you have no
prior attachment to a specific code editor, we recommend trying VS Code.
This section lists helpful Vim commands for PETSc. Ones that configure Vim can be placed in a .vimrc
file in the top of the PETSc directory and will be loaded automatically.
Vim has configurable keymaps: all of the “command mode” commands given that start with a colon (such
as :help) can be assigned to short sequences in “normal mode,” which is how most Vim users use their
most frequently used commands.
See the Developer Environments discussion above for configuration of clangd, which provides integrated
development environment features.
Tags
The tags feature can be used to search PETSc files quickly and efficiently. To use this feature, one should
first check if the file, $PETSC_DIR/CTAGS exists. If this file is not present, it should be generated by running
make alletags from the PETSc home directory. Once the file exists, from Vi/Vim the user should issue
the command
:set tags=CTAGS
from the $PETSC_DIR directory (or set tags to the full path of the CTAGS file). The command :tag functionname
will cause Vi/Vim to open the file and line number where a desired PETSc function is defined in the current
window. <Ctrl-o> will return the screen to your previous location.
The command :stag functionname will split the current window and then open the file and line number
for that function in one half. Some prefer this because it is easier to compare the file you are editing to the
function definition this way.
Vim can also use the cscope utility to navigate source code. One useful thing it can do that the basic
tags feature can’t is search for references to a symbol, rather than its definition, which can be useful for
refactoring. The command
:cs find s functionname
opens a list of all of the places the function is called in PETSc, and opens the file and line that you choose.
The variant :scs find s functionname does the same but splits the window like stag.
The PETSc makefile does not have a command for building a cscope database, but GNU Global is cross-
compatible with cscope: call make allgtags to make the gtags database, and run the commands
:set csprg=gtags-cscope
:cs add GTAGS
Quickfix
Rather than exiting the editor to build the library and check for errors or warnings, calling :make runs
the make command without leaving Vim and collects the errors and warnings in a “quickfix” window. Move
the cursor to one of the errors or warnings in the quickfix window and press <Enter> and the main window
will jump to the file and line with the error. The following commands filter out lines of PETSc's make output
that can clutter the quickfix window:
Autocompletion of long function names can be helpful when working with PETSc. If you have a tags file,
you can press <Ctrl-N> when you have partially typed a word to bring up a list of potential completions
that you can choose from with <Tab>.
More powerful autocompletion, such as completing the fieldname of a struct, is available from external
plugins that can be added to Vim, such as SuperTab, VimCompletesMe, or YouCompleteMe.
Along the same lines, plugins can be added that fill in the boilerplate associated with PETSc programming
with code snippets. One such tool is UltiSnips.
Several plugins provide the equivalent of emacs’ lsp-mode: YouCompleteMe, mentioned above, is one; another
popular one is ale. These can check for syntax errors, check for compilation errors in the background, and
provide sophisticated tools for refactoring. Like lsp-mode, they also rely on a compilation database, so bear
-- make -B should be used as well to generate the file compile_commands.json.
See online tutorials for additional Vi/Vim options.
If you are interested in developing code that uses PETSc from Eclipse or developing PETSc in Eclipse
and have knowledge of how to do indexing and build libraries in Eclipse, please contact us at
[email protected].
One way to index and build PETSc in Eclipse is as follows.
1. Open “File→Import→Git→Projects from Git”. In the next two panels, you can either add your existing
local repository or download PETSc from Bitbucket by providing the URL. Most Eclipse distributions
come with Git support. If not, install the EGit plugin. When importing the project, select the wizard
“Import as general project”.
2. Right-click on the project (or the “File” menu on top) and select “New → Convert to a C/C++ Project
(Adds C/C++ Nature)”. In the setting window, choose “C Project” and specify the project type as
“Shared Library”.
3. Right-click on the C project and open the “Properties” panel. Under “C/C++ Build → Builder
Settings”, set the Build directory to $PETSC_DIR and make sure “Generate Makefiles automatically”
is unselected. Under the section “C/C++ General→Paths and Symbols”, add the PETSc paths to
“Includes”.
$PETSC_DIR/include
$PETSC_DIR/$PETSC_ARCH/include
4. Configure PETSc normally outside Eclipse to generate a makefile and then build the project in Eclipse.
The source code will be parsed by Eclipse.
If you launch Eclipse from the Dock on Mac OS X, .bashrc will not be loaded (a known OS
X behavior, for security reasons). This will be a problem if you set the environment variables
$PETSC_DIR and $PETSC_ARCH in .bashrc. A solution which involves replacing the executable can
be found at /questions/829749/launch-mac-eclipse-with-environment-variables-set. Alternatively, you can add
$PETSC_DIR and $PETSC_ARCH manually under “Properties → C/C++ Build → Environment”.
To allow an Eclipse code to compile with the PETSc include files and link with the PETSc libraries, a PETSc
user has suggested the following.
1. Right-click on your C project and select “Properties → C/C++ Build → Settings”
2. A new window on the righthand side appears with various settings options. Select “Includes” and add
the required PETSc paths,
$PETSC_DIR/include
$PETSC_DIR/$PETSC_ARCH/include
3. Select “Libraries” under the header Linker, set the library search path:
$PETSC_DIR/$PETSC_ARCH/lib
and add the libraries to link:
m, petsc, stdc++, mpichxx, mpich, lapack, blas, gfortran, dl, rt, gcc_s, pthread, X11
Another PETSc user has provided the following steps to build an Eclipse index for PETSc that can be used
with their own code, without compiling PETSc source into their project.
1. In the user project source directory, create a symlink to the PETSC src/ directory.
2. Refresh the project explorer in Eclipse, so the new symlink is followed.
3. Right-click on the project in the project explorer, and choose “Index → Rebuild”. The index should
now be built.
4. Right-click on the PETSc symlink in the project explorer, and choose “Exclude from build…” to make
sure Eclipse does not try to compile PETSc with the project.
For further examples of using Eclipse with a PETSc-based application, see the documentation for LaMEM3 .
3 See the doc/ directory at https://bitbucket.org/bkaus/lamem
This information was provided by Mohammad Mirzadeh. The Qt Creator IDE is part of the Qt SDK,
developed for cross-platform GUI programming using C++. It is available under GPL v3, LGPL v2 and a
commercial license and may be obtained, either as part of the Qt SDK or as stand-alone software. It supports
automatic makefile generation using cross-platform qmake and CMake build systems as well as allowing one
to import projects based on existing, possibly hand-written, makefiles. Qt Creator has a visual debugger
using GDB and LLDB (on Linux and OS X) or Microsoft’s CDB (on Microsoft Windows) as backends. It
also has an interface to Valgrind’s “memcheck” and “callgrind” tools to detect memory leaks and profile
code. It has built-in support for a variety of version control systems including git, mercurial, and subversion.
Finally, Qt Creator comes fully equipped with auto-completion, function look-up, and code refactoring tools.
This enables one to easily browse source files, find relevant functions, and refactor them across an entire
project.
Creating a Project
When using Qt Creator with qmake, one needs a .pro file. This configuration file tells Qt Creator about
all build/compile options and locations of source files. One may start with a blank .pro file and fill in
configuration options as needed. For example:
INCLUDEPATH += $$PETSC_INCLUDES
LIBS += $$PETSC_LIBS
CONFIG += link_pkgconfig
PKGCONFIG += PETSc
# OTHER_FILES are ignored during compilation but will be shown in file panel in Qt Creator
OTHER_FILES += \
To use PETSc from Microsoft Visual Studio, one would have to compile a PETSc example with its corre-
sponding makefile and then transcribe all compiler and linker options used in this build into a Visual Studio
project file, in the appropriate format in Visual Studio project settings.
See doc_macos_install for the standard Unix command line tools approach to development on macOS.
The information below is only if you plan to write code within the Xcode IDE.
This chapter introduces additional features of the PETSc matrices and solvers.
The routine MatCreateSubMatrix(Mat A, IS rows, IS cols, MatReuse call, Mat *B) extracts the rows
and cols of the matrix A into B. If call is MAT_INITIAL_MATRIX it will create
the matrix B. If call is MAT_REUSE_MATRIX it will reuse the B created with a previous call.
One can also extract one or more submatrices per MPI rank with MatCreateSubMatrices().
This extracts n (zero or more) matrices with the rows[k] and cols[k] of the matrix A into an array
of sequential matrices B[k] on this process. If call is MAT_INITIAL_MATRIX it will create the array of
matrices B. If call is MAT_REUSE_MATRIX it will reuse the B created with a previous call. The IS arguments
are sequential. The array of matrices should be destroyed with MatDestroySubMatrices().
Each submatrix may be parallel, existing on an MPI_Comm associated with each pair of IS rows[k] and
cols[k], using MatCreateSubMatricesMPI(), where collections of non-overlapping MPI ranks share a
single parallel matrix on their sub-communicator. The routine MatCreateRedundantMatrix() creates
nsubcomm copies of the entire matrix, one stored on each subcommunicator. The routine
PetscSubcommCreate() and its PetscSubcomm object may, but need not be, used to construct the
subcommunicators.
Normally, PETSc users will access the matrix solvers through the KSP interface, as discussed in KSP: Linear
System Solvers, but the underlying factorization and triangular solve routines are also directly accessible to
the user.
The ILU, LU, ICC, Cholesky, and QR matrix factorizations are split into two or three stages depending on
the user’s needs. The first stage is to calculate an ordering for the matrix. The ordering generally is done
to reduce fill in a sparse factorization; it does not make much sense for a dense matrix. The ordering is
obtained with MatGetOrdering(Mat A, MatOrderingType type, IS *rowperm, IS *colperm).
Here A is the matrix for which we wish to generate a new ordering, type may be ignored, and rowperm and
colperm are the row and column permutations generated by the ordering routine. The user registers an
ordering routine with the command MatOrderingRegister(const char *ordname, PetscErrorCode (*routine)(Mat, MatOrderingType, IS *, IS *)).
The input argument ordname is a string of the user’s choice, either an ordering defined in petscmat.h or
the name of a new ordering introduced by the user. See the code in src/mat/impls/order/sorder.c
and other files in that directory for examples on how the reordering routines may be written.
Once the reordering routine has been registered, it can be selected for use at runtime with the command
line option -pc_factor_mat_ordering_type ordname. If reordering from the API, the user should
provide the ordname as the second input argument of MatGetOrdering().
PETSc matrices interface to a variety of external factorization/solver packages via the Mat-
SolverType which can be MATSOLVERSUPERLU_DIST, MATSOLVERMUMPS, MATSOLVERPASTIX,
MATSOLVERMKL_PARDISO, MATSOLVERMKL_CPARDISO, MATSOLVERUMFPACK, MATSOLVERCHOLMOD,
MATSOLVERKLU, MATSOLVERCUSPARSE, MATSOLVERCUSPARSEBAND, MATSOLVERCUDA, and MAT-
SOLVERKOKKOSDEVICE. The last three of which can run on GPUs, while MATSOLVERSUPERLU_DIST
can partially run on GPUs. See doc_linsolve for a table of the factorization based solvers in PETSc.
Most of these packages compute their own orderings and cannot use ones provided so calls to the following
routines with those packages can pass NULL as the IS permutations.
The following routines perform incomplete and complete, in-place, symbolic, and numerical factorizations
for symmetric and nonsymmetric matrices, respectively:
The argument info->fill > 1 is the predicted fill expected in the factored matrix, as a ratio of the
original fill. For example, info->fill=2.0 would indicate that one expects the factored matrix to have
twice as many nonzeros as the original.
For sparse matrices it is very unlikely that the factorization is actually done in-place. More likely, new space
is allocated for the factored matrix and the old space deallocated, but to the user it appears in-place because
the factored matrix replaces the unfactored matrix.
The two factorization stages can also be performed separately, by using the preferred out-of-place mode:
first one obtains the matrix object that will hold the factor using MatGetFactor(), and then one performs the
symbolic and numeric factorizations with routines such as MatLUFactorSymbolic() and
MatLUFactorNumeric() (or their Cholesky, ICC, ILU, and QR counterparts).
In this case, the contents of the matrix result is undefined between the symbolic and numeric factoriza-
tion stages. It is possible to reuse the symbolic factorization. For the second and succeeding factorizations,
one simply calls the numerical factorization with a new input matrix and the same factored result ma-
trix. It is essential that the new input matrix have exactly the same nonzero structure as the original
factored matrix. (The numerical factorization merely overwrites the numerical values in the factored ma-
trix and does not disturb the symbolic portion, thus enabling reuse of the symbolic phase.) In general,
calling XXXFactorSymbolic with a dense matrix will do nothing except allocate the new matrix; the
XXXFactorNumeric routines will do all of the work.
Why provide the plain XXXfactor routines when one could simply call the two-stage routines? The answer
is that if one desires in-place factorization of a sparse matrix, the intermediate stage between the symbolic and
numeric phases cannot be stored in a result matrix, and it does not make sense to store the intermediate
values inside the original matrix that is being transformed. We originally made the combined factor routines
do either in-place or out-of-place factorization, but then decided that this approach was not needed and
could easily lead to confusion.
We do not provide our own sparse matrix factorization with pivoting for numerical stability. This is because
trying to both reduce fill and do pivoting can become quite complicated. Instead, we provide a poor stepchild
substitute. After one has obtained a reordering with MatGetOrdering(Mat A, MatOrderingType type, IS *row, IS *col),
one may call MatReorderForNonzeroDiagonal(Mat A, PetscReal tol, IS row, IS col),
which will try to reorder the columns to ensure that no values along the diagonal are smaller than tol in
absolute value. If small values are detected and corrected for, a nonsymmetric permutation of the rows and
columns will result. This is not guaranteed to work, but may help if one was simply unlucky in the original
ordering. When using the KSP solver interface the option -pc_factor_nonzeros_along_diagonal
<tol> may be used. Here, tol is an optional tolerance to decide if a value is nonzero; by default it is
1.e-10.
The external MatSolverType’s MATSOLVERSUPERLU_DIST and MATSOLVERMUMPS do manage numerical
pivoting internal to their API.
The external factorization packages each provide a wide range of options to choose from; details on these
may be found by consulting the manual page for the solver package, such as MATSOLVERSUPERLU_DIST.
Most of the options can be easily set via the options database even when the factorization solvers are accessed
via KSP.
Once a matrix has been factored, it is natural to solve linear systems. The following four routines enable
this process: MatSolve(), MatSolveTranspose(), MatSolveAdd(), and MatSolveTransposeAdd(). The
matrix A in these routines must have been obtained from a factorization routine; otherwise, an error will be
generated. In general, the user should use the KSP solvers introduced in the next chapter rather than using
these factorization and solve routines directly.
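A hedged sketch of using these routines directly (A, b, and x are assumed to be existing objects, and the
ordering and MatFactorInfo setup shown earlier is elided):

Mat F;
PetscCall(MatGetFactor(A, MATSOLVERPETSC, MAT_FACTOR_LU, &F));
PetscCall(MatLUFactorSymbolic(F, A, rowperm, colperm, &info));
PetscCall(MatLUFactorNumeric(F, A, &info));
PetscCall(MatSolve(F, b, x));
PetscCall(MatDestroy(&F));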
Some of the factorizations also support solves with multiple right-hand sides stored in a Mat using
MatMatSolve() and MatMatSolveTranspose().
Finally, MATSOLVERMUMPS provides access to Schur complements obtained after partial factorizations as
well as the inertia of a matrix via MatGetInertia().
PETSc matrices provide code for computing various matrix-matrix products. This section will introduce the
two sets of routines available. For now consult MatCreateProduct() and MatMatMult().
Users obtain their preconditioner contexts from the KSP context with the command KSPGetPC(). It is
possible to create, manipulate, and destroy PC contexts directly, although this capability should rarely be
needed. To create a PC context, one uses the command PCCreate(MPI_Comm comm, PC *pc).
The routine PCSetOperators(PC pc, Mat Amat, Mat Pmat)
sets the matrices that are to be used with the preconditioner. The routine PCApply(PC pc, Vec x, Vec y)
applies the preconditioner once.
In particular, for a preconditioner matrix B that has been set via PCSetOperators(pc, Amat, Pmat),
the routine PCApply(pc, x, y) computes y = B^{-1} x by solving the linear system By = x with the specified
preconditioner method.
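A minimal sketch of standalone use of a PC (A is an assembled matrix; x and y are existing vectors):

PC pc;
PetscCall(PCCreate(PETSC_COMM_WORLD, &pc));
PetscCall(PCSetType(pc, PCJACOBI));
PetscCall(PCSetOperators(pc, A, A));
PetscCall(PCSetUp(pc));
PetscCall(PCApply(pc, x, y)); /* y = B^{-1} x */
PetscCall(PCDestroy(&pc));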
Additional preconditioner routines are PCApplyBAorAB(), PCApplyBAorABTranspose(), and
PCApplyRichardson() (whose final argument is a PCRichardsonConvergedReason*).
The first two routines apply the action of the matrix followed by the preconditioner or the preconditioner
followed by the matrix, depending on whether the side is PC_LEFT or PC_RIGHT. The final routine applies
its iterations of Richardson's method. These three routines are provided to improve efficiency for certain
Krylov subspace methods.
A PC context that is no longer needed can be destroyed with the command
PCDestroy(PC *pc);
Users should set $PETSC_DIR and $PETSC_ARCH before running the tests, or can provide them on the
command line as below.
To check if the libraries are working, run make check from $PETSC_DIR.
For comprehensive testing of builds, the general invocation from $PETSC_DIR is make test. For testing
a configuration that used the --prefix option, the analogous invocation is run from the installation
(prefix) directory,
which will create/use the directories tests/* in the current directory for generated test files. You may
pass an additional argument TESTDIR=mytests to place these generated files elsewhere.
For a full list of options, use
$ make help-test
Depending on your machine's configuration, running the full test suite (above) can take from a few minutes
to a couple of hours. Note that currently we do not have a mechanism for automatically running the test
suite on batch computer systems except to obtain an interactive compute node (via the batch system) and
run the tests on that node (this assumes that the compilers are available on the interactive compute nodes).
The test reporting system classifies results according to the Test Anything Protocol (TAP)11. In brief, the
categories are
• ok The test passed.
• not ok The test failed.
11 See https://testanything.org/tap-specification.html
• not ok #SKIP The test was skipped, usually because build requirements were not met (for example,
an external solver library was required, but PETSc was not configured for that library).
• ok #TODO The test is under development by the developers.
The tests are a series of shell scripts, generated from information contained within the test source files, that
are invoked by the makefile system. The tests are run in $PETSC_DIR/$PETSC_ARCH/tests with the
same directory structure as the source tree underneath. For testing installs, the default location is ${PREFIX_DIR}/
tests, but this can be changed with the TESTDIR location (see Directory Structure). A label is used to
denote where each test can be found within the source tree. For example, the test vec_vec_tutorials-ex6
can be run with
$ $PETSC_DIR/$PETSC_ARCH/tests/vec/vec/tutorials/runex6.sh
These shell scripts can be run independently in those directories, and take arguments to show the commands
run, change arguments, etc. Use the -h option to the shell script to see these options.
Often, you want to run only a subset of tests. Our makefiles use gmake's wildcard syntax. In this syntax,
% is a wildcard character and is passed in using the search argument. Two wildcard characters cannot
be used in a search, so the searchin argument is used to provide the equivalent of a %pattern% search.
The examples have default arguments, and we often wish to test them with various arguments; we use the
argsearch argument for these searches. Like searchin, it does not use wildcards, but rather checks
whether the string occurs within the arguments.
Some examples are:
It is useful, before invoking the tests, to see what targets will be run. The print-test target helps with
this. To see all of the test targets that would be run, this command can be used:
$ make print-test
To learn more about the test system details, one can look at the PETSc developers documentation.
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240
Argonne, IL 60439
www.anl.gov