

(26 May 2004)

**************************************
* *
* Section 5 - Programmer's Reference *
* *
**************************************

This section describes features of GAMESS programming
which are true for all machines. See the section 'hardware
specifics' for information about specific machines. The
contents of this section are:

Installation overview
Running Distributed Data Parallel GAMESS
     parallelization history
     DDI compute and data server processes
     memory allocations and check jobs
     representative performance examples
Altering program limits
Names of source code modules
Programming Conventions
Parallel broadcast identifiers
Disk files used by GAMESS
Contents of the direct access file 'DICTNRY'

Installation overview

Very specific compiling directions are given in a file
provided with the GAMESS distribution, namely
~/gamess/misc/readme.unix
and this should be followed closely. The directions here
are of a more general nature.

Before starting the installation, you should also look
at the pages about your computer in the 'Hardware Specifics'
section of this manual, and at the compiler version notes
that are kept in the script ‘comp’. There might be some
special instructions for your machine.

The first step in installing GAMESS should be to print
the manual. If you are reading this, you've got that
done! The second step would be to get the source code
activator compiled and linked (note that the activator
must be activated manually before it is compiled). Third,
you should now compile all the quantum chemistry sources.
Fourth, compile the DDI message passing library, and its
process kickoff program. Fifth, link the GAMESS program.
Finally, run all the short examples provided with GAMESS,
and very carefully compare the key results shown in the
'sample input' section against your outputs. These
"correct" results are from an IBM RS/6000, so there may be
very tiny (last digit) precision differences for other
machines. That's it! The rest of this section gives a
little more detail about some of these steps.

* * * * *

GAMESS will run on essentially any machine with a
FORTRAN 77 compiler. However, even given the F77 standard
there are still a number of differences between various
machines. For example, some chips still use 32 bit
integers, as primitive as that may seem, while many chips
now allow for 64 bit processing (and hence large run-time
memory usage). It is also necessary to have a C compiler,
as the message passing library is implemented entirely in
that language.

Although there are many types of computers, there is
only one (1) version of GAMESS.

This portability is made possible mainly by keeping
machine dependencies to a minimum (that is, writing in
FORTRAN 77, not vendor specific language extensions). The
unavoidable few statements which do depend on the hardware
are commented out, for example, with "*I64" in columns
1-4. Before compiling GAMESS on a 64 bit machine, these
four columns must be replaced by 4 blanks. The process of
turning on a particular machine's specialized code is
dubbed "activation".

A semi-portable FORTRAN 77 program to activate the
desired machine dependent lines is supplied with the
GAMESS package as program ACTVTE. Before compiling ACTVTE
on your machine, use your text editor to activate the very
few machine dependent lines in it.
Be careful not to change the DATA initialization!

* * * * *

The quantum chemistry source code of GAMESS is in the
directory
~/gamess/source
and consists almost entirely of unactivated FORTRAN source
code, stored as *.src. There is a bit of C code in this
directory to implement runtime memory allocation.

The task of building an executable for GAMESS is:

          activate        compile          link
   *.SRC ---------> *.FOR ---------> *.OBJ ---------> *.EXE
   source          FORTRAN          object         executable
    code             code             code            image
where the intermediate files *.FOR and *.OBJ are discarded
once the executable has been linked. It may seem odd at
first to delete FORTRAN code, but this can always be
reconstructed from the master source code using ACTVTE.

The advantage of maintaining only one master version
is obvious. Whenever any improvements are made, they are
automatically in place for all the currently supported
machines. There is no need to make the same changes in a
plethora of other versions.

* * * * *

The Distributed Data Interface (DDI) is the message
passing layer, supporting the parallel execution of GAMESS.
It is stored in the directory tree
~/gamess/ddi
It is necessary to compile this software, even if you don’t
intend to run on more than one processor. This directory
contains a file readme.ddi with directions about compiling,
and customizing your computer to enable the use of System V
memory allocation routines. It also has information about
some high end parallel computer systems.

* * * * *

The control language needed to activate, compile, and
link GAMESS on your brand of computer involves several
scripts, namely:
   COMP    will compile a single quantum chemistry module.
   COMPALL compiles all quantum chemistry source modules.
   COMPDDI will compile the distributed data interface, and
           generate the process kickoff program ddikick.x.
   LKED    will link-edit together the quantum chemistry
           object code, and the DDI library, to produce a
           binary executable games.x.
   RUNGMS  will run a GAMESS job, in serial or parallel.
   RUNALL  uses RUNGMS to run all the example jobs.
There are files related to some utility programs:
MBLDR.* model builder (internal to Cartesian)
CARTIC.* Cartesian to internal coordinates
CLENMO.* cleans up $VEC groups
DK3.F prepare relativistic AO contractions.
There are files related to X windows graphics, in:
~/gamess/graphics
although if you have a Macintosh (lucky you!), you should
obtain Brett Bode’s MacMolPlt program which has the same
capabilities, and much more.

Running Distributed Data Parallel GAMESS

GAMESS consists of many FORTRAN files implementing its
quantum chemistry, and some C language files implementing
the Distributed Data Interface (DDI). The directions for
compiling DDI, configuring the system parameters to permit
execution of DDI programs, and how to use the ‘ddikick.x’
program which “kicks off” GAMESS processes may be found in
‘readme.ddi’. If you are not the person installing the
GAMESS software, you can skip reading that.

Efficient use of GAMESS requires an understanding of
three critical issues: The first is the difference between
two types of memory (replicated MEMORY and distributed
MEMDDI) and how these relate to the physical memory of the
computer which you are using. Second, you must understand
to some extent the degree to which each type of computation
scales so that the proper number of nodes is selected.
Finally, many systems run -two- GAMESS processes on every
processor, and if you read on you will find out why this is
so.

Since all code needed to implement the Distributed Data
Interface (DDI) is provided with the GAMESS source code
distribution, the program compiles and links ready for
parallel execution on all machine types. Of course, you
may choose to run on only one processor, in which case
GAMESS will behave as if it is a sequential code, and the
full functionality of the program is available.

parallelization history

We began to parallelize GAMESS in 1991 as part of the
joint ARPA/Air Force piece of the Touchstone Delta project.
Today, nearly all ab initio methods run in parallel,
although some of these still have a step or two running
sequentially only. Only the RHF+CI gradients have no
parallel method coded. We have not parallelized the semi-
empirical MOPAC runs, and probably never will. Additional
parallel work occurred as a result of a DoD CHSSI software
initiative in 1996. This led to the DDI-based parallel
RHF+MP2 gradient program, after development of the DDI
programming toolkit itself. Since 2002, the DoE program
SciDAC has sponsored additional parallelization. The DDI
toolkit has been used since its 1999 introduction to add
codes for UHF+MP2 gradient, ROHF+ZAPT2 energy, and MCSCF
wavefunctions as well as their analytic Hessians or MCQDPT2
energy correction.

In 1991, the parallel machine of choice was the Intel
Hypercube, although small clusters of workstations could
also be used as a parallel computer. In order to have
the best blend of portability and functionality, we chose
in 1991 to use the TCGMSG message passing library rather
than one of the early vendors' specialized libraries. As
the major companies began to market parallel machines, and
as MPI version 1 emerged as a standard, we began to use
MPI on some equipment in 1996, while still using the very
resilient TCGMSG library on everything else. However, in
June 1999, we retired our old friend TCGMSG when the
message passing library used by GAMESS changed to the
Distributed Data Interface, or DDI. An SMP-optimized
version of DDI was included with GAMESS in April 2004.

Three people have been extremely influential upon the
current parallel methodology. Theresa Windus, a graduate
student in the early 1990s, created the first parallel
versions. Graham Fletcher, a postdoc in the late 1990s,
is responsible for the addition of distributed data
programming concepts. Ryan Olson rewrote the DDI software
in 2003-4 to support the modern SMP architectures well, and
this was released in April 2004 as our standard message
passing implementation.

DDI compute and data server processes

DDI contains the usual parallel programming calls, such
as initialization/closure, point to point messages, and
the collective operations global sum and broadcast. These
simple parts of DDI support all parallel methods developed
in GAMESS from 1991-1999, which were based on replicated
storage rather than distributed data. However, DDI also
contains additional routines to support distributed memory
usage.

DDI attempts to exploit the entire system in a scalable
way. While our early work concentrated on exploiting the
use of p processors and p disks, it required that all data
in memory be replicated on every one of the p nodes. The
use of memory also becomes scalable only if the data is
distributed across the aggregate memory of the parallel
machine. The concept of distributed memory is contained in
the Remote Memory Access portion of MPI version 2, but so
far MPI-2 is not available from American computer vendors.
The original concept of distributed memory was implemented
in the Global Array toolkit of Pacific Northwest National
Laboratory (see http://www.emsl.pnl.gov/pub/docs/global).

Basically, the idea is to provide three subroutine calls
to access memory on remote nodes: PUT, GET, and ACCUMULATE.
These give access to a class of memory which is assumed to
be slower than local memory, but faster than disk:

 <--- fastest                                   slowest --->
 registers cache(s) local_memory remote_memory disks tapes
 <--- smallest                                  biggest --->

Because DDI accesses memory on other nodes by means of an
explicit subroutine call, the programmer is aware that a
message must be transmitted. This awareness of the access
overhead should encourage algorithms that transfer many
data items in a single message. Use of a subroutine call
to reach remote memory is a recognition of the non-uniform
memory access (NUMA) nature of parallel computers. In
other words, the Distributed Data Interface (DDI) is an
explicitly message passing implementation of global shared
memory.

In order to have one node pass data items to a second
node when the second node needs them, without significant
delay, the computing job on the first node must interrupt
its computation briefly to furnish the data. This type of
communication is referred to as "one sided messages" or
"active messages" since the first node is an unwitting
participant in the process, which is driven entirely by the
requirements of the second node.

The Cray T3E has a library named SHMEM to support this
type of one sided messages (and good hardware support for
this too) so, on the T3E, GAMESS runs as a single process
per CPU. Its memory image looks like this:

node 0 node 1
p=0 p=1
--------------- ---------------
| GAMESS | | GAMESS |
| quantum | | quantum |
| chem code | | chem code |
--------------- ---------------
| DDI code | | DDI code |
--------------- --------------- input keywords:
| replicated | | replicated | <-- MEMORY
| data | | data |
-----------------------------------------
| | | | | | <-- MEMDDI
| | distributed| | distributed | |
| | data | | data | |
| | | | | |
| | | | | |
| | | | | |
| --------------- --------------- |
-----------------------------------------

where the box drawn around the distributed data is meant to
imply that a large data array is residing in the memory of
all nodes (in this example, half on one and half on the
other).

Note that the input keyword MEMORY gives the amount of
storage used to duplicate small matrices on every node,
while MEMDDI gives the -total- distributed memory required
by the job. Thus, if you are running on p nodes, the
memory that is used on any given node is

total on any 1 node = MEMORY + MEMDDI/p

Since MEMDDI is very large, its units are in millions of
words. The keyword MEMORY is in units of words (64 bit
quantity) and so you must either convert units carefully
or use the MWORDS synonym for MEMORY (for which the units
are also millions of words). Since good execution speed
requires that you not exceed the physical memory belonging
to your nodes, it is important to understand that when
MEMDDI is large, you will need to choose a sufficiently
large number of nodes to keep the memory on each node
reasonable.
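
As a purely illustrative arithmetic example of these
units (the numbers are not from any particular job): a
request of MWORDS=10 is the same thing as MEMORY=10000000,
and occupies
     10,000,000 words * 8 bytes/word = 80 MBytes
of replicated storage on every compute process. If that
same job also asks for MEMDDI=200 (200 million words, or
1.6 GBytes in aggregate) and runs on p=8 nodes, each node
must supply about
     10 + 200/8 = 35 million words = 280 MBytes
of physical memory, in accord with the formula above.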

To repeat, the DDI philosophy is to add more processors
not just for their compute performance or extra disk space,
but also to aggregate a very large total memory. Bigger
problems will require more nodes to obtain sufficiently
large total memories! We will give an example of how you
can estimate the number of nodes a little ways below.

If the GAMESS task running as process p=1 in the above
example needs some values previously computed, it issues a
call to DDI_GET. The DDI routines in process p=1 then
figure out where this "patch" of data in the big rectangular
distributed storage actually resides. Suppose this is on
process p=0. The DDI routines in p=1 send a message to
p=0 to interrupt its computations, after which p=0 sends a
bulk data message to process p=1's buffer. This buffer
resides in part of the replicated storage of p=1, where
computations can occur. Note that the quantum chemistry
layer of process p=1 was sheltered from most of the details
regarding which node owned the patch of data that process
p=1 wanted to obtain. These details are managed by the
DDI layer.

Note that with the exception of DDI_ACC's addition of
new terms into a distributed array, no arithmetic is done
directly upon the distributed data. Instead, distributed
data is accessed only by DDI_GET, DDI_PUT (its counterpart
for storage of data items), and DDI_ACC (which accumulates
new terms into the distributed data). DDI_GET and DDI_PUT
can be thought of as analogous to FORTRAN READ and WRITE
statements that transfer data between disk storage and
local memory where computations may occur.

It is the programmer's challenge to minimize the
number of GET/PUT/ACC calls, and to design algorithms that
maximize the chance that the patches of data are actually
within the local node's portion of the distributed data.
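
The following FORTRAN fragment is only a schematic
sketch of this GET/compute/ACC pattern. The DDI routine
names are the real ones discussed above, but the exact
argument lists, and the subroutine name DDIDEM, are merely
assumed here for illustration; consult the DDI source code
and readme.ddi for the true calling sequences.

      SUBROUTINE DDIDEM(IDDIH,ILO,IHI,JLO,JHI,BUF)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      DIMENSION BUF(IHI-ILO+1,JHI-JLO+1)
C        COPY A PATCH OF THE DISTRIBUTED ARRAY -IDDIH- INTO
C        THE REPLICATED BUFFER -BUF-, WHERE ARITHMETIC OCCURS.
      CALL DDI_GET(IDDIH,ILO,IHI,JLO,JHI,BUF)
C        ... FORM NEW TERMS IN -BUF-, USING LOCAL MEMORY ...
C        ADD THE NEW TERMS BACK INTO THE DISTRIBUTED ARRAY.
      CALL DDI_ACC(IDDIH,ILO,IHI,JLO,JHI,BUF)
      RETURN
      END

A single GET that moves a big patch like this is far cheaper
than many small GET calls for its individual elements.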

Since the SHMEM library is available only on a few
machines, all other platforms adopt the following memory
model, which involves -two- GAMESS processes running on
every processor:

node 0 node 1
p=0 p=1
--------------- ---------------
| GAMESS X| | GAMESS X| compute
| quantum | | quantum | processes
| chem code | | chem code |
--------------- ---------------
| DDI code | | DDI code | Input keyword:
--------------- ---------------
| replicated | | replicated | <-- MEMORY
| data | | data |
--------------- ---------------

p=2 p=3
--------------- ---------------
| GAMESS | | GAMESS | data
| quantum | | quantum | servers
| chem code | | chem code |
--------------- ---------------
| DDI code X| | DDI code X|
--------------- ---------------
----------------------------------------- Input keyword:
| | | | | | <-- MEMDDI
| | distributed| | distributed | |
| | data | | data | |
| | | | | |
| | | | | |
| | | | | |
| --------------- --------------- |
-----------------------------------------

The first half of the processes do quantum chemistry, and
the X indicates that they spend most of their time
executing some sort of chemistry. Hence the name "compute
process". Soon after execution, the second half of the
processes call a DDI service routine which consists of an
infinite loop to deal with GET, PUT, and ACC requests until
such time as the job ends. The X shows that these "data
servers" execute only DDI support code. (This makes the
data server's quantum chemistry routines the equivalent of
the human appendix). The whole problem of interrupts is now
in the hands of the operating system, as the data servers
are distinct processes. To follow the same example as
before, when the compute process p=1 needs data that turns
out to reside on node 0, a request is sent to the data
server p=2 to transfer information back to the compute
process p=1. The compute process p=0 is completely unaware
that such a transaction has occurred.

The formula for the memory required by any single node
is unchanged, if p is the total number of nodes used,
total on any 1 node = MEMORY + MEMDDI/p.

As a technical matter, if you are running on a system
where all processors are in the same node (the SGI Altix is
an example), or if you are running on an IBM SP where LAPI
assists in implementing one-sided messaging, then the data
server processes are not started. The memory model in the
illustration above is correct, if you just mentally omit
the data server processes from it. In all cases, where the
SHMEM library is not used, the distributed arrays are
created by System V memory calls, shmget/shmat, and their
associated semaphore routines. Your system may need to
be reconfigured to allow allocation of large shared memory
segments, see ‘readme.ddi’ for more details.

memory allocations and check jobs

At present, not all runs require distributed memory.
For example, in an SCF computation (no hessian or MP2 to
follow) the memory needed is on the order of the square of
the basis set size, for such quantities as the orbital
coefficients, density, Fock, overlap matrices, and so on.
These are simply duplicated on every node in the MEMORY
region. In this case the data server processes still run,
but are dormant because no distributed memory access is
attempted.
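
To put a rough number on "order of the square of the
basis set size": for the 294 AO luciferin example shown
below, a square matrix such as the orbital coefficients
holds 294*294 = 86,436 words, and a triangular matrix such
as the overlap holds (294*294+294)/2 = 43,365 words, which
is about 0.7 and 0.35 MBytes respectively, so even several
dozen such replicated matrices fit comfortably into a few
MWORDS of MEMORY.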

However, closed and open shell MP2 calculations, MCSCF
wavefunctions, and their analytic hessian or MCQDPT energy
correction do use distributed memory when run in parallel.
Thus it is important to know how to obtain the correct
value for MEMDDI in a check run.

Check runs (EXETYP=CHECK) need to run quickly, and
the fastest turn around always comes on one node only.
Runs which do not currently exploit MEMDDI distributed
storage will formally allocate their MEMORY needs, and
feel out their storage needs while skipping almost all of
the real work. Since MEMORY is replicated, the amount
that is needed on 1 node remains unchanged if you later
do the true computation on more than 1 node.

Check jobs which involve MEMDDI storage are a little
bit trickier. As noted, we want to run on only 1 node
to get fast turn around. However, MEMDDI is typically a
large amount of memory, and this is unlikely to be
available on a single node. The solution is that the
data server process does not actually allocate the
MEMDDI storage; instead it just remembers what you gave
as input and checks to see if this will be adequate. So,
you can input MEMDDI=1000 (1000 million words is equal
to 1,000 * 1,000,000 * 8 bytes = 8 GBytes) and run this
check job on a computer with only 256 MB of RAM.

Of course, the actual computation will have to run on
a large number of such processors. Let us continue with
this example of a run requiring 8 GBytes of distributed
data on 256 MB nodes. Suppose that MEMORY is 2500000 in
this case (when MEMDDI is used, MEMORY is typically just
a few million words). We need to reserve some memory
for the operating system (16 MBytes, say) and for the
GAMESS program and local storage (approx 16 MB, it is a
big program, and the compute processes should be swapped
into memory). Thus our hypothetical 256 MB node has
224 MB available, assuming no one else is running. The
rest of the computation proceeds in megawords (millions
of words), so the available memory per node is 224/8 = 28.
We must
choose the number of processors p to satisfy
needed <= available
MEMORY + MEMDDI/p <= free physical memory
2.5 + 1000/p <= 28
so this example requires p >= 39 compute processes.

One more subtle point about CHECK runs with MEMDDI is
that since you are running on 1 node only, the code does
not know that you wish to run the parallel algorithm
instead of the sequential algorithm. You must force the
CHECK job into the parallel section of the program by
$system parall=.true. $end
There's no harm leaving this line in for the true runs,
as any job with more than one compute process is parallel
regardless of the input value PARALL.
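
For instance, a check job using the memory figures from
the example above might contain an input fragment along the
following lines; only the memory and check related keywords
are shown here, and MPLEVL=2 is merely one of the run types
that uses MEMDDI:

 $CONTRL EXETYP=CHECK MPLEVL=2 $END
 $SYSTEM MEMORY=2500000 MEMDDI=1000 PARALL=.TRUE. $END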

The check run for MCQDPT jobs will print three times
a line like this
MAXIMUM MEMDDI THAT CAN BE USED IN ... IS x MWORDS
Typically the 2nd such step, transforming over all
occupied and virtual canonical orbitals, will be the
largest of the three requirements. Its size can be
guesstimated before running, as
(Nao*Nao+Nao)/2 * ((Nocc*Nocc+Nocc)/2 + Nocc*Nvirt)
where Nocc = NMOFZC+NMODOC+NMOACT, Nvirt=NMOEXT, and
Nao is the size of the atomic basis. Unlike the closed
shell MP2 program, this section still does extensive
I/O operations even when MEMDDI is used, so it may be
useful to consider the three input keywords DOORD0,
PARAIO, and DELSCR when running this code.
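
As a purely hypothetical illustration of this estimate
(the orbital counts below do not refer to any job in this
section), suppose Nao=300, Nocc=50, and Nvirt=250. Then
     (300*300+300)/2 * ((50*50+50)/2 + 50*250)
        = 45,150 * 13,775 = about 622 million words,
so the guesstimate for this transformation step is a MEMDDI
of roughly 625, which is about 5 GBytes.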

representative performance examples

This section describes the way in which the various
quantum chemistry computations run in parallel, and shows
some typical performance data. This should give you as the
user some idea how many nodes can be efficiently used for
various SCFTYP and RUNTYP jobs.

The performance data you will see below were obtained
on a 16 node Intel Pentium II Linux (Beowulf-type) cluster
costing $49,000, of which $3,000 went into the switched
Fast Ethernet component. 512 MB/node means this cluster
has an aggregate memory of 8 GB. For more details, see
http://www.msg.ameslab.gov/GAMESS/dist.pc.shtml.
This is a low quality network, so jobs with higher
communication requirements reveal themselves by wall
times noticeably longer than their CPU times.

---

The HF wavefunctions can be evaluated in parallel using
either conventional disk storage of the integrals, or via
direct recomputation of the integrals. Some experimenting
will show which is more effective on your hardware. As an
example of the scaling performance of RHF, ROHF, UHF, or
GVB jobs that involve only computation of the energy or its
gradient, we include here a timing table from the 16 node
PC cluster. The molecule is luciferin, which together with
the enzyme luciferase is involved in firefly light
production. The chemical formula is C11N2S2O3H8, and
RHF/6-31G(d) has 294 atomic orbitals. There's no molecular
symmetry. The run is done as direct SCF, and the CPU
timing data is

                  p=1     p=2     p=4     p=8    p=16
1e- ints          1.1     0.6     0.4     0.3     0.2
Huckel guess       14      12      11      10      10
15 RHF iters     5995    2982    1493     772     407
properties        6.0     6.6     6.6     6.8     6.9
1e- gradient      9.7     4.7     2.3     1.2     0.7
2e- gradient     1080     541     267     134      68
               ------  ------  ------  ------  ------
total CPU        7106    3547    1780     925     492 seconds
total wall       7107    3562    1815     950     522 seconds

Note that direct SCF should run with the wall time very
close to the CPU time as there is essentially no I/O and
not that much communication (MEMDDI storage is not used by
this kind of run). Running the same molecule as
DFTTYP=B3LYP yields

                  p=1     p=2     p=4     p=8    p=16
1e- ints          1.1     0.7     0.3     0.3     0.2
Huckel guess       14      12      10      10       9
23 DFT iters    14978    7441    3681    1876     961
properties        6.6     6.4     6.5     7.0     6.5
1e- gradient      9.7     4.7     2.3     1.3     0.7
2e- grid grad    5232    2532    1225     595     303
2e- AO grad      1105     550     270     136      69
               ------  ------  ------  ------  ------
total CPU       21347   10547    5197    2626    1349
total wall      21348   10698    5368    2758    1477

and finally if we run an RHF analytic hessian, using AO
basis integrals, the result is

                  p=1     p=2     p=4     p=8    p=16
1e- ints          1.2     0.6     0.4     0.3     0.2
Huckel guess       14      12      10      10      10
14 RHF iters     5639    2851    1419     742     390
properties        6.4     6.5     6.6     7.0     6.7
1e- grd+hss      40.9    20.9    11.9     7.7     5.8
2e- grd+hss     21933   10859    5296    2606    1358
CPHF            40433   20396   10016    5185    2749
               ------  ------  ------  ------  ------
total CPU       68059   34146   16760    8559    4519
total wall      68102   34273   17430    9059    4978

CPU speedups for 1->16 processors for RHF gradient, DFT
gradient, and RHF analytic hessian are 14.4, 15.8, and
15.1, respectively. The wall clock times are close
to the CPU times, indicating very little communication is
involved. If you are interested in an explanation of how
the parallel SCF is implemented, see the main GAMESS paper,
M.W.Schmidt, K.K.Baldridge, J.A.Boatz, S.T.Elbert,
M.S.Gordon, J.H.Jensen, S.Koseki, N.Matsunaga,
K.A.Nguyen, S.J.Su, T.L.Windus, M.Dupuis, J.A.Montgomery
J.Comput.Chem. 14, 1347-1363(1993)

---

The CIS energy and gradient code is also programmed to
have the construction of Fock-like matrices as its
computational kernel. Its scaling is therefore very
similar to that just shown, for porphin C20N4H14, DH(d,p)
basis, 430 AOs:
                  p=1     p=2     p=4     p=8    p=16
setup              25      25      25      25      25
1e- ints          5.1     2.7     1.5     1.0     0.6
orb. guess         30      25      23      22      21
RHF iters        1647     850     452     251     152
RHF props          19      19      19      19      19
CIS energy      36320   18166    9098    4620    2398
CIS lagrang      6092    3094    1545     786     408
CPHF            20099   10183    5163    2688    1444
CIS density      2468    1261     632     324     170
CIS props          19      19      19      19      19
1e- grad         40.9    18.2     9.2     4.7     2.4
2e- grad         1644     849     423     223     122
               ------  ------  ------  ------  ------
total CPU       68424   34526   17420    8994    4791
total wall      68443   34606   17853    9258    4985
which is a speedup of 14.3 for 1->16.

---

For the next type of computation, we discuss the MP2
correction. For closed shell RHF + MP2 and unrestricted
UHF + MP2, the gradient program runs in parallel using
distributed memory, MEMDDI. In addition, the ROHF + MP2
energy correction for OSPT=ZAPT runs in parallel using
distributed memory, but OSPT=RMP does not use MEMDDI in
parallel jobs. All distributed memory parallel MP2 runs
resemble RHF+MP2, which is therefore the only example given
here.

The example is a benzoquinone precursor to hongconin, a
cardioprotective natural product. The formula is C11O4H10,
and 6-31G(d) has 245 AOs. There are 39 valence orbitals
included in the MP2 treatment, and 15 core orbitals.
MEMDDI must be 156 million words, so the memory computation
that was used above tells us that our 512 MB/node PC
cluster must have at least three processors to aggregate
the required MEMDDI. MOREAD was used to provide converged
RHF orbitals, so only 3 RHF iterations are performed. The
timing data are CPU and wall times (seconds) in the 1st/2nd
lines:

                 p=3     p=4    p=12    p=16
RHF iters        241     181      65      51
                 243     184      69      55
MP2 step       5,953   4,399   1,438   1,098
               7,366   5,669   2,239   1,700
2e- grad       1,429   1,135     375     280
               1,492   1,183     413     305
              ------  ------  ------  ------
total CPU      7,637   5,727   1,890   1,440
total wall     9,116   7,053   2,658   2,077

                  3-->12   4-->16
CPU speedup        4.04     3.98
wall speedup       3.43     3.40

The wall clock time will be closer to the CPU time if the
quality of the network between the computers is improved
(remember, this run used just switched Fast Ethernet). As
noted, the number of nodes is influenced more by the need
to aggregate the necessary total MEMDDI than by concerns
about scalability. MEMDDI is typically large for MP2
parallel runs, as it is proportional to the number of
occupied orbitals squared times the number of AOs squared.
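
As a rough consistency check of that proportionality,
using the hongconin example above: 39 correlated occupied
orbitals and 245 AOs give
     39*39 * 245*245 = 91,298,025
or about 91 million words, which is indeed the same order
of magnitude as the 156 million words of MEMDDI the job
actually requires; the precise prefactor depends on which
arrays the code chooses to distribute.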

For more details on the distributed data parallel MP2
program, see
G.D.Fletcher, A.P.Rendell, P.Sherwood
Mol.Phys. 91, 431-438(1997)
G.D.Fletcher, M.W.Schmidt, M.S.Gordon
Adv.Chem.Phys. 110, 267-294 (1999)
G.D.Fletcher, M.W.Schmidt, B.M.Bode, M.S.Gordon
Comput.Phys.Commun. 128, 190-200 (2000)

---

The next type of computation we will consider is
analytic computation of the nuclear Hessian (force constant
matrix). The performance of the RHF program, based on AO
integrals, was given above, as its computational kernel
(Fock-like builds) scales just as the SCF itself. However,
for high spin ROHF, low spin open shell SCF and TCSCF (both
done with GVB), the only option is MO basis integrals. The
integral transformation is parallel according to
T.L.Windus, M.W.Schmidt, M.S.Gordon
Theoret.Chim.Acta 89, 77-88(1994).
It distributes ‘passes’ over nodes, so as to parallelize
the transformation’s CPU time but not the replicated
memory, or the AO integral time. Finally the response
equation step is hardly parallel at all. The test example
is an intermediate in the ring opening of silacyclobutane,
GVB-PP(1) or TCSCF, 180 AOs for 6-311G(2d,2p):
                  p=1     p=2     p=4     p=8    p=16
2e- ints           83      42      21      11       5
GVB iters         648     333     179     104      67
replicate 2e-     n/a      81      81      81      82
transf.           476     254     123      67      51
1e- grd+hss         7       4       2       2       1
2e- grd+hss      4695    2295    1165     596     313
CP-TCSCF          344     339     331     312     325
               ------  ------  ------  ------  ------
total CPU        6256    3351    1904    1189     848
total wall       6532    3538    2072    1399    1108

Clearly, the final response equation (CPHF) step is a
sequential bottleneck, as is the fact that the orbital
hessian in this step is stored entirely on the disk space
of node 0. Since the integral transformation is run in
replicated MEMORY rather than distributing this, and since
it also needs a duplicated AO integral file to be stored on
every node, the code is clearly not scalable to very many
processors. Typically we would not request more than 3
or 4 processors for an analytic ROHF or GVB hessian.

The final analytic hessian type is for MCSCF. The
scalability of the MCSCF wavefunction will be given just
below, but the response equation step for MCSCF is clearly
quite scalable. The integral transformation for the
response equation step uses distributed memory MEMDDI, and
should scale like the MP2 program (documented above). The
test case has 8e- in 8 orbitals, and the times reflect this,
with most of the work involving the 4900 determinants.
Total speedup for 4->16 is 4.11, due to luckier work
distribution for 16 CPUs:

                      p=4       p=16
MCSCF wfn           113.5      106.1
DDI transf.          68.4       19.3
1e- grd+hss           1.5        0.6
2e- grd+hss        2024.9      509.8
CPMCHF RHS          878.8      225.8   (RHS = right hand sides)
CPMCHF iters     115343.5    27885.9
                 --------   --------
total CPU        118430.8    28747.6
total wall       119766.0    30746.4

This code can clearly benefit from using many processors,
with scalability of the MCSCF step itself almost moot.

---

Now let's turn to MCSCF energy/gradient runs. We will
illustrate two convergers, SOSCF and then FULLNR. The
former uses a ‘pass’ type of integral transformation (ala
the GVB hessian job above), and runs in replicated memory
only (no MEMDDI). The FULLNR converger is based on the MP2
program’s distributed memory integral transformation, so it
uses MEMDDI. In addition, the parallel implementation of
the FULLNR step never forms the orbital hessian explicitly,
doing Davidson style iterations to predict the new
orbitals. Thus the memory demand is almost entirely
MEMDDI.

The example we choose is at a transition state for the
water molecule assisted proton transfer in the first
excited state of 7-azaindole. The formula is C7N2H6(H2O),
there are 190 AOs, and the active space is the
10 pi electrons in 9 pi orbitals of the azaindole portion.
There are 15,876 determinants used in the MCSCF
calculation, and 5,292 CSFs in the perturbation calculation
to follow. See Figure 6 of G.M.Chaban, M.S.Gordon
J.Phys.Chem.A 103, 185-189(1999) if you are interested in
this chemistry. The timing data for the SOSCF converger
are

                   p=1     p=2     p=4     p=8    p=16
dup. 2e- ints    327.6   331.3   326.4   325.8   326.5
transform.       285.1   153.6    88.4    57.8    47.3
det CI            39.3    39.4    38.9    38.3    38.1
2e- dens.          0.4     0.5     0.5     0.5     0.5
orb. update       39.2    25.9    17.4    12.8    11.0
iters 2-16      5340.0  3153.5  2043.7  1513.6  1308.5
1e- grad           5.3     2.3     1.3     0.7     0.4
2e- grad         695.6   354.9   179.4    93.2    50.9
                ------  ------  ------  ------  ------
total CPU        6,743   4,071   2,705   2,052   1,793
total wall      13,761   8,289   4,986   3,429   3,899

whereas the FULLNR converger runs like this

                   p=1     p=2     p=4     p=8    p=16
2e- DDI trans.    2547    1385     698     354     173
det. CI             39      39      38      38      38
DM2                0.5     0.5     0.5     0.5     0.5
FULLNR             660     376     194     101      51
iters 2-9        24324   13440    6942    3669    1940
1e- grad           5.3     2.3     1.2     0.7     0.4
2e- grad           700     352     181      95      51
                ------  ------  ------  ------  ------
total CPU       28,288  15,605   8,066   4,268   2,265
total wall      28,290  20,719  12,866   8,292   5,583

The first iteration is broken down into its primary steps
from the integral transformation to the orbital update,
inclusive. The SOSCF program is clearly faster, and should
be used when the number of processors is modest (say up to
8), however the largest molecules will benefit from using
more processors and the much more scalable FULLNR program.

One should note that the CI calculation was done by
CISTEP=ALDET, which is not presently scalable at all. This
doesn’t matter for small active spaces like 10e- in 9
orbitals, as you can see above, but this program’s use of
replicated memory and large CPU time for big active spaces
limits MCSCF scalability in the large active space limit.

Now let's consider the second order perturbation
correction for this example. As noted, it is an excited
state, so the test corrects two states simultaneously (S0
and S1). The parallel multireference perturbation program
is described in
H.Umeda, S.Koseki, U.Nagashima, M.W.Schmidt
J.Comput.Chem. 22, 1243-1251 (2001)
The run is given the converged S1 orbitals, so that it can
skip directly to the perturbation calculation:
               p=1     p=2     p=4     p=8    p=16
2e- ints        332     332     329     328     331
MCQDPT        87921   43864   22008   11082    5697
             ------  ------  ------  ------  ------
total CPU     88261   44205   22345   11418    6028
total wall    91508   45818   23556   12350    6852
This corresponds to a speedup for 1->16 of 14.6.

---

In summary, most ab initio computations will run in
less time on more than one node. However, some things can
be run only on 1 node, namely
semi-empirical runs
RHF+CI gradient
Coupled-Cluster calculations
Some steps run with little or no speedup, forming
sequential bottlenecks that limit scalability. They do not
prevent jobs from running in parallel, but restrict the
total number of nodes that can be effectively used:
ROHF/GVB hessians: solution of response equations
MCSCF: Hamiltonian and 2e- density matrix (CI)
energy localizations: the orbital localization step
transition moments/spin-orbit: the final property step
MCQDPT reference weight option
Future versions of GAMESS will address these bottlenecks.

A short summary of the useful number of nodes (based on
data like the above) would be
   RHF, ROHF, UHF, GVB energy/gradient, their
      DFT analogs, and CIS excited states       16-32+
   MCSCF energy/gradient
      SOSCF                                      4-8
      FULLNR                                     8-32+
   analytic hessians
      RHF                                       16-32+
      ROHF/GVB                                   4-8
      MCSCF                                     64-128+
   MPLEVL=2
      RHF, UHF, ROHF OSPT=ZAPT                   8-256+
      ROHF OSPT=RMP energy                       8
      MCSCF                                     16+

Altering program limits

Almost all arrays in GAMESS are allocated dynamically,
but some variables must be held in common as their use is
ubiquitous. An example would be the common block which
holds the basis set. The following Unix script, which we
call 'mung', changes the PARAMETER statements that set
various limitations:

#!/bin/csh
#
# automatically change GAMESS' built-in dimensions
#
chdir /u1/mike/gamess/source
#
foreach FILE (*.src)
set FILE=$FILE:r
echo ===== redimensioning in $FILE =====
echo "C 01 JAN 05 - SELECT NEW DIMENSIONS" \
> $FILE.munged
sed -e "/MXATM=500/s//MXATM=100/" \
-e "/MXAO=2047/s//MXAO=2047/" \
-e "/MXRT=100/s//MXRT=100/" \
-e "/MXSH=1000/s//MXSH=1000/" \
-e "/MXGSH=30/s//MXGSH=30/" \
-e "/MXGTOT=5000/s//MXGTOT=5000/" \
-e "/MXFRG=50/s//MXFRG=1/" \
-e "/MXDFG=5/s//MXDFG=1/" \
-e "/MXPT=100/s//MXPT=1/" \
-e "/MXSP=100/s//MXSP=1/" \
-e "/MXTS=2500/s//MXTS=1/" \
$FILE.src >> $FILE.munged
mv $FILE.munged $FILE.src
end
exit

In this script,
MXATM = max number of atoms
MXAO = max number of basis functions
MXRT = max number of CI roots
MXSH = max number of symmetry unique shells
MXGSH = max number of Gaussians per shell
MXGTOT= max number of symmetry unique Gaussians
MXFRG = max number of effective fragment potentials
MXDFG = max number of different effective fragments
MXPT = max number of effective fragment points
MXSP = max number of spheres (sfera) in PCM
MXTS = max number of tesserae in PCM

The script shows how to -minimize- memory use, by a
small decrease in the number of atoms, while turning off
the effective fragment and PCM dimensioning. Very little
memory can be saved by reducing the other adjustable
parameters. Of course, the 'mung' script can also be used
to increase the dimensions!

Names of source code modules

The source code for GAMESS is divided into a number
of sections, called modules, each of which does related
things, and is a handy size to edit. The following is a
list of the different modules, what they do, and notes on
their machine dependencies.

machine
module description dependency
------- ------------------------- ----------
ALDECI Ames Lab determinant full CI code 1
ALGNCI Ames Lab determinant general CI code
BASECP SBKJC and HW valence basis sets
BASEXT DH, MC, 6-311G extended basis sets
BASHUZ Huzinaga MINI/MIDI basis sets to Xe
BASHZ2 Huzinaga MINI/MIDI basis sets Cs-Rn
BASN21 N-21G basis sets
BASN31 N-31G basis sets
BASSTO STO-NG basis sets
BLAS level 1 basic linear algebra subprograms
CCAUX auxiliary routines for CC calculations
CCSDT renormalized CCSD(T) program 1
CHGPEN screening for charge penetration of EFPs
CISGRD CI singles and its gradient 1
COSMO conductor-like screening model
CPHF coupled perturbed Hartree-Fock 1
CPMCHF multiconfigurational CPHF 1
CPROHF open shell/TCSCF CPHF 1
DDI message passing library interface code 9
DDISHM message passing code (SHMEM interface) 9
DDIGA message passing code for Fujitsu PP 9
DELOCL delocalized coordinates
DFT grid-free DFT drivers 1
DFTAUX grid-free DFT auxiliary basis integrals
DFTEXC grid DFT functionals
DFTGRD grid DFT implementation
DFTINT grid-free DFT integrals 1
DFTFUN grid-free DFT functionals
DGEEV general matrix eigenvalue problem
DMULTI Amos' distributed multipole analysis
DRC dynamic reaction coordinate
ECP pseudopotential integrals
ECPDER pseudopotential derivative integrals
ECPLIB initialization code for ECP
ECPPOT HW and SBKJC internally stored potentials
EIGEN Givens-Householder, Jacobi diagonalization
EFDRVR fragment only calculation drivers
EFELEC fragment-fragment interactions
EFGRD2 2e- integrals for EFP numerical hessian
EFGRDA ab initio/fragment gradient integrals
EFGRDB " " " " "
EFGRDC " " " " "
EFINP effective fragment potential input
EFINTA ab initio/fragment integrals
EFINTB " " " "
EFPAUL effective fragment Pauli repulsion
EFPCOV EFP style QM/MM boundary code
EOMCC equation of motion excited state CCSD
FFIELD finite field polarizabilities
FRFMT free format input scanner
FSODCI determinant based second order CI
GAMESS main program, single point energy
and energy gradient drivers, misc.
GLOBOP Monte Carlo fragment global optimizer
GRADEX traces gradient extremals
GRD1 one electron gradient integrals
GRD2A two electron gradient integrals 1
GRD2B specialized sp gradient integrals
GRD2C general spdfg gradient integrals
GUESS initial orbital guess
GUGDGA Davidson CI diagonalization 1
GUGDGB " " " 1
GUGDM 1 particle density matrix
GUGDM2 2 particle density matrix 1
GUGDRT distinct row table generation
GUGEM GUGA method energy matrix formation 1
GUGSRT sort transformed integrals 1
GVB generalized valence bond HF-SCF 1
HESS hessian computation drivers
HSS1A one electron hessian integrals
HSS1B " " " "
HSS2A two electron hessian integrals 1
HSS2B " " " "
INPUTA read geometry, basis, symmetry, etc.
INPUTB " " " "
INPUTC " " " "
INT1 one electron integrals
INT2A two electron integrals 1
INT2B " " "
IOLIB input/output routines,etc. 2
LAGRAN CI Lagrangian matrix 1
LOCAL various localization methods 1
LOCCD LCD SCF localization analysis
LOCPOL LCD SCF polarizability analysis
MCCAS FOCAS/SOSCF MCSCF calculation 1
MCJAC JACOBI MCSCF calculation
MCQDPT multireference perturbation theory 1
MCQDWT weights for MR-perturbation theory
MCQUD QUAD MCSCF calculation 1
MCSCF FULLNR MCSCF calculation 1
MCTWO two electron terms for FULLNR MCSCF 1
MCPINP model core potential input
MCPINT model core potential integrals
MCPLIB model core potential library
MM23 MMCC(2,3) corrections to EOMCCSD
MOROKM Morokuma energy decomposition 1
MP2 2nd order Moller-Plesset 1
MP2DDI distributed data parallel MP2
MP2GRD CPHF and density for MP2 gradients 1
MPCDAT MOPAC parameterization
MPCGRD MOPAC gradient
MPCINT MOPAC integrals
MPCMOL MOPAC molecule setup
MPCMSC miscellaneous MOPAC routines
MTHLIB printout, matrix math utilities
NAMEIO namelist I/O simulator
NMR nuclear magnetic resonance shifts
OLIX interface code
ORDINT sort atomic integrals 1
PARLEY communicate to other programs
PCM Polarizable Continuum Model setup
PCMCAV PCM cavity creation
PCMCV2 PCM cavity for gradients
PCMDER PCM gradients
PCMDIS PCM dispersion energy
PCMPOL PCM polarizabilities
PCMVCH PCM repulsion and escaped charge
PRPEL electrostatic properties
PRPLIB miscellaneous properties
PRPPOP population properties
QEIGEN 128 bit precision RI for relativity 11
QFMM quantum fast multipole method
QMFM additional QFMM code
QMMM temporary dummy routines
QREL relativistic transformations
RHFUHF RHF, UHF, and ROHF HF-SCF 1
RXNCRD intrinsic reaction coordinate
RYSPOL roots for Rys polynomials
SCFMI molecular interaction SCF code
SCFLIB HF-SCF utility routines, DIIS code
SCRF self consistent reaction field
SOBRT full Breit-Pauli spin-orbit coupling
SOFFAC spin-orbit matrix element form factors
SOZEFF 1e- spin-orbit coupling terms
STATPT geometry and transition state finder
SURF PES scanning
SYMORB orbital symmetry assignment
SYMSLC " " "
TDHF time-dependent Hartree-Fock NLO 1
TRANS partial integral transformation 1
TRFDM2 two particle density backtransform 1
TRNSTN CI transition moments
TRUDGE nongradient optimization
UMPDDI distributed data parallel MP2
UNPORT unportable, nasty code 3,4,5,6,7,8
VECTOR vectorized version routines 10
VIBANL normal coordinate analysis
VSCF anharmonic frequencies
ZHEEV complex matrix diagonalization
ZMATRX internal coordinates

UNIX versions use the C code ZUNIX.C for dynamic memory.
Most UNIX versions use DDISOC.C to talk to TCP/IP sockets,
and DDIKICK.C to load GAMESS for execution.

The machine dependencies noted above are:
1) packing/unpacking 2) OPEN/CLOSE statements
3) machine specification 4) fix total dynamic memory
5) subroutine walkback 6) error handling calls
7) timing calls 8) LOGAND function
9) message passing calls.
DDI.SRC runs using TCP/IP socket calls (*SOC) or MPI-1
calls (*MPI), and even contains a serial code (*SEQ),
and works in conjunction with ddisoc.c and ddikick.c
when TCP/IP sockets are in use.
DDISHM.SRC implements use of SHMEM messaging.
10) vector library calls 11) REAL*16 data type

Programming Conventions

The following "rules" should be adhered to when making
changes in GAMESS. These rules are important in
maintaining portability, and should be strictly followed.

Rule 1. If there is a way to do it that works on all
computers, do it that way. Commenting out statements for
the different types of computers should be your last
resort. If it is necessary to add lines specific to your
computer, PUT IN CODE FOR ALL OTHER SUPPORTED MACHINES.
Even if you don't have access to all the types of
supported hardware, you can look at the other machine
specific examples found in GAMESS, or ask for help from
someone who does understand the various machines. If a
module does not already contain some machine specific
statements (see the above list) be especially reluctant to
introduce dependencies.

Rule 2. a) Use IMPLICIT DOUBLE PRECISION(A-H,O-Z)
specification statements throughout. b) All floating
point constants should be entered as if they were in
double precision. The constants should contain a decimal
point and a signed two digit exponent. A legal constant
is 1.234D-02. Illegal examples are 1D+00, 5.0E+00, and
3.0D-2. c) Double precision BLAS names are used
throughout, for example DDOT instead of SDOT.

The source code activator ACTVTE will
automatically convert these double
precision constructs into the correct
single precision expressions for machines
that have 64 rather than 32 bit words.

Rule 3. FORTRAN 77 allows the use of generic
functions. Thus the routine SQRT should be used in place
of DSQRT, as this will automatically be given the correct
precision by the compilers. Use ABS, COS, INT, etc. Your
compiler manual will tell you all the generic names.

Rule 4. Every routine in GAMESS begins with a card
containing the name of the module and the routine. An
example is "C*MODULE xxxxxx *DECK yyyyyy". The second
star is in column 18. Here, xxxxxx is the name of the
module, and yyyyyy is the name of the routine.
Furthermore, the individual decks yyyyyy are stored in
alphabetical order. This rule is designed to make it
easier for a person completely unfamiliar with GAMESS to
find routines. The trade off for this is that the driver
for a particular module is often found somewhere in the
middle of that module.

Rule 5. Whenever a change is made to a module, this
should be recorded at the top of the module. The
information required is the date, initials of the person
making the change, and a terse summary of the change.

Rule 6. No lower case characters, no more than 6
letter variable names, no embedded tabs, statements must
lie between columns 7 and 72, etc. In other words, old
style syntax.

* * *

The next few "rules" are not adhered to
in all sections of GAMESS. Nonetheless
they should be followed as much as
possible, whether you are writing new
code, or modifying an old section.

Rule 7. Stick to the FORTRAN naming convention for
integer (I-N) and floating point variables (A-H,O-Z). If
you've ever worked with a program that didn't obey this,
you'll understand why.

Rule 8. Always use a dynamic memory allocation
routine that calls the real routine. A good name for the
memory routine is to replace the last letter of the real
routine with the letter M for memory.
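
A minimal sketch of this pattern is shown below. The
routine names DEMOX/DEMOM are invented for this example,
but the memory pool calls (VALFM, GETFM, RETFM, and the
/FMCOM/ common block) are the ones used throughout the
GAMESS source; copy the exact usage from an existing module.

      SUBROUTINE DEMOM(N)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      COMMON /FMCOM / X(1)
C        CARVE TWO SCRATCH ARRAYS OF LENGTH N OUT OF THE
C        DYNAMIC MEMORY POOL, CALL THE REAL ROUTINE, AND
C        THEN RETURN THE MEMORY TO THE POOL.
      CALL VALFM(LOADFM)
      LA   = LOADFM + 1
      LB   = LA + N
      LAST = LB + N
      NEED = LAST - LOADFM - 1
      CALL GETFM(NEED)
      CALL DEMOX(X(LA),X(LB),N)
      CALL RETFM(NEED)
      RETURN
      END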

Rule 9. All the usual good programming techniques,
such as indented DO loops ending on CONTINUEs,
IF-THEN-ELSE where this is clearer, 3 digit statement
labels in ascending order, no three branch GO TO's,
descriptive variable names, 4 digit FORMATs, etc, etc.

The next set of rules relates to coding
practices which are necessary for the
parallel version of GAMESS to function
sensibly. They must be followed without
exception!

Rule 10. All open, rewind, and close operations on
sequential files must be performed with the subroutines
SEQOPN, SEQREW, and SEQCLO respectively. You can find
these routines in IOLIB, they are easy to use.

Rule 11. All READ and WRITE statements for the
formatted files 5, 6, 7 (variables IR, IW, IP, or named
files INPUT, OUTPUT, PUNCH) must be performed only by the
master task. Therefore, these statements must be enclosed
in "IF (MASWRK) THEN" clauses. The MASWRK variable is
found in the /PAR/ common block, and is true on the master
process only. This avoids duplicate output from the other
processes. At the present time, all other disk files in
GAMESS also obey this rule.
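
A sketch of this rule in use follows; the print unit IW
is passed in as an argument here for brevity, and the /PAR/
declaration should be copied verbatim from any existing
GAMESS routine rather than retyped from this example.

      SUBROUTINE PRTDEM(IW,E)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      LOGICAL GOPARR,DSKWRK,MASWRK
      COMMON /PAR   / ME,MASTER,NPROC,IBTYP,IPTIM,GOPARR,DSKWRK,MASWRK
C        ONLY THE MASTER PROCESS WRITES TO THE PRINT FILE,
C        SO THE OUTPUT IS NOT DUPLICATED BY EVERY PROCESS.
      IF (MASWRK) WRITE(IW,9000) E
      RETURN
 9000 FORMAT(1X,'THE DEMONSTRATION ENERGY IS',F20.10)
      END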

Rule 12. All error termination is done by means of
"CALL ABRT" rather than a STOP statement. Since this
subroutine never returns, it is OK to follow it with a
STOP statement, as compilers may not be happy without a
STOP as the final executable statement in a routine.
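
A short sketch combining rules 11 and 12 (the routine
and its message are invented purely for illustration):

      SUBROUTINE CHKDEM(N,MAXN,IW)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      LOGICAL GOPARR,DSKWRK,MASWRK
      COMMON /PAR   / ME,MASTER,NPROC,IBTYP,IPTIM,GOPARR,DSKWRK,MASWRK
C        ABORT THE WHOLE PARALLEL RUN IF A LIMIT IS EXCEEDED.
C        ABRT NEVER RETURNS, SO THE TRAILING STOP ONLY SOOTHES
C        COMPILERS THAT WANT TO SEE ONE.
      IF (N.GT.MAXN) THEN
         IF (MASWRK) WRITE(IW,9000) N,MAXN
         CALL ABRT
         STOP
      END IF
      RETURN
 9000 FORMAT(1X,'CHKDEM: N=',I8,' EXCEEDS THE LIMIT OF',I8)
      END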
Programmer’s Reference 5-30

Parallel broadcast identifiers

GAMESS uses DDI calls to pass messages between the
parallel processes. Every message is identified by a
unique number, hence the following list of how the numbers
are used at present. If you need to add to these, look at
the existing code and use the following numbers as
guidelines to make your decision. All broadcast numbers
must be between 1 and 32767.

20 : Parallel timing
100 - 199 : DICTNRY file reads
200 - 204 : Restart info from the DICTNRY file
210 - 214 : Pread
220 - 224 : PKread
225 : RAread
230 : SQread
250 - 265 : Nameio
275 - 310 : Free format
325 - 329 : $PROP group input
350 - 354 : $VEC group input
400 - 424 : $GRAD group input
425 - 449 : $HESS group input
450 - 474 : $DIPDR group input
475 - 499 : $VIB group input
500 - 599 : matrix utility routines
800 - 830 : Orbital symmetry
900 : ECP 1e- integrals
910 : 1e- integrals
920 - 975 : EFP and SCRF integrals
980 - 999 : property integrals
1000 - 1025 : SCF wavefunctions
1030 - 1041 : broadcasts in DFT
1050 : Coulomb integrals
1200 - 1215 : MP2
1300 - 1320 : localization
1495 - 1499 : reserved for Jim Shoemaker
1500 : One-electron gradients
1505 - 1599 : EFP and SCRF gradients
1600 - 1602 : Two-electron gradients
1605 - 1620 : One-electron hessians
1650 - 1665 : Two-electron hessians
1700 - 1750 : integral transformation
1800 : GUGA sorting
1850 - 1865 : GUGA CI diagonalization
1900 - 1910 : GUGA DM2 generation
2000 - 2010 : MCSCF
2100 - 2120 : coupled perturbed HF
2150 - 2200 : MCSCF hessian
2300 - 2309 : spin-orbit jobs

Disk files used by GAMESS

These files must be defined by your control language
for executing GAMESS. For example, on UNIX the "name"
field shown below should be set in the environment to the
actual file name to be used. Most runs will open only a
subset of the files shown below, with only files 5, 6, 7,
and 10 existing in every run. Only files 3, 4, 5, 6, 7,
35, and 36 contain formatted data, all others are binary
(unformatted) files.

unit  name     contents
----  ----     --------
  3   EXTBAS   external basis set library
  4   IRCDATA  archive results punched by IRC runs,
               restart data for numerical HESSIAN runs,
               summary of results for DRC and for GLOBOP.
  5   INPUT    Namelist input file. This MUST be a disk
               file, as GAMESS rewinds this file often.
  6   OUTPUT   Print output (FT06F001 on IBM mainframes)
               If not defined, UNIX systems will use the
               standard output for this file.
  7   PUNCH    Punch output. A copy of the $DATA deck,
               orbitals for every geometry calculated,
               hessian matrix, normal modes from FORCE,
               properties output, IRC restart data, etc.
  8   AOINTS   Two e- integrals in AO basis
  9   MOINTS   Two e- integrals in MO basis
 10   DICTNRY  Master dictionary, for contents see below.
 11   DRTFILE  Distinct row table file for -CI- or -MCSCF-
 12   CIVECTR  Eigenvector file for -CI- or -MCSCF-
 13   CASINTS  semi-transformed ints for FOCAS/SOSCF MCSCF;
               scratch file during spin-orbit coupling
 14   CIINTS   Sorted integrals for -CI- or -MCSCF-
 15   WORK15   GUGA loops for Hamiltonian diagonal;
               ordered two body density matrix for MCSCF;
               scratch storage during GUGA Davidson diag;
               Hessian update info during 2nd order SCF;
               [ij|ab] integrals during MP2 gradient;
               density matrices during determinant CI
 16   WORK16   GUGA loops for Hamiltonian off-diagonal;
               unordered GUGA DM2 matrix for MCSCF;
               orbital hessian during MCSCF;
               orbital hessian for analytic hessian CPHF;
               orbital hessian during MP2 gradient CPHF;
               two body density during MP2 gradient
 17   CSFSAVE  CSF data for state to state transition runs.
 18   FOCKDER  derivative Fock matrices for analytic hess
 20   DASORT   Sort file for various -MCSCF- or -CI- steps;
               also used by SCF level DIIS
 21   DFTINTS  four center overlap ints for grid-free DFT
 22   DFTGRID  mesh information for grid DFT
 23   JKFILE   shell J, K, and Fock matrices for -GVB-;
               Hessian update info during SOSCF MCSCF;
               orbital gradient and hessian for QUAD MCSCF
 24   ORDINT   sorted AO integrals;
               integral subsets during Morokuma analysis
 25   EFPIND   electric field integrals for EFP
 26   PCMDATA  gradient and D-inverse data for PCM runs
 27   PCMINTS  normal projections of PCM field gradients
 28   MLTPL    multipole moments of Gaussian basis function
               products during QFMM
 29   MLTPLT   multipole moments of FMM boxes
 30   DAFL30   direct access file for FOCAS MCSCF's DIIS;
               form factor sorting for Breit spin-orbit
 31   SOINTX   Lx 2e- integrals during spin-orbit
 32   SOINTY   Ly 2e- integrals during spin-orbit
 33   SOINTZ   Lz 2e- integrals during spin-orbit
 34   SORESC   work space for RESC symmetrization of SO ints
 35   SIMEN    energies from simulated annealing global opt.
 36   SIMCOR   coords from simulated annealing global opt.
 37   GCILIST  determinant list for general CI program
 38   HESSIAN  hessian for FMO optimisations;
               gradient for FMO gradients
 40   SOCCDAT  CSF list for SOC;
               fragment densities/orbitals for FMO
 41   AABB41   aabb spinor [ia|jb] integrals during UMP2
 42   BBAA42   bbaa spinor [ia|jb] integrals during UMP2
 43   BBBB43   bbbb spinor [ia|jb] integrals during UMP2

files 50-63 are used for MCQDPT runs.

unit  name     contents
----  ----     --------
 50   MCQD50   Direct access file for MCQDPT, its
               contents are documented in source code.
 51   MCQD51   One-body coupling constants <I/Eij/J> for
               CAS-CI and other routines
 52   MCQD52   One-body coupling constants for perturb.
 53   MCQD53   One-body coupling constants extracted
               from MCQD52
 54   MCQD54   One-body coupling constants extracted
               further from MCQD52
 55   MCQD55   Sorted 2e- AO integrals
 56   MCQD56   Half transformed 2e- integrals
 57   MCQD57   transformed 2e- integrals of (ii|ii) type
 58   MCQD58   transformed 2e- integrals of (ei|ii) type
 59   MCQD59   transformed 2e- integrals of (ei|ei) type
 60   MCQD60   2e- integral in MO basis arranged for
               perturbation calculations
 61   MCQD61   One-body coupling constants between state
               and CSF <Alpha/Eij/J>
 62   MCQD62   Two-body coupling constants between state
               and CSF <Alpha/Eij,kl/J>
 63   MCQD63   canonical Fock orbitals (FORMATTED)
 64   MCQD64   Spin functions and orbital configuration
               functions (FORMATTED)

 61   NMRINT1  derivative integrals for NMR
 ...
 66   NMRINT6     "    "    "    "
 67,68,69 for codes under development

files 70-98 are used for Coupled-Cluster runs;
all of these are direct access files.

unit name contents
---- ---- --------
70 CCREST T1 and T2 amplitudes for restarting
71 CCDIIS amplitude converger's scratch data
72 CCINTS MO integrals sorted by classes
73 CCT1AMP T1 amplitudes and some No*Nu intermediates
for MMCC(2,3)
74 CCT2AMP T2 amplitudes and some No**2 times Nu**2
intermediates for MMCC(2,3)
75 CCT3AMP M3 moments
76 CCVM No**3 times Nu - type main intermediate
77 CCVE No times Nu**3 - type main intermediate
80 EOMSTAR Initial vectors for EOMCCSD calculations
81 EOMVEC1 Iterative space for R1 components
82 EOMVEC2 Iterative space for R2 components
83 EOMHC1 Singly excited components of H-bar*R
(R - vectors from iterative space)
84 EOMHC2 Doubly excited components of H-bar*R
(R - vectors from iterative space)
85 EOMHHHH Intermediate used by EOMCCSD
86 EOMPPPP Intermediate used by EOMCCSD
87 EOMRAMP Converged EOMCCSD amplitudes
88 EOMRTMP Converged EOMCCSD amplitudes for meom=2
(if the max. no. of iterations exceeded)
89 EOMDG12 Diagonal part of H-bar
90 MMPP Elements of the diagonal part of
triples-triples part of H-bar
91 MMHPP Elements of the diagonal part of
triples-triples part of H-bar
92 MMCIVEC Converged CISD vectors
93 MMCIVC1 Converged CISD vectors for mci=2
(if the max. no. of iterations exceeded)
94 MMCIITR Iterative space in CISD calculations
95 MMNEXM No**3 times Nu - type main intermediate
96 MMNEXE No times Nu**3 - type main intermediate
97 MMNREXM No**3 times Nu - type main intermediate
98 MMNREXE No times Nu**3 - type main intermediate

Contents of the direct access file 'DICTNRY'

1. Atomic coordinates
2. various energy quantities in /ENRGYS/
3. Gradient vector
4. Hessian (force constant) matrix
5-6. not used
7. PTR - symmetry transformation for p orbitals
8. DTR - symmetry transformation for d orbitals
9. FTR - symmetry transformation for f orbitals
10. GTR - symmetry transformation for g orbitals
11. Bare nucleus Hamiltonian integrals
12. Overlap integrals
13. Kinetic energy integrals
14. Alpha Fock matrix (current)
15. Alpha orbitals
16. Alpha density matrix
17. Alpha energies or occupation numbers
18. Beta Fock matrix (current)
19. Beta orbitals
20. Beta density matrix
21. Beta energies or occupation numbers
22. Error function interpolation table
23. Old alpha Fock matrix
24. Older alpha Fock matrix
25. Oldest alpha Fock matrix
26. Old beta Fock matrix
27. Older beta Fock matrix
28. Oldest beta Fock matrix
29. Vib 0 gradient in FORCE (numerical hessian)
30. Vib 0 alpha orbitals in FORCE
31. Vib 0 beta orbitals in FORCE
32. Vib 0 alpha density matrix in FORCE
33. Vib 0 beta density matrix in FORCE
34. dipole derivative tensor in FORCE.
35. frozen core Fock operator
36. Lagrangian multipliers
37. floating point part of common block /OPTGRD/
int 38. integer part of common block /OPTGRD/
39. ZMAT of input internal coords
int 40. IZMAT of input internal coords
41. B matrix of redundant internal coords
42. not used.
43. Force constant matrix in internal coordinates.
44. SALC transformation
45. symmetry adapted Q matrix
46. S matrix for symmetry coordinates
47. ZMAT for symmetry internal coords
int 48. IZMAT for symmetry internal coords
49. B matrix
50. B inverse matrix
51. overlap matrix in Lowdin basis,
temp Fock matrix storage for ROHF
52. genuine MOPAC overlap matrix
53. MOPAC repulsion integrals
54. exchange integrals for screening
55. orbital gradient during SOSCF MCSCF
56. orbital displacement during SOSCF MCSCF
57. orbital hessian during SOSCF MCSCF
58. reserved for Pradipta
59. Coulomb integrals in Ruedenberg localizations
60. exchange integrals in Ruedenberg localizations
61. temp MO storage for GVB and ROHF-MP2
62. temp density for GVB
63. dS/dx matrix for hessians
64. dS/dy matrix for hessians
65. dS/dz matrix for hessians
66. derivative hamiltonian for OS-TCSCF hessians
67. partially formed EG and EH for hessians
68. MCSCF first order density in MO basis
69. alpha Lowdin populations
70. beta Lowdin populations
71. alpha orbitals during localization
72. beta orbitals during localization
73. alpha localization transformation
74. beta localization transformation
75. fitted EFP interfragment repulsion values
76. model core potential information
77. model core potential information
78. "Erep derivative" matrix associated with F-a terms
79. "Erep derivative" matrix associated with S-a terms
80. EFP 1-e Fock matrix including induced dipole terms
81. not used
82. MO-based Fock matrix without any EFP contributions
83. LMO centroids of charge
84. d/dx dipole velocity integrals
85. d/dy dipole velocity integrals
86. d/dz dipole velocity integrals
87. unmodified h matrix during SCRF or EFP
88. reserved for Ivana Adamovic
89. EFP multipole contribution to one e- Fock matrix
90. ECP coefficients
int 91. ECP labels
92. ECP coefficients
int 93. ECP labels
94. bare nucleus Hamiltonian during FFIELD runs
95. x dipole integrals, in AO basis
96. y dipole integrals, in AO basis
97. z dipole integrals, in AO basis
98. former coords for Schlegel geometry search
99. former gradients for Schlegel geometry search
100. not used

records 101-248 are used for NLO properties

101. U'x(0)         149. U''xx(-2w;w,w)    200. UM''xx(-w;w,0)
102.   y            150.    xy             201.    xy
103.   z            151.    xz             202.    xz
104. G'x(0)         152.    yy             203.    yz
105.   y            153.    yz             204.    yy
106.   z            154.    zz             205.    yz
107. U'x(w)         155. G''xx(-2w;w,w)    206.    zx
108.   y            156.    xy             207.    zy
109.   z            157.    xz             208.    zz
110. G'x(w)         158.    yy             209. U''xx(0;w,-w)
111.   y            159.    yz             210.    xy
112.   z            160.    zz             211.    xz
113. U'x(2w)        161. e''xx(-2w;w,w)    212.    yz
114.   y            162.    xy             213.    yy
115.   z            163.    xz             214.    yz
116. G'x(2w)        164.    yy             215.    zx
117.   y            165.    yz             216.    zy
118.   z            166.    zz             217.    zz
119. U'x(3w)        167. UM''xx(-2w;w,w)   218. G''xx(0;w,-w)
120.   y            168.    xy             219.    xy
121.   z            169.    xz             220.    xz
122. G'x(3w)        170.    yy             221.    yz
123.   y            171.    yz             222.    yy
124.   z            172.    zz             223.    yz
125. U''xx(0)       173. U''xx(-w;w,0)     224.    zx
126.    xy          174.    xy             225.    zy
127.    xz          175.    xz             226.    zz
128.    yy          176.    yz             227. e''xx(0;w,-w)
129.    yz          177.    yy             228.    xy
130.    zz          178.    yz             229.    xz
131. G''xx(0)       179.    zx             230.    yz
132.    xy          180.    zy             231.    yy
133.    xz          181.    zz             232.    yz
134.    yy          182. G''xx(-w;w,0)     233.    zx
135.    yz          183.    xy             234.    zy
136.    zz          184.    xz             235.    zz
137. e''xx(0)       185.    yz             236. UM''xx(0;w,-w)
138.    xy          186.    yy             237.    xy
139.    xz          187.    yz             238.    xz
140.    yy          188.    zx             239.    yz
141.    yz          189.    zy             240.    yy
142.    zz          190.    zz             241.    yz
143. UM''xx(0)      191. e''xx(-w;w,0)     242.    zx
144.    xy          192.    xy             243.    zy
145.    xz          193.    xz             244.    zz
146.    yy          194.    yz
147.    yz          195.    yy
148.    zz          196.    yz
                    197.    zx
                    198.    zy
                    199.    zz

245. old NLO Fock matrix
246. older NLO Fock matrix
247. oldest NLO Fock matrix
249. polarizability derivative tensor for Raman
250. transition density matrix in AO basis
251. static polarizability tensor alpha
252. X dipole integrals in MO basis
253. Y dipole integrals in MO basis
254. Z dipole integrals in MO basis
255. alpha MO symmetry labels
256. beta MO symmetry labels
257. unused
258. Vnn gradient during MCSCF hessian
259. core Hamiltonian from der.ints during MCSCF hessian
260-261. unused
262. MO symmetry labels during determinant CI
263. PCM nuclei/induced nuclear Charge operator
264. PCM electron/induced nuclear Charge operator
265. pristine guess alpha orbs (MOREAD or Huckel+INSORB)
266. EFP/PCM IFR sphere information
267. fragment LMO expansions, for EFP Pauli
268. fragment Fock operators, for EFP Pauli
269-275. not used
276. Vib 0 Q matrix in FORCE
277. Vib 0 h integrals in FORCE
278. Vib 0 S integrals in FORCE
279. Vib 0 T integrals in FORCE
280. Zero field LMOs during numerical polarizability
281. Alpha zero field dens. during num. polarizability
282. Beta zero field dens. during num. polarizability
283. zero field Fock matrix. during num. polarizability
284. reserved for Yousung Jung
286. oriented localized molecular orbitals
287. density matrix of oriented LMOs
290-299. reserved for Alex Granovsky
301. alpha Pocc during MP2 or CIS grad (see also 361-369)
302. alpha Pvir during MP2 gradient
303. alpha Wai during MP2 gradient
304. alpha Lagrangian Lai during MP2 or CI gradient
305. alpha Wocc during MP2 gradient
306. alpha Wvir during MP2 gradient
307. alpha P(MP2)-P(RHF) during MP2 or CIS gradient
308. alpha SCF density during MP2 or CIS gradient
309. alpha energy weighted density in MP2 or CIS grad
311. Supermolecule h during Morokuma
312. Supermolecule S during Morokuma
313. Monomer 1 orbitals during Morokuma
314. Monomer 2 orbitals during Morokuma
315. combined monomer orbitals during Morokuma
316. RHF density in CI grad, nonorthogonal MOs in SCF-MI
317. unzeroed Fock matrix when MOs are frozen
318. MOREAD orbitals when MOs are frozen
319. bare Hamiltonian without EFP contribution
320. MCSCF active orbital density
321. MCSCF DIIS error matrix
322. MCSCF orbital rotation indices
323. Hamiltonian matrix during QUAD MCSCF
324. MO symmetry labels during MCSCF
325. final uncanonicalized MCSCF orbitals
330. CEL matrix during PCM
331. VEF matrix during PCM
332. QEFF matrix during PCM
333. ELD matrix during PCM
340. DFT alpha Fock matrix
341. DFT beta Fock matrix
342. DFT screening integrals
343. DFT: V aux basis only
344. DFT density gradient d/dx integrals
345. DFT density gradient d/dy integrals
346. DFT density gradient d/dz integrals
347. DFT M[D] alpha density resolution in aux basis
348. DFT M[D] beta density resolution in aux basis
349. DFT orbital description
350. overlap of true and auxiliary DFT basis
351. previous iteration DFT alpha density
352. previous iteration DFT beta density
353. DFT screening matrix (true and aux basis)
354. DFT screening integrals (aux basis only)
355. h in MO basis during DDI partial transf
361-369. same as 301-309, but for beta orbitals of UMP2.
370. left transformation for pVp
371. right transformation for pVp
370. basis A (large component) during NESC
371. basis B (small component) during NESC
372. difference basis set A-B1 during NESC
373. basis N (rel. normalized large component)
374. basis B1 (small component) during NESC
375. charges of non-relativistic atoms in NESC
376. common nuclear charges for all NESC basis
377. common coordinates for all NESC basis
378. common exponent values for all NESC basis
372. left transformation for V during RESC
373. right transformation for V during RESC
374. 2T, T is kinetic energy integrals during RESC
375. pVp integrals during RESC
376. V integrals during RESC
377. Sd, overlap eigenvalues during RESC
378. V, overlap eigenvectors during RESC
379. Lz integrals
380. reserved for Ly integrals.
381. reserved for Lx integrals.
382. X, AO orthogonalisation matrix during RESC
383. Td, eigenvalues of 2T during RESC
384. U, eigenvectors of kinetic energy during RESC
385. exponents and contraction for the original basis
int 386. shell integer arrays for the original basis
387. exponents and contraction for uncontracted basis
int 388. shell integer arrays for the uncontracted basis
389. Transformation to contracted basis
390. S integrals in the internally uncontracted basis
391. charges of non-relativistic atoms in RESC
392. copy of one e- integrals in MO basis in SO-MCQDPT
393. Density average over all $MCQD groups in SO-MCQDPT
394. overlap integrals in 128 bit precision
395. kinetic energy in 128 bit precision, for relativity

In order to correctly pass data between different machine
types when running in parallel, it is required that a DAF
record contain only floating point values, or only integer
values.  No logical or Hollerith data may be stored.  The
final calling argument to DAWRIT and DAREAD must be 0 or 1
to indicate whether floating point or integer values are
involved.  The records containing integers are so marked
in the list above.
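
To make the calling convention concrete, the sketch below
writes one floating point record and one integer record and
then reads them back.  This is schematic only, not code from
the GAMESS source: the routine name DAFDEMO and the record
numbers held in NRECF and NRECI are hypothetical, and the
assumed argument order for DAWRIT/DAREAD (DAF unit number,
directory array, data array, word count, logical record
number, type flag) should be verified against the actual
library routines.  In real code the unit number IDAF and
the directory array IODA come from the DAF common block
rather than from the argument list.

*     schematic example only -- see the caveats above
      SUBROUTINE DAFDEMO(IDAF,IODA,V,IV,LEN,NRECF,NRECI)
      IMPLICIT DOUBLE PRECISION (A-H,O-Z)
      DIMENSION IODA(*),V(LEN),IV(LEN)
*        write a floating point record: final argument is 0
      CALL DAWRIT(IDAF,IODA,V,LEN,NRECF,0)
*        write an integer record: final argument is 1
      CALL DAWRIT(IDAF,IODA,IV,LEN,NRECI,1)
*        read both back, using the same logical record
*        numbers and the same type flags
      CALL DAREAD(IDAF,IODA,V,LEN,NRECF,0)
      CALL DAREAD(IDAF,IODA,IV,LEN,NRECI,1)
      RETURN
      END
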
Physical record 1 (containing the DAF directory) is
written whenever a new record is added to the file. This
is invisible to the programmer. The numbers shown above
are "logical record numbers", and are the only thing that
the programmer need be concerned with.