A Geometric Approach To Direct Minimization
To cite this article: TROY VAN VOORHIS & MARTIN HEAD-GORDON (2002) A geometric
approach to direct minimization, Molecular Physics, 100:11, 1713-1721, DOI:
10.1080/00268970110103642
The approach presented, geometric direct minimization (GDM), is derived from purely geometrical arguments, and is designed to minimize a function of a set of orthonormal orbitals. The optimization steps consist of sequential unitary transformations of the orbitals, and convergence is accelerated using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approach in the iterative subspace, together with a diagonal approximation to the Hessian for the remaining degrees of freedom. The approach is tested by implementing the solution of the self-consistent field (SCF) equations and comparing results with the standard direct inversion in the iterative subspace (DIIS) method. It is found that GDM is very robust and converges in every system studied, including several cases in which DIIS fails to find a solution. For main group compounds, GDM convergence is nearly as rapid as DIIS, whereas for transition metal-containing systems we find that GDM is significantly slower than DIIS. A hybrid procedure where DIIS is used for the first several iterations and GDM is used thereafter is found to provide a robust solution for transition metal-containing systems.
* Author for correspondence. e-mail: [email protected]

In SCF theory, the objects of fundamental interest are the orthonormal orbitals
variations in U must be skew-symmetric. Thus, in this important case, the vertical vectors are defined by their ov block alone,

    V = ( 0        V_ov )
        ( -V_ov^T  0    ).                                   (11)

The vertical tangent vectors allow us to create geodesics. A geodesic is the shortest path in our curved manifold that connects two given points [20]. It is also the 'straightest' curve available on the surface, since additional curvature would tend to lengthen the path. Hence, on a curved surface, one discards the notion of a straight line and replaces it with that of a geodesic. For SCF the geodesics depend only on the vertical vectors; horizontal moves do not affect the energy, and thus any transformation in the horizontal space will tend to unnecessarily lengthen the path to the minimum. Given a vertical vector V, it may be shown [14] that the geodesic initially tangent to V may be written explicitly as

    U(V) = e^V.                                              (12)

Therefore, a given vertical tangent vector V defines the initial direction of a unique geodesic. If we define X = V_ov V_ov^T and Y = V_ov^T V_ov, it is easily verified that [21]

    U(V) = ( cos X^{1/2}                  X^{-1/2} sin X^{1/2} V_ov )
           ( V_vo X^{-1/2} sin X^{1/2}    cos Y^{1/2}               ).   (13)

This allows the geodesic to be evaluated in O(N^3) time, with the key step being the diagonalizations of X and Y. We can now formulate steepest descent on the Grassmann manifold:

(1) Obtain an initial set of (orthogonal) orbital coefficients C_0.
(2) Compute the gradient,

    G = dE/dV = (dE/dU)(dU/dV),

    evaluated with the current set of orbitals. This is a vertical vector in the tangent space.
(3) Minimize the energy along the geodesic defined by G. That is, minimize

    E(y) = E(C_i e^{yG})                                     (14)

    as a function of y. Let y_0 denote the optimal value of y.
(4) Update the orbitals using

    C_{i+1} = C_i e^{y_0 G}.                                 (15)

(5) If convergence has not been achieved, return to step 2.

Note that this requires one to perform a sequence of unitary transformations rather than attempting to write the final set of orbital coefficients as a single unitary transformation of the initial set. That is, the final set of orbitals is written

    C = C_0 e^{A_0} e^{A_1} e^{A_2} ... e^{A_n},             (16)

where each A_i is a scaled gradient, rather than

    C = C_0 e^A,                                             (17)

where A must be determined. Clearly the former approach can be viewed as a special case of the latter where the orbitals are 're-set' at each iteration. This amounts to shifting the origin of our reference frame to the current set of orbitals as opposed to referencing it to the arbitrary initial orbitals.
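To make steps (1)-(5) concrete, the sketch below assembles the skew-symmetric generator from an ov gradient block, exponentiates it, and performs a crude backtracking search along the resulting geodesic. It is a minimal illustration rather than the implementation used in this work: the routines energy(C) and grad_ov(C) are assumed to be supplied by the underlying SCF code, real orbitals are assumed, and scipy.linalg.expm stands in for the closed form of equation (13).

    import numpy as np
    from scipy.linalg import expm

    def geodesic(C, V_ov, step):
        # Move the orbitals C along the geodesic generated by the vertical vector
        # whose ov block is V_ov, i.e. apply the unitary U = exp(step * V).
        n_occ, n_virt = V_ov.shape
        V = np.zeros((n_occ + n_virt, n_occ + n_virt))
        V[:n_occ, n_occ:] = V_ov
        V[n_occ:, :n_occ] = -V_ov.T              # skew-symmetric, so exp(V) is orthogonal
        return C @ expm(step * V)                # equation (13) gives this exponential in closed form

    def steepest_descent(C, energy, grad_ov, tol=1e-6, max_iter=200):
        # Steps (1)-(5): follow the geodesic defined by the gradient, with a simple
        # backtracking search standing in for the one-dimensional minimization over y.
        for _ in range(max_iter):
            G_ov = grad_ov(C)                    # ov block of the gradient (a vertical vector)
            if np.max(np.abs(G_ov)) < tol:
                break                            # step (5): converged
            E0, y = energy(C), 1.0
            while y > 1e-8:
                C_trial = geodesic(C, -G_ov, y)  # move downhill along the geodesic
                if energy(C_trial) < E0:
                    C = C_trial                  # step (4): the orbitals are 're-set' here
                    break
                y *= 0.5
        return C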
2.3. Approximate Hessian approach

The use of the gradient as the step direction, as in steepest descent, is far from optimal in practice; to improve upon this step, one must utilize second-derivative information. Approximate Hessian methods accomplish this by employing the Newton-Raphson-like step

    S = -BG,                                                 (18)

where G is the gradient and B is an approximation to the inverse Hessian constructed from vectors collected in previous iterations. In a curved manifold, one needs to be sure that the previous vectors are contained in the current tangent space before constructing the approximate Hessian; otherwise the resulting step direction might not be tangent to our surface, making it impossible to construct the relevant geodesic. It is at this point that the formulation of 're-setting' the orbitals at each iteration becomes useful. Specifically, since re-setting the orbitals amounts to setting U = 1 at the beginning of each iteration, we have at every iteration that the tangent space is of the form of equation (11), and therefore the tangent vectors from previous iterations are guaranteed to be in the tangent space at the current point. If we had chosen a fixed frame, the tangent spaces could not possibly be the same, because this would imply that our surface was flat.

Now, the fact that the tangent spaces of subsequent iterations are the same does not necessarily imply that individual vectors in these spaces may be identified; there could be some significant internal shuffling of tangent vectors between iterations that does not affect the space they span. Rigorously, one should obtain vectors in the current frame by transporting the previous vectors along the relevant geodesic, making sure to keep the angle between each vector and the geodesic fixed. Such a process is called parallel transport, and we may denote the vertical vector V after parallel transport by TV. For a Grassmann manifold, Edelman et al. [14] showed that the parallel transport of V along the geodesic generated by A is given by

    TV = e^A V.                                              (19)

However, the unitary transformation e^A is absorbed into the definition of C when we re-set our orbitals, and therefore the parallel transported vector TV in the rotated frame is identical to the original vector V. We stress that this is a special feature of the Grassmann manifold in particular, and of methods with invariant subspaces in general. In situations with no invariant subspaces, such simplifications do not occur and parallel transport must be accounted for explicitly. However, for the Grassmann case, setting U = 1 every iteration successfully rotates the frame of reference so that the new tangent space in the new frame is identical to the old tangent space in the old frame, and parallel transport is unnecessary.

One may now apply the BFGS update scheme [22], in which the approximate inverse Hessian for the (i+1)th iteration can be written as

    B_{i+1} = B_i + [(SG + GBG)/SG^2] S_i S_i^T - (1/SG)(S_i δG_{i+1}^T B_i + B_i δG_{i+1} S_i^T),   (20)

where δG_{i+1} is the change in the gradient vector from the previous iteration and the intermediates

    SG = S_i · δG_{i+1}

and

    GBG = δG_{i+1} · B_i · δG_{i+1}                          (21)

have been used. The BFGS Hessian has two nice properties that make it well suited to minimization problems. First, it minimizes a quadratic potential with the minimum number of gradient evaluations, and therefore is expected to have superlinear convergence. Second, the BFGS Hessian is positive definite. Hence, given the choice between reducing the energy in one direction and reducing the gradient in another, the BFGS prescription will tend to preferentially reduce the energy, which clearly is desirable when dealing with a minimization problem.

Storing the BFGS Hessian is not difficult, because one needs only to compute it in the subspace spanned by gradients and steps from previous iterations. This is easily accomplished by orthonormalizing the vectors from previous iterations and expanding everything in terms of these orthogonal basis vectors [8]. Since the size of the subspace is not expected to depend strongly on the size of the system, the construction and storage of the approximate Hessian is never prohibitive.

Unfortunately, BFGS by itself does not accelerate convergence enough to be competitive with DIIS. This is because the orbital rotation space has many dimensions and therefore BFGS requires many iterations in order to build up enough information about the Hessian. It would be ideal if we could incorporate an approximate Hessian for the degrees of freedom not spanned by previous iterations and simply update this Hessian using BFGS as more and more degrees of freedom are explored. One way to do this is to diagonalize the oo and vv blocks of the Fock matrix. The approximate Hessian can then be chosen to be diagonal and equal to the difference in orbital eigenvalues {ε} at the current iteration [8,9,23]:

    B_{ia,jb} = 2(ε_a - ε_i) δ_{ij} δ_{ab},                  (22)

where i, j and a, b represent occupied and virtual orbital indices, respectively. However, following the work of Bacskay [11,12], we note that the optimal diagonal Hessian actually contains an energy shift,

    B_{ia,jb} = 2(ε_a - ε_i + δE) δ_{ij} δ_{ab}.             (23)

Within Bacskay's quadratic procedure, the value of this shift is equal to the change in energy that results from a Newton-Raphson step using the shifted Hessian. Since the energy that would be obtained after application of the Hessian clearly cannot be determined before the Hessian has been constructed, it must be estimated in practice. We find that the energy change from the previous iteration is a reasonable estimator, and this choice actually greatly improves the rate and stability of convergence in the early iterations. In order to interface this approximate Hessian with the BFGS scheme, we note that before any iterations have occurred the BFGS Hessian is the unit matrix. Following [8], we can transform to the set of coordinates where our diagonal Hessian (23) is also the unit matrix,

    Ṽ_{ia} = [2(ε_a - ε_i + δE)]^{1/2} V_{ia},               (24)

which we will call the energy-weighted coordinates (EWCs). The BFGS prescription (20) can then be applied in terms of these coordinates.

To compute the EWCs (equation (24)) one must find the orbitals that diagonalize the oo and vv blocks of the Fock matrix: the pseudo-canonical orbitals. Since this does not affect the energy, the relevant transformation can be written as a step in the horizontal space,

    C = C e^H;    H = ( H_oo  0    )
                      ( 0     H_vv ).                        (25)
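As an illustration of how the diagonal Hessian (23) and the BFGS machinery interact, the sketch below scales the gradient into the energy-weighted coordinates of equation (24) and then applies the standard limited-memory two-loop recursion, starting from the unit matrix, as a compact stand-in for the explicit update of equation (20). The pseudo-canonical orbital energies eps_occ and eps_virt, the gradient block G_ov, and the stored step/gradient-change pairs are assumed to be provided by the surrounding SCF driver, and the shift dE would be taken as the previous iteration's energy change, as described above.

    import numpy as np

    def ewc_scale(eps_occ, eps_virt, dE):
        # Square root of the diagonal Hessian 2*(e_a - e_i + dE) of equation (23),
        # which maps orbital-rotation coordinates into the energy-weighted coordinates.
        gap = 2.0 * (eps_virt[None, :] - eps_occ[:, None] + dE)
        return np.sqrt(np.maximum(gap, 1e-8))    # guard against tiny or negative gaps

    def bfgs_step_ewc(G_ov, eps_occ, eps_virt, dE, s_list, y_list):
        # Return an orbital-rotation step (ov block). s_list and y_list hold previous
        # steps and gradient changes, already expressed in energy-weighted coordinates.
        w = ewc_scale(eps_occ, eps_virt, dE)
        q = (G_ov / w).ravel()                   # gradient in energy-weighted coordinates
        # Two-loop recursion: implicit product of the BFGS inverse Hessian with the
        # gradient, starting from the unit matrix (the diagonal Hessian is 1 in EWCs).
        alphas = []
        for s, y in zip(reversed(s_list), reversed(y_list)):
            a = np.dot(s, q) / np.dot(y, s)
            alphas.append(a)
            q -= a * y
        for s, y, a in zip(s_list, y_list, reversed(alphas)):
            b = np.dot(y, q) / np.dot(y, s)
            q += (a - b) * s
        step_ewc = -q                            # S = -BG, equation (18), in the EWCs
        return step_ewc.reshape(G_ov.shape) / w  # back-transform to ordinary coordinates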
The propensity of GDM to find local minima is beneficial in many cases. It often happens that the lowest energy solution for a correlated calculation such as MP2 or CCSD is not the same as for the HF solution. In these cases, it is highly desirable to have a method that can be induced to land consistently on this solution. Similarly, when one considers energies collected at several different geometries, as might occur in a geometry optimization or during a molecular dynamics simulation, it is usually understood that one wishes to consider the same state at all geometries, regardless of whether it is the global minimum for the current structure. In both these cases, GDM would seem to be the preferable alternative.

Sometimes, of course, convergence to the global minimum is exactly what is desired. In these cases, a robust approach can be formulated by performing DIIS extrapolation until the RMS gradient is below, say, 0.01. In this scheme, DIIS effectively acts to pre-converge the orbitals that are input into the GDM procedure. We present the results for this approach under the heading DIIS-GDM in table 1, and it is seen that the hybrid approach retains the rapid convergence of both DIIS and GDM for these cases, combining the robustness of the GDM algorithm with the ability of DIIS to find the lowest energy solution.

3.2. Transition metal complexes

Transition metal systems present an interesting challenge for SCF convergence, since often there are extremely large numbers of low energy critical points that an algorithm must sort through in order to arrive at a suitable minimum. This makes it almost impossible for a convergence algorithm to consistently pick out the global minimum from among the swarm of candidates. This can be aided by tailoring the initial guess to treat a battery of organometallic systems well [30], and so our task for these cases is focused on discovering an algorithm that converges quickly and consistently to a minimum, with the presumption that the convergence can be shunted towards the global minimum by suitable adjustment of the initial guess.

To see how well GDM deals with the abundance of critical points and near-degeneracies in these cases, we have run calculations on the first-row transition metal carbonyl (MCO+) and dicarbonyl (M(CO)2+) cations of [31] and the first-row transition metal-methylene cations (MCH2+) of [32]. Our calculations employed the 6-31G* basis [33] and the GWH guess [29]. Since we make no attempt here to include relativistic corrections, we use the non-relativistic optimized geometries [31,32].

A statistical summary of the convergence rates is presented in table 2, showing that GDM is significantly slower than DIIS for these cases, although it still succeeds in converging in all cases where DIIS fails. The convergence of GDM is slowed for these molecules due to the presence of saddle points. With GDM, often one observes a rapid decrease in step size as a saddle point is approached, which means it takes very many small steps to traverse a saddle point and continue with the minimization. It is not completely clear why DIIS has no such difficulty; in some cases the problem is moot because DIIS simply converges to the saddle point, which clearly is undesirable, but in other cases DIIS seems to avoid these problem areas.

Table 2. Convergence statistics for the 27 first row transition metal complexes MCO+, M(CO)2+ and MCH2+.

Method            Average iterations   Maximum iterations   Local minima   Did not converge
DIIS/HF                 33.1                  101                 8                4
DIIS/B3LYP              26.1                   58                 9                2
GDM/HF                  88.6                  216                13                1
GDM/B3LYP               17.5                  170                 7                0
DIIS-GDM/HF             30.8                   56                 3                0
DIIS-GDM/B3LYP          31.8                  104                 3                0

For transition metal complexes, the robust convergence of GDM can be combined with DIIS's ability to deal with saddle points using the hybrid DIIS-GDM approach. As can be seen in table 2, DIIS-GDM converges at essentially the same rate as DIIS, and the robustness of GDM is completely retained. Further, it is interesting to note that DIIS-GDM tends, on average, to converge to even lower energy solutions than either DIIS or GDM individually. This would seem to result from the fact that running a few DIIS iterations effectively supplies GDM with an improved initial guess that lies within the basin of attraction of a different (and more reliable) minimum.
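The hybrid scheme itself requires only a small amount of driver logic: monitor the RMS gradient and hand control from DIIS to GDM once it drops below the threshold quoted above. The following sketch assumes hypothetical diis_step(C) and gdm_step(C) routines, each returning updated orbital coefficients together with the current gradient; it is meant only to show the control flow, not the actual implementation.

    import numpy as np

    def scf_hybrid(C, diis_step, gdm_step, switch_thresh=0.01,
                   conv_thresh=1e-6, max_iter=200):
        # Run DIIS until the RMS gradient falls below switch_thresh, then hand the
        # pre-converged orbitals to GDM and iterate to full convergence.
        use_diis = True
        for _ in range(max_iter):
            C, G = diis_step(C) if use_diis else gdm_step(C)
            rms = np.sqrt(np.mean(G ** 2))
            if rms < conv_thresh:
                return C                          # converged
            if use_diis and rms < switch_thresh:
                use_diis = False                  # switch to GDM for the rest of the run
        raise RuntimeError("SCF did not converge within max_iter iterations")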
3.3. Difficult cases

It is instructive to present in detail those cases where DIIS fails to converge. The results of GDM and DIIS-GDM optimization for these systems are listed in table 3. These systems show the same general features that have been observed previously. GDM converges rapidly, except for one case (B3LYP for ScCO+) where the procedure encounters a saddle point between the initial guess and the final result. Running DIIS for the first several iterations pushes the energy below the offending saddle point and convergence is then much more rapid. In other cases, the rate of convergence of the hybrid procedure is comparable with GDM. These systems illustrate the tendency of DIIS-GDM to find lower energy solutions than GDM alone, as the hybrid approach finds a lower solution for five of the seven cases. It is important to recognize that many of these cases could also have been converged by using standard 'tricks of the trade' in conjunction with DIIS, for example, by employing a level shift or damping the DIIS iterations. However, we think the GDM approach is preferable to these techniques because it presents a unified solution to these problems.

Table 3. Convergence information for molecules where DIIS failed to converge (energies in E_h).

                          GDM                      DIIS-GDM
Molecule           Iterations    Energy       Iterations    Energy
CH/B3LYP               22        -38.4941         23         -38.4941
OH/B3LYP               27        -75.7624         22         -75.7624
NiCH2+/HF              54      -1545.2996         33       -1545.2996
NiCH2+/B3LYP           29      -1547.0374         33       -1547.0561
CoCO+/HF               31      -1493.6052         52       -1493.6643
NiCO+/HF               34      -1619.0480         49       -1619.0866
ScCO+/B3LYP           101       -873.6381         61        -873.6831
Fe(CO)2+/HF            32      -1487.3737         58       -1487.4712

4. Discussion

We have presented no estimates of the cost of this approach relative to DIIS. GDM is designed for the intermediate size regime where the cost is dominated by Fock builds rather than matrix manipulations, and therefore such considerations are not relevant. This covers the vast majority of computations done today. For extremely large molecules, linear scaling methods are a necessity, and density matrix-based treatments are expected to be much more efficient due to the exponential decay of the density for non-conducting systems. Much of the present algorithm could be rephrased readily in terms of the effects of the unitary transformation on the one-particle density matrix rather than the orbitals. Then, the locality of the density matrix in the atomic orbital basis could be exploited to perform matrix multiplications in a linear scaling fashion [34]. The only obstacle to this approach is that the orbital Hessian (23) requires diagonalization of the oo and vv blocks of the Fock matrix, and there is no well defined prescription for diagonalizing sparse matrices in linear time. If an alternative approximate Hessian could be obtained that did not require pseudo-canonical orbitals, the current algorithm could be translated very readily into a linear scaling density matrix-based scheme, very similar in spirit to that of [34].

There is one significant alteration that would need to be made in order to apply this approach to ROHF optimization or to active space correlation methods like CASSCF [19]. In both of these cases, instead of having two invariant subspaces (occupied and virtual), one has multiple invariant subspaces. In the case of ROHF, one has doubly occupied, singly occupied and unoccupied subspaces that are invariant to rotations within themselves. In the case of CASSCF [19], one has inactive occupied (o), inactive virtual (v) and active (a) subspaces. Considering the CASSCF case, the logical generalization of updating the orbitals by sequential ov rotations,

    C_{i+1} = C e^{A_ov},                                    (32)

is to perform a sequence of three rotations at each step,

    C_{i+1} = C e^{A_ov} e^{A_oa} e^{A_av},                  (33)

where the order of the rotations is arbitrary but should be the same at each iteration. Each of these three rotation classes could then be extrapolated sequentially using manipulations that are completely analogous to those presented here. For example, after the ov rotation has been performed, the vectors from previous iterations could be translated to the new coordinate frame using

    Ṽ = e^{A_ov} V e^{-A_ov}.                                (34)

The BFGS and approximate orbital Hessians could then be applied in exactly the same way, as has been done in [9]. The ability to perform an efficient search without having to resort to explicit calculation of the Hessian is extremely important for optimized orbital coupled cluster methods [18,7,6] where, as for SCF, the formation and inversion of the full Hessian is expensive.

5. Conclusions

In this work we have used basic principles of differential geometry to formulate a new approach to energy minimization for methods that depend on a set of orthonormal orbitals. The resulting approach, GDM, is competitive with DIIS for many systems and capable of converging many 'problem cases' where DIIS fails to find a solution. It is satisfying that this can be achieved by a rigorous geometric argument that does not resort to heuristic damping factors or level shifts.

For systems where DIIS converges, GDM shows a mild tendency to converge to higher energy solutions than DIIS. This happens primarily when the initial guess is far from the global minimum, and it is argued that this is a desirable characteristic in many situations. However, in case the GDM solution is not acceptable, we also present a hybrid DIIS-GDM approach that regularly gives energies that are as low as the DIIS solution but that retains the robustness of GDM. The hybrid procedure also tends to accelerate convergence significantly in cases where GDM is slow to converge.
It is clear that the source of the robustness of these procedures is the adherence to the direct minimization strategy, which requires the energy to go down at every iteration. Whereas an extrapolation technique such as DIIS can oscillate between two fixed points, this is not allowed in a direct minimization, since one of the two points must be higher in energy and thus one half of the oscillation must involve an uphill step.

The GDM strategy has many potential applications due to the prevalence of the orbital optimization problem in electronic structure theory. One interesting direction involves the simultaneous updating of nuclear and electronic degrees of freedom [8,23], which would have clear relevance for Car-Parrinello molecular dynamics [35], where the size of the timestep is determined mainly by the radius of reliable extrapolation of the Kohn-Sham orbitals along the trajectory. Since the current algorithm is simply the geometric generalization of a trajectory in the presence of an orthogonality constraint, it stands to reason that the success of this method for single-point problems could be extended readily to deal with molecular dynamics simulations.

Another interesting application of these principles would be to test similar approaches for active space correlation models [6, 19] that involve only a minor extension of the formulae presented here. Previous work [9] suggests that this avenue should be fruitful.

This research was supported by a grant from the National Science Foundation (CHE-9981997).

References

[1] PULAY, P., 1980, Chem. Phys. Lett., 73, 393.
[2] PULAY, P., 1982, J. Comput. Chem., 3, 556.
[3] HAMILTON, T. P., and PULAY, P., 1986, J. chem. Phys., 84, 5728.
[4] VAN LENTHE, J. H., VERBEEK, J., and PULAY, P., 1991, Molec. Phys., 73, 1159.
[5] MULLER, R. P., et al., 1994, J. chem. Phys., 100, 1226.
[6] KRYLOV, A. I., SHERRILL, C. D., BYRD, E. F. C., and HEAD-GORDON, M., 1998, J. chem. Phys., 109, 10669.
[7] SHERRILL, C. D., KRYLOV, A. I., BYRD, E. F. C., and HEAD-GORDON, M., 1998, J. chem. Phys., 109, 4171.
[8] HEAD-GORDON, M., and POPLE, J. A., 1988, J. phys. Chem., 92, 3063.
[9] CHABAN, G., SCHMIDT, M. W., and GORDON, M. S., 1997, Theoret. Chim. Acta, 97, 88.
[10] CANCÈS, E., and LE BRIS, C., 2000, Intl. J. Quantum Chem., 79, 82.
[11] BACSKAY, G. B., 1981, Chem. Phys., 61, 385.
[12] BACSKAY, G. B., 1982, Chem. Phys., 65, 383.
[13] SANO, T., and I'HAYA, Y. J., 1991, J. chem. Phys., 95, 6607.
[14] EDELMAN, A., ARIAS, T. A., and SMITH, S., 1998, SIAM J. Matrix Anal. Applic., 20, 303.
[15] ROOTHAAN, C. C. J., 1951, Rev. Mod. Phys., 23, 69.
[16] BOBROWICZ, F. B., and GODDARD, W. A., 1977, Methods of Electronic Structure Theory, Vol. 3, edited by H. F. Schaefer III (New York: Plenum Press).
[17] KRIEGER, J. B., LI, Y., and IAFRATE, G. J., 1992, Phys. Rev. A, 46, 5453.
[18] SCUSERIA, G. E., and SCHAEFER, H. F., 1987, Chem. Phys. Lett., 142, 354.
[19] ROOS, B. O., TAYLOR, P. R., and SIEGBAHN, P. E. M., 1980, Chem. Phys., 48, 157.
[20] DO CARMO, M. P., 1976, Differential Geometry of Curves and Surfaces (Prentice-Hall).
[21] HUTTER, J., PARRINELLO, M., and VOGEL, S., 1994, J. chem. Phys., 101, 3862.
[22] The BFGS method is discussed, e.g., in the popular Numerical Recipes books available at www.nr.com.
[23] FISCHER, T. H., and ALMLÖF, J., 1992, J. phys. Chem., 96, 9768.
[24] KONG, J., WHITE, C. A., KRYLOV, A. I., SHERRILL, C. D., ADAMSON, R. D., FURLANI, T. R., LEE, M. S., LEE, A. M., GWALTNEY, S. R., ADAMS, T. R., DACHSEL, H., ZHANG, W., KORAMBATH, P. P., OCHSENFELD, C., GILBERT, A. T. B., KEDZIORA, G. S., MAURICE, D. R., NAIR, N., SHAO, Y., BESLEY, N. A., MASLEN, P. E., DOMBROSKI, J. P., BAKER, J., BYRD, E. F. C., VAN VOORHIS, T., OUMI, M., HIRATA, S., HSU, C.-P., ISHIKAWA, N., FLORIAN, J., WARSHEL, A., JOHNSON, B. G., GILL, P. M. W., HEAD-GORDON, M., and POPLE, J. A., 2000, Q-Chem 2.0: a high performance ab initio electronic structure program package, J. comput. Chem., 21, 1532.
[25] CURTISS, L. A., et al., 1991, J. chem. Phys., 94, 7221.
[26] KRISHNAN, R., BINKLEY, J. S., SEEGER, R., and POPLE, J. A., 1980, J. chem. Phys., 72, 650.
[27] CLARK, T., CHANDRASEKHAR, J., and SCHLEYER, P. v. R., 1983, J. Comput. Chem., 4, 294.
[28] DUPUIS, M., and KING, H. F., 1977, Intl. J. Quantum Chem., 11, 613.
[29] BECKE, A. D., 1993, J. chem. Phys., 98, 5648.
[30] VACEK, G., PERRY, J. K., and LANGLOIS, J.-M., 1999, Chem. Phys. Lett., 310, 189.
[31] BARNES, L. A., ROSI, M., and BAUSCHLICHER, JR., C. W., 1990, J. chem. Phys., 93, 609.
[32] BAUSCHLICHER, JR., C. W., et al., 1992, J. phys. Chem., 96, 6969.
[33] HARIHARAN, P. C., and POPLE, J. A., 1973, Theoret. Chim. Acta, 28, 213.
[34] HELGAKER, T., LARSEN, H., OLSEN, J., and JØRGENSEN, P., 2000, Chem. Phys. Lett., 327, 397.
[35] CAR, R., and PARRINELLO, M., 1985, Phys. Rev. Lett., 55, 2471.