
An implementation of WHAM: the Weighted Histogram Analysis Method

Version 2.0.9
Alan Grossfield
Contents
1 Introduction
2 New in release 2.0.9
3 New in release 2.0.8
4 New in release 2.0.7
5 New in release 2.0.6
6 New in release 2.0.4 and 2.0.5
7 Installation
8 Command line arguments and file formats
8.1 wham
8.1.1 Command line arguments for WHAM
8.1.2 File formats
8.1.3 Output
8.2 wham-2d
8.2.1 Command line arguments
8.2.2 File formats
8.2.3 Output
9 Discussion
9.1 Periodicity
9.2 Monte Carlo Bootstrap Error Analysis
9.3 Using the code for replica exchange simulations
1 Introduction
These programs (wham and wham-2d) implement the Weighted Histogram Analysis Method of Kumar, et al. (Multidimensional free-energy calculations using the weighted histogram analysis method, J. Comput. Chem., 16:1339-1350, 1995). The code generally follows the notation used by Benoit Roux (The calculation of the potential of mean force using computer simulations, Comput. Phys. Comm., 91:275-282, 1995). Consult these papers for the theoretical background and justification for the method.
This code is available for download from my web page (http://membrane.urmc.rochester.edu/content/wham/). The code doesn't change all that often, but it's probably worth checking periodically. If you run into trouble using these programs, feel free to contact me (alan [email protected]), and I'll try to help you. This code is available under the GPL and BSD licenses, as you prefer. The exception to this licensing is a set of routines from Numerical Recipes, which are not mine to give away.
If you use this code as part of an original piece of research, I'd appreciate a reference or acknowledgment. There's no publication to reference, so please use something like:
Grossfield, A, "WHAM: an implementation of the weighted histogram analysis method", http://membrane.urmc.rochester.edu/content/wham/, version XXXX
For that matter, just letting me know what you're using my code for would be nice, although again I don't insist upon it.
Suggestions and patches are welcome.
2 New in release 2.0.9
This is a bug-fix release; there was a memory allocation error in wham-2d that caused a segfault if the number of bins in the 2nd dimension was greater than the number of windows. Caught by Bastien Loubet from the Center for Biomembrane Physics in Denmark.
3 New in release 2.0.8
Major performance improvements to wham-2d (roughly 100x faster). The final answers should be indistinguishable from previous versions. Code contributed by Nikolay Plotnikov from Stanford's Chemistry Department.
4 New in release 2.0.7
I messed up the ability to switch units from kcal to kJ: in addition to defining k_B in the header files, I also had a define in wham-2d.c. Caught by Yi Yao of UNC.
5 New in release 2.0.6
The primary new feature in version 2.0.6 is the ability to exclude regions of a 2D reaction coordinate from a calculation, which we're calling "masking". This is useful in cases where the PMF is being used to sample a pathway that is two dimensional, but where large areas of the 2D surface are either uninteresting or physically unattainable; for example, if the two reaction coordinates are RMSDs from 2 distinct structures, it is almost certainly impossible to sample the region where the RMS distance to each structure is very small. In practice, you could generally do the calculation anyway and just accept lots of NaNs and Infs floating around in the output, but sometimes that would fail, and this way is far cleaner and more robust. In this case, I chose to implement a very simple automasking procedure, such that any bin that has no data from any window is excluded. I suppose I could further generalize it, so that any bin with fewer than N points is excluded, but I haven't done that yet.
There's one other trivial but often requested feature in version 2.0.6: if you prefer to work in SI units, I've made it easier to switch energy units from kcal to kJ (the units of your reaction coordinate are, as always, up to you). It's a compile-time choice, but I figure that's fine because very few people switch units constantly. The switch is easy: before you build the code, edit wham/wham.h and wham-2d/wham-2d.h and change which version of the constant k_B is defined.
6 New in release 2.0.4 and 2.0.5
Version 2.0.5 is a trivial patch on 2.0.4, upping the length of lines allowed when reading files.
Release 2.0.4 has seen major revision to the bootstrap error analysis. I've known for a while that the way I computed the uncertainty in the free energy was suboptimal, since it just extrapolated from the uncertainty in the probability using the assumption that the fluctuations were Gaussian distributed. I knew this wasn't a great assumption, but Michael Shirts showed me data showing just how bad it was. So, in this version, we're doing something different. For 1D wham, we're directly computing the fluctuations in the PMF at each bin, aligning the PMFs such that their partition functions are all 1; this amounts to shifting them so that their Boltzmann-weighted free energies are the same. I think the errors make more sense now.
For 2D wham, I think doing this analysis uncovered some other flaws in how the error analysis is done, ones that, to be honest, I'm not totally sure how to fix. For now, I'm simply removing the option to do bootstrapping in 2D. I hope to put it back once I solve the problem, in which case I'll make another release.
Many thanks to Michael Shirts, for helpful discussions and contributing much of the code changes that went into this release.
7 Installation
Untarring wham.tgz will create a directory wham/, which in turn contains several directories (wham/ wham-2d/ doc/ nr/).
Before you build the code, you need to choose your units for energy. If you want kcal/mol, do nothing. If you prefer to work in kJ, you need to edit two files, wham/wham.h and wham-2d/wham-2d.h. In each case, near the top of the file you'll see a pair of lines that look like
#define k_B 0.001982923700 // Boltzmann's constant in kcal/mol K
//#define k_B 0.0083144621 // Boltzmann's constant kJ/mol-K
To get kJ, you simply remove the // from the second line and put it before the first line, and save the file. In principle, you can do just wham or just wham-2d, but I recommend against it for your own sanity.
To build the standard 1-D wham code on a machine which has gcc, you should
cd wham
make clean
make
(which will delete all object files and executables, and build the wham executable). By default, the Makefile uses the gcc compiler, but I also have flags present for the Intel compiler, plus the native compilers for Irix and Tru64. The latter two have not been tested recently, but ought to work, since at various times this code has been used successfully on various flavors of Linux, MacOS X, AIX, Irix, and Tru64. If you find that you need to do anything special to make it work on your particular system, I'd appreciate it if you could let me know so I can add to the Makefile.
To build the 2-D version, say
cd wham-2d
make clean
make
Several other directories are also created (doc/, which you presumably found because you're reading this, and nr/, which contains a couple of files from Numerical Recipes). You don't need to do anything with these directories.
8 Command line arguments and file formats
To get a listing of the command line arguments for either wham or wham-2d, just run the command without any arguments. Optional arguments are included in brackets. Both programs will echo their command line into the output file, to help you figure out what you did.
8.1 wham
8.1.1 Command line arguments for WHAM
wham [P|Ppi|Pval] hist_min hist_max num_bins tol temperature numpad \
metadatafile freefile [num_MC_trials randSeed]
The first (optional) argument specifies the periodicity of the reaction coordinate. For a nonperiodic reaction coordinate (a distance, for example), it should be left out. P means that the reaction coordinate has a periodicity of 360, appropriate for angles. Ppi specifies a periodicity of 2*pi, appropriate for angles measured in radians. Pval specifies periodicity of some arbitrary amount, val, which should be an integer or floating point number. For example, P180.0 would be appropriate for an angle with twofold symmetry.
hist_min and hist_max specify the boundaries of the histogram. As a rule, all data points outside the range (hist_min, hist_max) are silently ignored. The only exception is that if an entire trajectory is outside the range, the program halts with an error message. The solution is to remove that file from the metadata file. hist_min and hist_max should be floating point numbers.
num_bins specifies the number of bins in the histogram, and as a result the number of points in the final PMF. It should be an integer.
tol is the convergence tolerance for the WHAM calculations. Specifically, the WHAM iteration is considered to be converged when no F_i value for any simulation window changes by more than tol on consecutive iterations. As the program runs, it prints the average change in the F values for the most recent iteration. Obviously, this number will be smaller than tol before the computation converges, because convergence is triggered by the largest change as opposed to the average.
temperature is a floating point number representing the temperature in Kelvin at which the weighted histogram calculation is performed. This does not have to be the temperature at which the simulations were performed (see below for discussion).
numpad specifies the number of padding values that should be printed for periodic PMFs. This number should be set to 0 for aperiodic reaction coordinates. It doesn't actually affect the calculation in any way. Rather, it just alters the final printout of the free energy, to make plotting of periodic reaction coordinates simpler. This is more important for wham-2d than wham.
metadatafile specifies the name of the metadata file. The format of this file is described below.
freefile is the name used for the file containing the final PMF and probability distribution.
num_MC_trials and randSeed are both related to the performance of Monte Carlo bootstrap error analysis. If these values are not supplied, error analysis is not performed. num_MC_trials should be an integer specifying the number of fake data sets which should be generated. randSeed is an integer which controls the random number seed; the value you pick should be irrelevant, but I let the user set it primarily for debugging purposes.
8.1.2 File formats
Each line of the metadata file should be blank, begin with a "#" (marking a comment), or have the following format:
/path/to/timeseries/file loc_win_min spring [correl time] [temperature]
This first field is the name of one of the time series files (more on this in a moment). The second field, loc_win_min, is the location of the minimum of the biasing potential for this simulation, a floating point number. The third field, spring, is the spring constant for the biasing potential used in this simulation, assuming the biasing potential is of the form
V = (1/2) k (x - x_0)^2    (1)
Many simulation packages, including TINKER, AMBER, and CHARMM, do not include the 1/2 when they specify spring constants for their restraint terms. This is a common source of error (I'd love to change my code to match the other packages' behavior, but then experienced users who don't read the manual would get messed up). Also, the units for the spring constant must match those for the time series. So, if your time series is a distance recorded in Angstroms, the spring constant must be in kcal/mol-A^2. AMBER users should take care when using angular restraints: the specification and output of angles is in degrees, but AMBER's spring constants use kcal/mol-rad^2.
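To make the factor-of-1/2 convention concrete, here is a small Python sketch (the function names are illustrative, not part of the wham code): since wham applies V = (1/2) k (x - x0)^2 while a package that omits the 1/2 applies V = k (x - x0)^2, the package's constant must be doubled before it goes in the metadata file.

```python
def wham_bias(x, x0, k):
    """Biasing potential in the convention wham expects: V = (1/2) k (x - x0)^2."""
    return 0.5 * k * (x - x0) ** 2

def spring_for_metadata(k_package):
    """Double a restraint constant taken from a package whose bias is k (x - x0)^2."""
    return 2.0 * k_package

# A 10 kcal/mol-A^2 no-1/2-convention constant, displaced 0.5 A from the minimum:
print(wham_bias(1.5, 1.0, spring_for_metadata(10.0)))  # -> 2.5
```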
The fourth argument (correl_time) specifies the decorrelation time for your time series, in units of time steps. It is only used when generating fake data sets for Monte Carlo bootstrap error analysis, where it modulates the number of points per fake data set. This argument is optional, and is ignored if you don't do error analysis. If you're doing multiple temperatures but not bootstrapping, set it to any integer value as a placeholder, and it'll be ignored. See section 9.2 for more discussion about how to do bootstrapping.
Finally, the last (optional) field is the temperature for this simulation. If not supplied, the temperature specified on the command line is used. In the present version of the code, you must either leave the temperature unspecified for all simulations or specify it for all simulations.
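As an illustration of the metadata layout, a file for a set of umbrella windows could be generated like this; the window spacing, file names, and spring constant here are hypothetical, not values the manual prescribes.

```python
# Hypothetical setup: umbrella windows every 0.5 A from 2.0 to 5.0 A,
# with a spring constant of 20 kcal/mol-A^2 in wham's 1/2 k (x - x0)^2 convention.
windows = [2.0 + 0.5 * i for i in range(7)]
lines = ["# time series file   loc_win_min   spring"]
for i, x0 in enumerate(windows):
    lines.append(f"window_{i}.dat {x0:.1f} 20.0")
print("\n".join(lines))
```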
The time series files must follow one of two formats, depending on whether the temperature was specified in the metadata file. If no temperature was specified, the file should contain two columns, where the first is the time (which isn't actually used), and the second is the position of the system along the reaction coordinate. Both numbers should be in floating point format. Lines beginning with "#" are ignored as comments. Additional columns of data are ignored.
If the simulation temperature is specified, there must be a third column of data, containing the system's potential energy at that time point. It should be a floating point value.
8.1.3 Output
The first line of the output file echoes the command line. The next line or two contain comments describing the periodicity used and the number of simulation windows present. While the calculation is running, it will print out lines that look like the following:
#Iteration 10: 0.106019
#Iteration 20: 0.062269
#Iteration 30: 0.039890
#Iteration 40: 0.027003
These specify the current iteration number and the average change in the F values for the current iteration. This number is not used for deciding when the calculation has converged; rather, the maximum change, as opposed to the average, is used.
Every 100 iterations, the current version of the PMF is dumped into the output file. These lines look like
-178.000000 0.014212 4909.138943
-174.000000 0.062631 4525.390035
-170.000000 0.227076 3432.434076
-166.000000 0.494262 2190.487110
-162.000000 0.817734 1271.708620
The first column is the value of the reaction coordinate, the second is the value of the PMF, and the third is the unnormalized probability distribution.
Once the calculation has converged, wham will produce output resembling
# Dumping simulation biases, in the metadata file order
# Window F (free energy units)
# 0 0.000004
# 1 -4.166136
# 2 -3.241052
# 3 -4.475215
# 4 -6.324340
# 5 -7.128731
These are the final F values from the wham calculation, and can be used for computing weighted averages for properties other than the free energy.
You may have noticed that all of the lines except the free energies are preceded by "#". This allows you to check convergence of your wham calculation by simply plotting the output file in gnuplot. If the free energy curves have stopped changing, your tolerance is small enough.
If you specified a nonzero number of Monte Carlo bootstrap error analysis trials, you will see lines that resemble
#MC trial 0: 990 iterations
#MC trial 1: 973 iterations
#MC trial 2: 970 iterations
#MC trial 3: 981 iterations
#MC trial 4: 984 iterations
at the end of the file.
The free energy data file is written when the calculation converges, and resembles:
#Coor Free +/- Prob +/-
-178.000000 0.014386 0.000098 0.106389 0.000017
-174.000000 0.068560 0.000151 0.097128 0.000025
-170.000000 0.250825 0.000350 0.071496 0.000042
-166.000000 0.523786 0.000294 0.045186 0.000022
The first column is the value of the reaction coordinate, the second is the free energy. The third is the statistical uncertainty of the free energy (which is only meaningful if you performed Monte Carlo bootstrapping). The fourth and fifth columns are the probability and its associated statistical uncertainty. Again, the latter is only meaningful if bootstrapping is performed. See section 9.2 for further discussion of error estimation.
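Since every non-data line in the free energy file starts with "#", the file is easy to post-process. A minimal Python reader, assuming the five-column layout above (this helper is not part of the wham distribution), might look like:

```python
def read_freefile(path):
    """Parse a wham 1-D free energy file into tuples of
    (coordinate, free energy, free err, probability, prob err)."""
    rows = []
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip the header and any comment lines
            fields = line.split()
            rows.append(tuple(float(v) for v in fields[:5]))
    return rows
```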
8.2 wham-2d
8.2.1 Command line arguments
wham-2d Px[=0|pi|val] hist_min_x hist_max_x num_bins_x \
Py[=0|pi|val] hist_min_y hist_max_y num_bins_y \
tol temperature numpad metadatafile freefile \
use_mask
The command line arguments largely have the same meaning as they do
for the one dimensional wham program.
The periodicity arguments are not optional.
Px by itself indicates that the first dimension of the reaction coordinate has a period of 360. Px=0 turns off periodicity. Px=pi specifies a period of 2*pi, and Px=val allows you to choose an arbitrary value for the period.
hist_min_x, hist_max_x, and num_bins_x behave exactly like hist_min, hist_max, and num_bins do in the 1 dimensional program.
Py, hist_min_y, etc., behave the same as Px, hist_min_x, etc., except they control the second coordinate of the PMF.
use_mask expects an integer value; if its value is non-zero, it turns on the automasking feature, which causes bins for which there is no sample data to be excluded from the wham calculation.
8.2.2 File formats
As with regular 1 dimensional wham, each line of the metadata file should either be blank, begin with a "#", or have the following format
/path/to/timeseries/file loc_win_x loc_win_y spring_x spring_y [correl time] [temp]
This first field is the name of one of the time series files. loc_win_x and loc_win_y are the locations of the minimum of the biasing terms in the first and second dimensions of the reaction coordinate. spring_x and spring_y are the spring constants used for the biasing potential in this simulation, assuming the biasing potential is of the form
V = (1/2) (k_x (x - x_0)^2 + k_y (y - y_0)^2)    (2)
The sixth argument (correl_time) specifies the decorrelation time for your time series, in units of time steps. It is only used when generating fake data sets for Monte Carlo bootstrap error analysis, where it modulates the number of points per fake data set. This argument is optional, and is ignored if you don't do error analysis. If you're doing multiple temperatures but not bootstrapping, set it to any integer value as a placeholder, and it'll be ignored. See section 9.2 for more discussion about how to do bootstrapping.
Finally, the last field is the temperature at which this simulation was run. If not supplied, the temperature specified on the command line is used. In the present version of the code, you must either leave the temperature unspecified for all simulations or specify it for all simulations.
The time series files must follow one of two formats, depending on whether the temperature was specified in the metadata file. If no temperature was specified, the file should contain three columns, where the first is the time (which isn't actually used), and the second and third are the position of the system along the x and y reaction coordinates, respectively. All numbers should be in floating point format. Lines beginning with "#" are ignored as comments. Additional columns of data are ignored.
If the simulation temperature is specified, there must be a fourth column of data, containing the system's potential energy at that time point. It should be a floating point value. See the section on replica exchange for more details.
8.2.3 Output
The output largely resembles that for wham, except with more columns. The first line echoes the command line, followed by a specification of the periodicity and the number of windows. The iteration lines have the same meaning. When the current value for the PMF is dumped, the format looks like
-172.500000 -172.500000 1.968750 15.394489
-172.500000 -157.500000 2.574512 5.522757
-172.500000 -142.500000 3.147538 2.094142
-172.500000 -127.500000 3.505869 1.141952
where the first two columns are the values of the first and second dimensions of the reaction coordinate, the third column is the PMF, and the last column is the unnormalized probability.
Once the calculation has converged, wham will produce output resembling
# Dumping simulation biases, in the metadata file order
# Window F (free energy units)
#0 -0.000004
#1 -0.156869
#2 -0.534845
#3 -2.445469
These are the final F values from the wham calculation, and can be used for computing weighted averages for properties other than the free energy.
If you specified a nonzero number of Monte Carlo bootstrap error analysis trials, you will see lines that resemble
#MC trial 0: 990 iterations
#MC trial 1: 973 iterations
#MC trial 2: 970 iterations
#MC trial 3: 981 iterations
#MC trial 4: 984 iterations
at the end of the file.
The free energy data file is written when the calculation converges, and resembles:
-232.500000 -232.500000 4.812986 0.003185 0.000001 0.000000
-232.500000 -217.500000 4.830312 0.003741 0.000001 0.000000
-232.500000 -202.500000 4.898622 0.001009 0.000000 0.000000
The first two columns are the locations along the first and second dimensions of the reaction coordinate. The third is the free energy, while the fourth is the statistical uncertainty in the free energy. The fifth and sixth columns are the normalized probability and its statistical uncertainty. The two uncertainty columns will be zero if you did not use Monte Carlo bootstrapping.
9 Discussion
9.1 Periodicity
Use of periodic boundary conditions only changes one thing in the code: when calculating the biasing potential from a simulation window for a specific bin in the histogram (the w_j(X_i) values in Equation 8 of Roux's paper, cited above), the minimum image convention is applied. Thus, for a window with the biasing potential centered at 175 degrees, the distance to the bin at -175 is 10 degrees, not 350 degrees.
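The minimum image convention described above can be sketched in a few lines of Python; this is an illustration of the idea, not the program's actual C routine:

```python
def min_image_delta(x, x0, period=360.0):
    """Minimum-image separation between a histogram bin center x and a
    window center x0 on a periodic coordinate."""
    d = x - x0
    while d > 0.5 * period:    # wrap down into (-period/2, period/2]
        d -= period
    while d <= -0.5 * period:  # wrap up into the same range
        d += period
    return d

print(min_image_delta(-175.0, 175.0))  # -> 10.0, not -350.0
```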
The numpad argument on the command line is useful primarily for periodic reaction coordinates. It specifies a number of additional windows to be prepended and appended to the final output, such that the periodicity is explicitly visible in the free energy. So, if a calculation was done using 360 degree periodicity, 36 windows, with the reaction coordinate ranging from -180 to 180, and numpad=5, a total of 46 values would be output, from -225 to +225. The numpad value has no effect at all on the values computed for the PMF and probability.
9.2 Monte Carlo Bootstrap Error Analysis
The premise of bootstrapping error analysis is fairly straightforward. For a time series containing N points, choose a set of N points at random, allowing duplication. Compute the average from this fake data set. Repeat this procedure a number of times and compute the standard deviation of the average of the fake data sets. This standard deviation is an estimate for the statistical uncertainty of the average computed using the real data. What this technique really measures is the heterogeneity of the data set, relative to the number of points present. For a large enough number of points, the average value computed using the faked data will be very close to the value with the real data, with the result that the standard deviation will be low. If you have relatively few points, the deviation will be high. The technique is quite robust, easy to implement, and correctly accounts for time correlations in the data. Numerical Recipes has a good discussion of the basic logic of this technique. For a more detailed discussion, see An Introduction to the Bootstrap, by Efron and Tibshirani (Chapman and Hall/CRC, 1994). Please note: bootstrapping can only characterize the data you have. If your data is missing contributions from important regions of phase space, bootstrapping will not help you figure this out.
In principle, the standard bootstrap technique could be applied directly to WHAM calculations. One could generate a fake data set for each time series, perform WHAM iterations, and repeat the calculation many times. However, this would be inefficient, since it would either involve a) generating many time series in the file system, or b) storing the time series in memory. Neither of these strategies is particularly satisfying, the former because it involves generating a large number of files and the latter because it would consume very large amounts of memory. My implementation of WHAM is very memory efficient because not only does it not store the time series, it doesn't even store the whole histogram of that time series, but rather just the nonzero portion.
However, there is a more efficient alternative. The principle behind bootstrapping is that you're trying to establish the variance of averages calculated with N points sampling the true distribution function, using your current N points of data as an estimate of the true distribution. The histogram of each time series is precisely that, an estimate of the probability distribution. So, all we have to do is pick random numbers from the distribution defined by that histogram. Once again, Numerical Recipes shows us how to do it: we compute the normalized cumulant function, c(x), generate a random number R between 0 and 1, and solve c(x) = R for x. Thus, a single Monte Carlo trial is computed in the following manner:
1. For each simulation window, use the computed cumulant of the histogram to generate a new histogram, with the same number of points.
2. Perform WHAM iterations on the set of generated histograms.
3. Store the average normalized probability and free energy, and their squares, for each bin in the histogram.
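The histogram-resampling step can be sketched in Python; this is an illustration of the inverse-cumulant idea described above, not the program's actual C implementation:

```python
import random

def resample_histogram(counts):
    """Draw one bootstrap histogram with the same total number of points,
    sampling bins from the normalized cumulant of the input histogram."""
    n = sum(counts)
    cumulant, running = [], 0
    for c in counts:
        running += c
        cumulant.append(running / n)   # normalized cumulative distribution c(x)
    new_counts = [0] * len(counts)
    for _ in range(n):
        r = random.random()            # R uniform on [0, 1)
        for i, c in enumerate(cumulant):
            if r < c:                  # solve c(x) = R for the bin index
                new_counts[i] += 1
                break
    return new_counts
```

Note that a bin with zero counts in the real histogram can never receive counts in the resampled one, which is consistent with bootstrapping only characterizing the data you actually have.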
There's a subtlety to how you compute fluctuations in the free energy estimates, since the potential of mean force is only defined up to a constant. I have chosen to align the PMFs by computing them from the normalized probabilities, which is effectively the same as setting the Boltzmann-averaged free energies equal. This is a somewhat arbitrary choice (for example, one could also set the unweighted averages equal), but it seems reasonable. If you want something bulletproof, use the probabilities and their associated fluctuations, which don't have this problem.
The situation is slightly more complicated when one attempts to apply the bootstrap procedure in two dimensions, because the cumulant is not uniquely defined. My approach is to flatten the two dimensional histogram into a 1 dimensional distribution, and take the cumulant of that. The rest of the procedure is the same as in the 1-D case. In release 2.0.4, the option to do 2D bootstrapping has been commented out. I'm not sure if there's a programming problem, or implementing the better way of doing the 1D case simply revealed a deeper problem, but 2D bootstrapping is currently broken.
There is one major caveat throughout all of this analysis: thus far, we have assumed that the correlation time in the time series is shorter than the snapshot interval. To put it another way, we've assumed that all of the data points are statistically independent. However, this is unlikely to be the case in a typical molecular dynamics setting, which means that the sample size used in the Monte Carlo bootstrapping procedure is too large, which in turn causes the bootstrapping procedure to underestimate the statistical uncertainty.
My code deals with this by allowing you to set the correlation time for each time series used in the analysis, in effect reducing the number of points used in generating the fake data sets (see section 8.1.2). For instance, if a time series had 1000 points, and you determined by other means that the correlation time was 10x the time interval for the time series, then you would set correl_time to 10, and each fake data set would have 100 points instead of 1000. If the value is unset or is greater than the number of data points, then the full number of data points is used. Please note that the actual time values in the time series are not used in any way in this analysis; for purposes of specifying the correlation time, the interval between consecutive points is always considered to be 1.
The question of how to determine the correlation time is in some sense beyond the scope of this document. In principle, one could simply compute the autocorrelation function for each time series; if the autocorrelation is well approximated by a single exponential, then 2x the decay time (the time it takes the autocorrelation to drop to 1/e) would be a good choice. If it's multiexponential, then you'd use the longest time constant. However, be careful: you really want to use the longest correlation time sampled in the trajectory, and the reaction coordinate may fluctuate rapidly but still be coupled to slower modes.
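As a rough illustration of the procedure above (not part of the wham package), the 1/e decay time of a series' autocorrelation can be estimated like this:

```python
import math

def autocorr(series, lag):
    """Normalized autocorrelation of a time series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0.0:
        return 1.0  # a constant series is perfectly correlated with itself
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var

def decay_time(series):
    """Smallest lag at which the autocorrelation drops below 1/e."""
    for lag in range(1, len(series)):
        if autocorr(series, lag) < 1.0 / math.e:
            return lag
    return len(series)
```

Keep in mind the caveat above: a short 1/e decay time from this kind of estimate does not rule out coupling to slower modes that the trajectory barely samples.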
It is important to note that the present version of the code uses the correlation times only for the error analysis and not for the actual PMF calculation. This isn't likely to be an issue, as the raw PMFs aren't that sensitive to the correlation times unless they vary by factors of 10 or more.
9.3 Using the code for replica exchange simulations
One major application for the ability to combine simulations run at different temperatures is the analysis of replica exchange simulations, and if the email I've gotten over the last couple of years is any indication, it's a pretty common one. My code can be used for replica exchange, but I should start by admitting that it wasn't designed with it in mind, and may seem a bit clumsy.
First, the metadata file format has changed as of the November, 2007 release of the code. If you want to specify temperatures in the metadata file, you also have to specify the number of Monte Carlo points to use (if you're not using bootstrapping, you can safely set this to any integer). See section 8.1.2 for details.
In order to use wham with time series collected at different temperatures, the first thing to do is to follow the instructions given in section 8.1.2 regarding the format of the metadata and time series files, while setting the spring constants to 0. Indeed, for simple circumstances involving small systems this may be enough for you to make a successful calculation.
However, for large systems this simple approach will almost certainly get you nothing but a bunch of NaNs in your output. If this happens, the most likely candidate is either an overflow or underflow in the probability histograms. The reason is that the temperature-sensitive version of the code increments the histogram by exp(-E/k_B T) for each point (as opposed to counting each point as 1). Since the potential energies for condensed-phase molecular dynamics systems using standard force fields are typically of order -50,000 kcal/mol, this means we'd be taking the exponential of a very large number, which is a Bad Thing numerically.
However, in many circumstances one can work around this easily, by shifting the location of zero energy. The simplest procedure is to locate the lowest energy in any of the trajectories, and shift all of the energies in all of the trajectories such that the lowest (most negative) value is now zero. This will eliminate the overflows, since the largest contribution from an individual data point will now be 1.
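The shift described above is a one-liner in practice; here is a Python sketch with hypothetical energy values (your own trajectories would come from your simulation output, not these literals):

```python
# Hypothetical potential energy time series, in kcal/mol, one list per trajectory.
trajectories = [
    [-50123.4, -50117.9, -50130.2],
    [-50098.7, -50104.1, -50092.5],
]

# Shift every energy so the most negative value across all trajectories is zero;
# the largest per-point Boltzmann factor exp(-E/kT) then becomes exp(0) = 1.
e_min = min(min(t) for t in trajectories)
shifted = [[e - e_min for e in t] for t in trajectories]

print(e_min)                         # -> -50130.2
print(min(min(t) for t in shifted))  # -> 0.0
```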
However, shifting the energies upward can lead to a different set of problems, where a given simulation appears to have no probability associated with it, e.g. the sum of exp(-E/k_B T) for the trajectory underflows and is effectively zero. This can occur if the energies in the simulation are significantly higher than those in the lowest energy trajectories, which is expected for condensed phase systems at high temperatures. Underflow in itself isn't a problem, but if that simulation is the only one which contributes to a bin in the histogram (or more generally if all of the simulations which sample a given bin have zero overall weight), the result will be a division by zero, causing the probability to be NaN or Inf.
Solving this problem is sometimes quite simple: reshift the energies by a few kcal/mol, such that the lowest energy is moderately small instead of zero (say -5 kcal/mol). If the problem is just numerical underflow, a small shift may be sufficient to make the problem numerically well-behaved. However, if the relevant portion of the histogram really is inaccessible except at high temperature, then there may be no way to fix the problem, short of running an additional umbrella-sampled trajectory.