Å². AMBER users should take care when using angular restraints: the specification and output of angles is in degrees, but AMBER's spring constants use kcal/mol-rad².
The fourth argument (correl time) specifies the decorrelation time for your time series, in units of time steps. It is only used when generating fake data sets for Monte Carlo bootstrap error analysis, where it modulates the number of points per fake data set. This argument is optional, and is ignored if you don't do error analysis. If you're doing multiple temperatures but not bootstrapping, set it to any integer value as a placeholder, and it'll be ignored. See section 9.2 for more discussion about how to do bootstrapping.
Finally, the last (optional) field is the temperature for this simulation. If not supplied, the temperature specified on the command line is used. In the present version of the code, you must either leave the temperature unspecified for all simulations or specify it for all simulations.
The time series files must follow one of two formats, depending on whether the temperature was specified in the metadata file. If no temperature was specified, the file should contain two columns, where the first is the time (which isn't actually used), and the second is the position of the system along the reaction coordinate. Both numbers should be in floating point format. Lines beginning with # are ignored as comments. Additional columns of data are ignored.
If the simulation temperature is specified, there must be a third column of data, containing the system's potential energy at that time point. It should be a floating point value.
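For illustration, here is what the first few lines of each format might look like (the numbers are invented, and the time column is arbitrary):
# two-column format: time, reaction coordinate
0.0  -178.3
1.0  -176.9
2.0  -177.5
# three-column format: time, reaction coordinate, potential energy
0.0  -178.3  -50234.7
1.0  -176.9  -50241.2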
8.1.3 Output
The first line of the output file echoes the command line. The next line or two contain comments describing the periodicity used and the number of simulation windows present. While the calculation is running, it will print out lines that look like the following:
#Iteration 10: 0.106019
#Iteration 20: 0.062269
#Iteration 30: 0.039890
#Iteration 40: 0.027003
This specifies the current iteration number and the average change in the F values for the current iteration. This number is not the one used to decide when the calculation has converged; convergence is based on the maximum change, not the average.
Every 100 iterations, the current version of the PMF is dumped into the output file. These lines look like
-178.000000 0.014212 4909.138943
-174.000000 0.062631 4525.390035
-170.000000 0.227076 3432.434076
-166.000000 0.494262 2190.487110
-162.000000 0.817734 1271.708620
The first column is the value of the reaction coordinate, the second is the value of the PMF, and the third is the unnormalized probability distribution.
Once the calculation has converged, wham will produce output resembling
# Dumping simulation biases, in the metadata file order
# Window F (free energy units)
# 0 0.000004
# 1 -4.166136
# 2 -3.241052
# 3 -4.475215
# 4 -6.324340
# 5 -7.128731
These are the final F values from the wham calculation, and can be used for computing weighted averages for properties other than the free energy.
You may have noticed that all of the lines except the free energies are preceded by #. This allows you to check convergence of your wham calculation by simply plotting the output file in gnuplot. If the free energy curves have stopped changing, your tolerance is small enough.
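For example, assuming the output was saved to a file named wham.out (a name chosen here purely for illustration), a gnuplot command like
plot "wham.out" using 1:2 with lines
will draw the successive PMF estimates, since gnuplot treats the #-prefixed lines as comments. If the curves lie on top of each other, the calculation has converged.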
If you specified a nonzero number of Monte Carlo bootstrap error analysis trials, you will see lines that resemble
#MC trial 0: 990 iterations
#MC trial 1: 973 iterations
#MC trial 2: 970 iterations
#MC trial 3: 981 iterations
#MC trial 4: 984 iterations
at the end of the file.
The free energy data file is written when the calculation converges, and resembles:
#Coor Free +/- Prob +/-
-178.000000 0.014386 0.000098 0.106389 0.000017
-174.000000 0.068560 0.000151 0.097128 0.000025
-170.000000 0.250825 0.000350 0.071496 0.000042
-166.000000 0.523786 0.000294 0.045186 0.000022
The first column is the value of the reaction coordinate, and the second is the free energy. The third is the statistical uncertainty of the free energy (which is only meaningful if you performed Monte Carlo bootstrapping). The fourth and fifth columns are the probability and its associated statistical uncertainty. Again, the latter is only meaningful if bootstrapping is performed. See section 9.2 for further discussion of error estimation.
8.2 wham-2d
8.2.1 Command line arguments
wham-2d Px[=0|pi|val] hist_min_x hist_max_x num_bins_x \
Py[=0|pi|val] hist_min_y hist_max_y num_bins_y \
tol temperature numpad metadatafile freefile \
use_mask
The command line arguments largely have the same meaning as they do for the one-dimensional wham program.
The periodicity arguments are not optional. Px by itself indicates that the first dimension of the reaction coordinate has a period of 360. Px=0 turns off periodicity, Px=pi specifies a period of 2*pi, and Px=val allows you to choose an arbitrary value for the period.
hist_min_x, hist_max_x, and num_bins_x behave exactly like hist_min, hist_max, and num_bins do in the one-dimensional program. Py, hist_min_y, etc., behave the same as Px, hist_min_x, etc., except they control the second coordinate of the PMF.
use_mask expects an integer value; if its value is non-zero, it turns on the automasking feature, which causes bins for which there is no sample data to be excluded from the wham calculation.
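As an illustration (all values here are invented), a run on two periodic dihedral coordinates, each binned from -180 to 180 in 36 bins, with a tolerance of 0.00001, at 300 K, no padding, and automasking enabled, might look like
wham-2d Px -180 180 36 Py -180 180 36 0.00001 300 0 metadata.dat pmf-2d.dat 1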
8.2.2 File formats
As with regular one-dimensional wham, each line of the metadata file should either be blank, begin with a #, or have the following format
/path/to/timeseries/file loc_win_x loc_win_y spring_x spring_y [correl time] [temp]
The first field is the name of one of the time series files. loc_win_x and loc_win_y are the locations of the minimum of the biasing terms in the first and second dimensions of the reaction coordinate. spring_x and spring_y are the spring constants used for the biasing potential in this simulation, assuming the biasing potential is of the form
V = (1/2) [ k_x (x - x_0)^2 + k_y (y - y_0)^2 ]    (2)
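For instance (with invented values), a window whose biases are centered at (-180, -180) with spring constants of 0.05 in each dimension would be listed as
/path/to/window_0.dat -180.0 -180.0 0.05 0.05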
The sixth argument (correl time) specifies the decorrelation time for your time series, in units of time steps. It is only used when generating fake data sets for Monte Carlo bootstrap error analysis, where it modulates the number of points per fake data set. This argument is optional, and is ignored if you don't do error analysis. If you're doing multiple temperatures but not bootstrapping, set it to any integer value as a placeholder, and it'll be ignored. See section 9.2 for more discussion about how to do bootstrapping.
Finally, the last field is the temperature at which this simulation was run. If not supplied, the temperature specified on the command line is used. In the present version of the code, you must either leave the temperature unspecified for all simulations or specify it for all simulations.
The time series files must follow one of two formats, depending on whether the temperature was specified in the metadata file. If no temperature was specified, the file should contain three columns, where the first is the time (which isn't actually used), and the second and third are the position of the system along the x and y reaction coordinates, respectively. All values should be in floating point format. Lines beginning with # are ignored as comments. Additional columns of data are ignored.
If the simulation temperature is specified, there must be a fourth column of data, containing the system's potential energy at that time point. It should be a floating point value. See the section on replica exchange for more details.
8.2.3 Output
The output largely resembles that for wham, except with more columns. The first line echoes the command line, followed by a specification of the periodicity and the number of windows. The iteration lines have the same meaning. When the current value of the PMF is dumped, the format looks like
-172.500000 -172.500000 1.968750 15.394489
-172.500000 -157.500000 2.574512 5.522757
-172.500000 -142.500000 3.147538 2.094142
-172.500000 -127.500000 3.505869 1.141952
where the first two columns are the values of the first and second dimensions of the reaction coordinate, the third column is the PMF, and the last column is the unnormalized probability.
Once the calculation has converged, wham will produce output resembling
# Dumping simulation biases, in the metadata file order
# Window F (free energy units)
#0 -0.000004
#1 -0.156869
#2 -0.534845
#3 -2.445469
These are the final F values from the wham calculation, and can be used for computing weighted averages for properties other than the free energy.
If you specified a nonzero number of Monte Carlo bootstrap error analysis trials, you will see lines that resemble
#MC trial 0: 990 iterations
#MC trial 1: 973 iterations
#MC trial 2: 970 iterations
#MC trial 3: 981 iterations
#MC trial 4: 984 iterations
at the end of the file.
The free energy data file is written when the calculation converges, and resembles:
-232.500000 -232.500000 4.812986 0.003185 0.000001 0.000000
-232.500000 -217.500000 4.830312 0.003741 0.000001 0.000000
-232.500000 -202.500000 4.898622 0.001009 0.000000 0.000000
The first two columns are the locations along the first and second dimensions of the reaction coordinate. The third is the free energy, while the fourth is the statistical uncertainty in the free energy. The fifth and sixth columns are the normalized probability and its statistical uncertainty. The two uncertainty columns will be zero if you did not use Monte Carlo bootstrapping.
9 Discussion
9.1 Periodicity
Use of periodic boundary conditions only changes one thing in the code: when calculating the biasing potential from a simulation window for a specific bin in the histogram (the w_j(X_i) values in Equation 8 of Roux's paper, cited above), the minimum image convention is applied. Thus, for a window with the biasing potential centered at 175 degrees, the distance to the bin at -175 is 10 degrees, not 350 degrees.
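The minimum image convention amounts to wrapping the difference between the bin position and the window center back into half a period. A minimal sketch of the idea in Python (illustrative only, not the actual wham source):
def min_image_delta(x, x0, period=360.0):
    # Wrap the difference x - x0 into [-period/2, period/2]
    d = x - x0
    d -= period * round(d / period)
    return d

# e.g. for the window at 175 degrees and the bin at -175 degrees,
# min_image_delta(-175.0, 175.0) returns 10.0, not -350.0;
# the biasing potential is then 0.5 * k * d**2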
The numpad argument on the command line is useful primarily for periodic reaction coordinates. It specifies a number of additional windows to be prepended and appended to the final output, such that the periodicity is explicitly visible in the free energy. So, if a calculation was done using 360 degree periodicity, 36 windows, with the reaction coordinate ranging from -180 to 180, and numpad=5, a total of 46 values would be output, from -225 to +225. The numpad value has no effect at all on the values computed for the PMF and probability.
9.2 Monte Carlo Bootstrap Error Analysis
The premise of bootstrapping error analysis is fairly straightforward. For a time series containing N points, choose a set of N points at random, allowing duplication. Compute the average from this fake data set. Repeat this procedure a number of times and compute the standard deviation of the averages of the fake data sets. This standard deviation is an estimate of the statistical uncertainty of the average computed using the real data. What this technique really measures is the heterogeneity of the data set, relative to the number of points present. For a large enough number of points, the average value computed using the faked data will be very close to the value from the real data, with the result that the standard deviation will be low. If you have relatively few points, the deviation will be high. The technique is quite robust, easy to implement, and correctly accounts for time correlations in the data. Numerical Recipes has a good discussion of the basic logic of this technique. For a more detailed discussion, see An Introduction to the Bootstrap, by Efron and Tibshirani (Chapman and Hall/CRC, 1994). Please note: bootstrapping can only characterize the data you have. If your data is missing contributions from important regions of phase space, bootstrapping will not help you figure this out.
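As a concrete illustration of the procedure just described, here is a minimal NumPy sketch (not part of wham itself) that bootstraps the uncertainty of a simple average:
import numpy as np

def bootstrap_std(data, n_trials=200, seed=0):
    # Standard deviation of the averages over fake data sets
    # drawn from the real data with replacement
    rng = np.random.default_rng(seed)
    n = len(data)
    means = [rng.choice(data, size=n, replace=True).mean()
             for _ in range(n_trials)]
    return np.std(means)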
In principle, the standard bootstrap technique could be applied directly to WHAM calculations. One could generate a fake data set for each time series, perform WHAM iterations, and repeat the calculation many times. However, this would be inefficient, since it would either involve a) generating many time series in the file system, or b) storing the time series in memory. Neither of these strategies is particularly satisfying, the former because it involves generating a large number of files and the latter because it would consume very large amounts of memory. My implementation of WHAM is very memory efficient because not only does it not store the time series, it doesn't even store the whole histogram of that time series, but rather just the nonzero portion.
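The space saving comes from recording only the span of bins that actually contain data, plus an offset. Schematically (a sketch of the idea, not the actual data structure used in the code):
import numpy as np

def trim_histogram(counts):
    # Keep only the nonzero span of the histogram and its starting index
    nonzero = np.nonzero(counts)[0]
    first, last = nonzero[0], nonzero[-1]
    return first, counts[first:last + 1]

# e.g. trim_histogram(np.array([0, 0, 3, 5, 1, 0])) -> (2, array([3, 5, 1]))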
However, there is a more efficient alternative. The principle behind bootstrapping is that you're trying to estimate the variance of averages calculated with N points sampling the true distribution function, using your current N points of data as an estimate of the true distribution. The histogram of each time series is precisely that: an estimate of the probability distribution. So, all we have to do is pick random numbers from the distribution defined by that histogram. Once again, Numerical Recipes shows us how to do it: we compute the normalized cumulant function c(x), generate a random number R between 0 and 1, and solve c(x) = R for x. Thus, a single Monte Carlo trial is computed in the following manner:
1. For each simulation window, use the computed cumulant of the histogram to generate a new histogram with the same number of points.
2. Perform WHAM iterations on the set of generated histograms.
3. Store the average normalized probability and free energy, and their squares, for each bin in the histogram.
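A minimal Python sketch of step 1 (a sketch of the technique, not the code's actual implementation):
import numpy as np

def bootstrap_from_histogram(counts, n_points, rng):
    # Normalized cumulant c(x) of the histogram
    cumulant = np.cumsum(counts) / np.sum(counts)
    # Draw R uniformly on [0, 1) and solve c(x) = R for the bin index
    r = rng.random(n_points)
    bins = np.searchsorted(cumulant, r)
    # Histogram the sampled bin indices to form the fake histogram
    return np.bincount(bins, minlength=len(counts))

# example: rng = np.random.default_rng(0)
#          fake = bootstrap_from_histogram(np.array([5, 0, 3, 2]), 10, rng)
Here n_points would normally equal the number of data points in the window, or fewer if a correlation time is set (see below).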
There's a subtlety to how you compute fluctuations in the free energy estimates, since the potential of mean force is only defined up to a constant. I have chosen to align the PMFs by computing them from the normalized probabilities, which is effectively the same as setting the Boltzmann-averaged free energies equal. This is a somewhat arbitrary choice (for example, one could also set the unweighted averages equal), but it seems reasonable. If you want something bulletproof, use the probabilities and their associated fluctuations, which don't have this problem.
The situation is slightly more complicated when one attempts to apply the bootstrap procedure in two dimensions, because the cumulant is not uniquely defined. My approach is to flatten the two-dimensional histogram into a one-dimensional distribution, and take the cumulant of that. The rest of the procedure is the same as in the 1-D case. In release 2.0.4, the option to do 2-D bootstrapping has been commented out. I'm not sure if there's a programming problem, or whether implementing the better way of doing the 1-D case simply revealed a deeper problem, but 2-D bootstrapping is currently broken.
There is one major caveat throughout all of this analysis: thus far, we have assumed that the correlation time of the time series is shorter than the snapshot interval. To put it another way, we've assumed that all of the data points are statistically independent. However, this is unlikely to be the case in a typical molecular dynamics setting, which means that the sample size used in the Monte Carlo bootstrapping procedure is too large, which in turn causes the bootstrapping procedure to underestimate the statistical uncertainty.
My code deals with this by allowing you to set the correlation time for each time series used in the analysis, in effect reducing the number of points used in generating the fake data sets (see section 8.1.2). For instance, if a time series had 1000 points, and you determined by other means that the correlation time was 10x the time interval for the time series, then you would set correl time to 10, and each fake data set would have 100 points instead of 1000. If the value is unset or is greater than the number of data points, then the full number of data points is used. Please note that the actual time values in the time series are not used in any way in this analysis; for purposes of specifying the correlation time, the interval between consecutive points is always considered to be 1.
The question of how to determine the correlation time is in some sense beyond the scope of this document. In principle, one could simply compute the autocorrelation function for each time series; if the autocorrelation is well approximated by a single exponential, then 2x the decay time (the time it takes the autocorrelation to drop to 1/e) would be a good choice. If it's multiexponential, then you'd use the longest time constant. However, be careful: you really want to use the longest correlation time sampled in the trajectory, and the reaction coordinate may fluctuate rapidly but still be coupled to slower modes.
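For example, a rough estimate along those lines could be computed as follows (a sketch that assumes the simple single-exponential case):
import numpy as np

def correl_time_estimate(series):
    # Normalized autocorrelation function of the time series
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf /= acf[0]
    # First lag at which the autocorrelation drops below 1/e
    below = np.nonzero(acf < 1.0 / np.e)[0]
    decay = below[0] if below.size else len(acf)
    # Rule of thumb from the text: correl time ~ 2x the decay time
    return 2 * int(decay)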
It is important to note that the present version of the code uses the correlation times only for the error analysis and not for the actual PMF calculation. This isn't likely to be an issue, as the raw PMFs aren't that sensitive to the correlation times unless they vary by factors of 10 or more.
9.3 Using the code for replica exchange simulations
One major application for the ability to combine simulations run at different temperatures is the analysis of replica exchange simulations, and if the email I've gotten over the last couple of years is any indication, it's a pretty common one. My code can be used for replica exchange, but I should start by admitting that it wasn't designed with it in mind, and may seem a bit clumsy.
First, the metadata file format has changed as of the November 2007 release of the code. If you want to specify temperatures in the metadata file, you also have to specify the number of Monte Carlo points to use (if you're not using bootstrapping, you can safely set this to any integer). See section 8.1.2 for details.
In order to use wham with time series collected at different temperatures, the first thing to do is to follow the instructions given in section 8.1.2 regarding the format of the metadata and time series files, while setting the spring constants to 0. Indeed, for simple circumstances involving small systems this may be enough for you to make a successful calculation.
However, for large systems this simple approach will almost certainly get you nothing but a bunch of NaNs in your output. If this happens, the most likely candidate is an overflow or underflow in the probability histograms. The reason is that the temperature-sensitive version of the code increments the histogram by exp(-E/k_B T) for each point (as opposed to counting each point as 1). Since the potential energies for condensed-phase molecular dynamics systems using standard force fields are typically of order -50,000 kcal/mol, this means we'd be taking the exponential of a very large number, which is a Bad Thing numerically.
However, in many circumstances one can work around this easily, by shifting the location of zero energy. The simplest procedure is to locate the lowest energy in any of the trajectories, and shift all of the energies in all of the trajectories such that the lowest (most negative) value is now zero. This will eliminate the overflows, since the largest contribution from an individual data point will now be 1.
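A minimal sketch of this preprocessing step in Python (the file names are hypothetical, and the energy is assumed to be in the third column, as for 1-D wham input):
import numpy as np

files = ["rep300K.dat", "rep350K.dat", "rep400K.dat"]  # hypothetical names
data = [np.loadtxt(f) for f in files]

# Global minimum of the energy column across all trajectories
e_min = min(d[:, 2].min() for d in data)

# Shift so the lowest (most negative) energy becomes zero, and rewrite
for name, d in zip(files, data):
    d[:, 2] -= e_min
    np.savetxt(name + ".shifted", d, fmt="%.6f")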
However, shifting the energies upward can lead to a different set of problems, where a given simulation appears to have no probability associated with it, e.g. the sum of exp(-E/k_B T) for the trajectory underflows and is effectively zero. This can occur if the energies in the simulation are significantly higher than those in the lowest energy trajectories, which is expected for condensed phase systems at high temperatures. Underflow in itself isn't a problem, but if that simulation is the only one which contributes to a bin in the histogram (or more generally if all of the simulations which sample a given bin have zero overall weight), the result will be a division by zero, causing the probability to be NaN or Inf.
Solving this problem is sometimes quite simple: reshift the energies by a few kcal/mol, such that the lowest energy is slightly negative instead of zero (say -5 kcal/mol). If the problem is just numerical underflow, a small shift may be sufficient to make the problem numerically well-behaved. However, if the relevant portion of the histogram really is inaccessible except at high temperature, then there may be no way to fix the problem, short of running an additional umbrella-sampled trajectory.