Metropolis–Hastings algorithm
Contents
History
Intuition
Formal derivation
Use in numerical integration
Step-by-step instructions
See also
References
Notes
Further reading
History
The algorithm was named after Nicholas Metropolis, who authored the 1953 article Equation of State
Calculations by Fast Computing Machines together with Arianna W. Rosenbluth, Marshall
Rosenbluth, Augusta H. Teller and Edward Teller. This article proposed the algorithm for the case of
symmetrical proposal distributions, and W. K. Hastings extended it to the more general case in
1970.[1]
Some controversy exists with regard to credit for development of the algorithm. Metropolis, who was
familiar with the computational aspects of the method, had coined the term "Monte Carlo" in an
earlier article with Stanisław Ulam, and led the group in the Theoretical Division that designed and
built the MANIAC I computer used in the experiments in 1952. However, prior to 2003, there was no
detailed account of the algorithm's development. Then, shortly before his death, Marshall Rosenbluth
attended a 2003 conference at LANL marking the 50th anniversary of the 1953 publication. At this
conference, Rosenbluth described the algorithm and its development in a presentation titled "Genesis
of the Monte Carlo Algorithm for Statistical Mechanics".[2] Further historical clarification is made by
Gubernatis in a 2005 journal article[3] recounting the 50th anniversary conference. Rosenbluth makes
it clear that he and his wife Arianna did the work, and that Metropolis played no role in the
development other than providing computer time.
This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the
1953 article worked together for "days (and nights)".[4] In contrast, the detailed account by
Rosenbluth credits Teller with a crucial but early suggestion to "take advantage of statistical
mechanics and take ensemble averages instead of following detailed kinematics". This, says
Rosenbluth, started him thinking about the generalized Monte Carlo approach – a topic which he says
he had discussed often with John von Neumann. Arianna Rosenbluth recounted (to Gubernatis in
2003) that Augusta Teller started the computer work, but that Arianna herself took it over and wrote
the code from scratch. In an oral history recorded shortly before his death,[5] Rosenbluth again credits
Teller with posing the original problem, himself with solving it, and Arianna with programming the
computer. In terms of reputation, there is little reason to question Rosenbluth's account. In a
biographical memoir of Rosenbluth, Freeman Dyson writes:[6]
Many times I came to Rosenbluth, asking him a question [...] and receiving an answer in
two minutes. Then it would usually take me a week of hard work to understand in detail
why Rosenbluth's answer was right. He had an amazing ability to see through a
complicated physical situation and reach the right answer by physical arguments. Enrico
Fermi was the only other physicist I have known who was equal to Rosenbluth in his
intuitive grasp of physics.
Intuition
The Metropolis–Hastings algorithm can draw samples from any probability distribution with
probability density $P(x)$, provided that we know a function $f(x)$ proportional to the density and
the values of $f(x)$ can be calculated. The requirement that $f(x)$ must only be proportional to the
density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful,
because calculating the necessary normalization factor is often extremely difficult in practice.
The Metropolis–Hastings algorithm works by generating a sequence of sample values in such a way
that, as more and more sample values are produced, the distribution of values more closely
approximates the desired distribution. These sample values are produced iteratively, with the
distribution of the next sample being dependent only on the current sample value (thus making the
sequence of samples into a Markov chain). Specifically, at each iteration, the algorithm picks a
candidate for the next sample value based on the current sample value. Then, with some probability,
the candidate is either accepted (in which case the candidate value is used in the next iteration) or
rejected (in which case the candidate value is discarded, and current value is reused in the next
iteration)—the probability of acceptance is determined by comparing the values of the function
$f(x)$ of the current and candidate sample values with respect to the desired distribution.
For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings
algorithm where the proposal function is symmetric, is described below.
Let $f(x)$ be a function that is proportional to the desired probability density function $P(x)$ (a.k.a. a
target distribution)[a].
1. Initialization: Choose an arbitrary point $x_0$ to be the first observation in the sample and choose an
arbitrary probability density $g(x \mid y)$ (sometimes written $Q(x \mid y)$) that suggests a candidate for
the next sample value $x$, given the previous sample value $y$. In this section, $g$ is assumed to be
symmetric; in other words, it must satisfy $g(x \mid y) = g(y \mid x)$. A usual choice is to let $g(x \mid y)$ be a
Gaussian distribution centered at $y$, so that points closer to $y$ are more likely to be visited next,
making the sequence of samples into a random walk[b]. The function $g$ is referred to as the
proposal density or jumping distribution.
2. For each iteration t:
   Generate a candidate $x'$ for the next sample by picking from the distribution $g(x' \mid x_t)$.
   Calculate the acceptance ratio $\alpha = f(x')/f(x_t)$, which will be used to decide whether to
   accept or reject the candidate $x'$.[c] Because f is proportional to the density of P, we have that
   $\alpha = f(x')/f(x_t) = P(x')/P(x_t)$.
   Accept or reject:
      Generate a uniform random number $u \in [0, 1]$.
      If $u \le \alpha$, then accept the candidate by setting $x_{t+1} = x'$,
      If $u > \alpha$, then reject the candidate and set $x_{t+1} = x_t$ instead.
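For concreteness, here is a minimal Python sketch of the two steps above; the standard-normal target `f`, the NumPy-based Gaussian random-walk proposal, and the step size are illustrative choices, not part of the algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Unnormalized target density: proportional to a standard normal.
    # Any function proportional to the desired density P(x) would do.
    return np.exp(-0.5 * x**2)

def metropolis(f, x0, n_samples, step=1.0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        x_prop = x + rng.normal(scale=step)   # candidate drawn from g(x' | x_t)
        alpha = f(x_prop) / f(x)              # acceptance ratio f(x') / f(x_t)
        if rng.uniform() <= alpha:            # accept with probability min(1, alpha)
            x = x_prop                        # accept: x_{t+1} = x'
        # else reject: x_{t+1} = x_t (x is left unchanged)
        samples[t] = x
    return samples

chain = metropolis(f, x0=0.0, n_samples=10_000)
```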
This algorithm proceeds by randomly attempting to move about the sample space, sometimes
accepting the moves and sometimes remaining in place. Note that the acceptance ratio $\alpha$ indicates
how probable the new proposed sample is with respect to the current sample, according to the
distribution whose density is $P(x)$. If we attempt to move to a point that is more probable than the
existing point (i.e. a point in a higher-density region of $P(x)$, corresponding to an $\alpha > 1$), we will
always accept the move. However, if we attempt to move to a less probable point, we will sometimes
reject the move, and the larger the relative drop in probability, the more likely we are to reject the new
point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions
of $P(x)$, while only occasionally visiting low-density regions. Intuitively, this is why this algorithm
works and returns samples that follow the desired distribution with density $P(x)$.
Compared with an algorithm like adaptive rejection sampling[7] that directly generates independent
samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of
disadvantages:
The samples are correlated. Even though over the long term they do correctly follow $P(x)$, a set
of nearby samples will be correlated with each other and not correctly reflect the distribution. This
means that effective sample sizes can be significantly lower than the number of samples actually
taken, leading to large errors (see the sketch after this list).
Although the Markov chain eventually converges to the desired distribution, the initial samples
may follow a very different distribution, especially if the starting point is in a region of low density.
As a result, a burn-in period is typically necessary,[8] where an initial number of samples are
thrown away.
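To illustrate the first point, the following self-contained sketch shows how strong autocorrelation shrinks the effective sample size far below the raw sample count; the AR(1) toy chain and the truncation heuristic in the effective-sample-size estimate are illustrative choices, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(1)

def autocorr(x, lag):
    """Sample autocorrelation of the sequence x at a given lag."""
    d = x - x.mean()
    return np.dot(d[:-lag], d[lag:]) / np.dot(d, d)

def effective_sample_size(x, max_lag=1000):
    """Crude ESS estimate: N / (1 + 2 * sum of positive autocorrelations)."""
    rho_sum = 0.0
    for lag in range(1, min(max_lag, len(x) - 1)):
        rho = autocorr(x, lag)
        if rho <= 0:          # truncate once the correlation dies out
            break
        rho_sum += rho
    return len(x) / (1.0 + 2.0 * rho_sum)

# A strongly autocorrelated toy chain (AR(1)); a Metropolis chain behaves similarly.
chain = np.empty(10_000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()

print(len(chain), effective_sample_size(chain))  # ESS is far below the raw count
```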
On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality",
where the probability of rejection increases exponentially as a function of the number of dimensions.
Metropolis–Hastings, along with other MCMC methods, does not have this problem to such a degree,
and is thus often the only solution available when the number of dimensions of the distribution to
be sampled is high. As a result, MCMC methods are often the methods of choice for producing
samples from hierarchical Bayesian models and other high-dimensional statistical models used
nowadays in many disciplines.
Formal derivation
The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a
desired distribution $P(x)$. To accomplish this, the algorithm uses a Markov process, which
asymptotically reaches a unique stationary distribution $\pi(x)$ such that $\pi(x) = P(x)$.[11]
1. Existence of stationary distribution: there must exist a stationary distribution $\pi(x)$. A sufficient but
not necessary condition is detailed balance, which requires that each transition $x \to x'$ is
reversible: for every pair of states $x, x'$, the probability of being in state $x$ and transitioning to state
$x'$ must be equal to the probability of being in state $x'$ and transitioning to state $x$,
$\pi(x)\, P(x' \mid x) = \pi(x')\, P(x \mid x')$.
2. Uniqueness of stationary distribution: the stationary distribution must be unique. This is
guaranteed by ergodicity of the Markov process, which requires that every state must (1) be
aperiodic—the system does not return to the same state at fixed intervals; and (2) be positive
recurrent—the expected number of steps for returning to the same state is finite.
The Metropolis–Hastings algorithm involves designing a Markov process (by constructing transition
probabilities) that fulfills the two above conditions, such that its stationary distribution $\pi(x)$ is chosen
to be $P(x)$. The derivation of the algorithm starts with the condition of detailed balance:
$$P(x' \mid x)\, P(x) = P(x \mid x')\, P(x'),$$
which is re-written as
$$\frac{P(x' \mid x)}{P(x \mid x')} = \frac{P(x')}{P(x)}.$$
The approach is to separate the transition in two sub-steps; the proposal and the acceptance-rejection.
The proposal distribution $g(x' \mid x)$ is the conditional probability of proposing a state $x'$ given $x$, and
the acceptance distribution $A(x', x)$ is the probability to accept the proposed state $x'$. The transition
probability can be written as the product of them:
$$P(x' \mid x) = g(x' \mid x)\, A(x', x).$$
Inserting this relation into the previous equation, we have
$$\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)}.$$
The next step in the derivation is to choose an acceptance ratio that fulfills the condition above. One
common choice is the Metropolis choice:
$$A(x', x) = \min\left(1,\; \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)}\right).$$
For this Metropolis acceptance ratio $A$, either $A(x', x) = 1$ or $A(x, x') = 1$ and, either way, the
condition is satisfied.
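To spell out why the condition is satisfied either way, consider (without loss of generality) the case $P(x')\, g(x \mid x') \ge P(x)\, g(x' \mid x)$, so that $A(x', x) = 1$ and $A(x, x') = \frac{P(x)\, g(x' \mid x)}{P(x')\, g(x \mid x')}$. Then
$$P(x)\, P(x' \mid x) = P(x)\, g(x' \mid x)\, A(x', x) = P(x)\, g(x' \mid x),$$
$$P(x')\, P(x \mid x') = P(x')\, g(x \mid x')\, A(x, x') = P(x)\, g(x' \mid x),$$
so both sides of the detailed-balance condition agree; the opposite case follows by symmetry.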
The Metropolis–Hastings algorithm can thus be written as follows:
1. Initialise
   1. Pick an initial state $x_0$.
   2. Set $t = 0$.
2. Iterate
   1. Generate a random candidate state $x'$ according to $g(x' \mid x_t)$.
   2. Calculate the acceptance probability $A(x', x_t) = \min\left(1,\; \frac{P(x')}{P(x_t)}\, \frac{g(x_t \mid x')}{g(x' \mid x_t)}\right)$.
   3. Accept or reject:
      1. Generate a uniform random number $u \in [0, 1]$;
      2. if $u \le A(x', x_t)$, then accept the new state and set $x_{t+1} = x'$;
      3. if $u > A(x', x_t)$, then reject the new state, and copy the old state forward: $x_{t+1} = x_t$.
   4. Increment: set $t = t + 1$.
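A minimal Python sketch of this general form follows, with a deliberately asymmetric proposal so that the Hastings correction factor $g(x_t \mid x')/g(x' \mid x_t)$ matters; the exponential target and the log-normal random-walk proposal are illustrative assumptions, not prescribed by the algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Unnormalized target: Exponential(1) restricted to x > 0.
    return np.exp(-x) if x > 0 else 0.0

def log_g(x_new, x_old, s=0.5):
    # Log-density of the log-normal proposal: log(x_new) ~ Normal(log(x_old), s^2).
    # Asymmetric: g(x_new | x_old) != g(x_old | x_new) in general.
    z = (np.log(x_new) - np.log(x_old)) / s
    return -0.5 * z**2 - np.log(x_new * s * np.sqrt(2 * np.pi))

def metropolis_hastings(n_samples, x0=1.0, s=0.5):
    x = x0
    out = np.empty(n_samples)
    for t in range(n_samples):
        x_prop = np.exp(np.log(x) + s * rng.normal())   # draw from g(. | x_t)
        # Acceptance probability A = min(1, [P(x')/P(x_t)] * [g(x_t|x')/g(x'|x_t)])
        a = (f(x_prop) / f(x)) * np.exp(log_g(x, x_prop, s) - log_g(x_prop, x, s))
        if rng.uniform() <= a:
            x = x_prop          # accept; otherwise keep the old state
        out[t] = x
    return out

samples = metropolis_hastings(20_000)
print(samples[5_000:].mean())   # should be near 1, the mean of Exponential(1)
```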
Provided that specified conditions are met, the empirical distribution of saved states $x_0, \ldots, x_T$ will
approach $P(x)$. The number of iterations ($T$) required to effectively estimate $P(x)$ depends on a
number of factors, including the relationship between $P(x)$ and the proposal distribution and the
desired accuracy of estimation.[12] For distributions on discrete state spaces, it has to be of the order of
the autocorrelation time of the Markov process.[13]
It is important to notice that it is not clear, in a general problem, which distribution $g(x' \mid x)$ one
should use or the number of iterations necessary for proper estimation; both are free parameters of
the method, which must be adjusted to the particular problem in hand.
For example, consider a statistic $E(x)$ and its probability distribution $P(E)$, which is a marginal
distribution. Suppose that the goal is to estimate $P(E)$ for $E$ on the tail of $P(E)$. Formally, $P(E)$ can
be written as
$$P(E) = \int P(E \mid x)\, P(x)\, dx = \int \delta\big(E - E(x)\big)\, P(x)\, dx = \mathbb{E}\big[\delta\big(E - E(x)\big)\big],$$
and, thus, estimating $P(E)$ can be accomplished by estimating the expected value of the indicator
function $A_E(x) \equiv \mathbf{1}_E(x)$, which is 1 when $E(x) \in [E, E + \Delta E]$ and zero otherwise. Because $E$ is on
the tail of $P(E)$, the probability to draw a state $x$ with $E(x)$ on the tail of $P(E)$ is proportional to
$P(E)$, which is small by definition. The Metropolis–Hastings algorithm can be used here to sample
(rare) states more likely and thus increase the number of samples used to estimate $P(E)$ on the tails.
This can be done e.g. by using a sampling distribution $\pi(x)$ to favor those states (e.g. $\pi(x) \propto e^{aE}$ with
$a > 0$).
Step-by-step instructions
Suppose that the most recent value sampled is $x_t$. To follow the Metropolis–Hastings algorithm, we
next draw a new proposal state $x'$ with probability density $g(x' \mid x_t)$ and calculate a value
$$a = a_1 a_2,$$
where
$$a_1 = \frac{P(x')}{P(x_t)}$$
is the probability (e.g., Bayesian posterior) ratio between the proposed sample $x'$ and the previous
sample $x_t$, and
$$a_2 = \frac{g(x_t \mid x')}{g(x' \mid x_t)}$$
is the ratio of the proposal density in two directions (from $x_t$ to $x'$ and conversely). This is equal to 1 if
the proposal density is symmetric. Then the new state $x_{t+1}$ is chosen according to the following rules.
If $a \ge 1$: $x_{t+1} = x'$;
else: $x_{t+1} = x'$ with probability $a$, or $x_{t+1} = x_t$ with probability $1 - a$.
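In practice, $a$ is often computed in log space to avoid numerical underflow when the densities are very small. A small sketch of that bookkeeping follows; the functions `log_f` and `log_g` are hypothetical user-supplied log target and log proposal densities, not part of any particular library:

```python
import math
import random

def mh_accept(x_t, x_prop, log_f, log_g):
    """Metropolis-Hastings accept/reject decision using log densities.

    log_f(x)        : log of the unnormalized target density f
    log_g(x, given) : log of the proposal density g(x | given)
    """
    # log a = log a1 + log a2
    #       = [log f(x') - log f(x_t)] + [log g(x_t|x') - log g(x'|x_t)]
    log_a = (log_f(x_prop) - log_f(x_t)) + (log_g(x_t, x_prop) - log_g(x_prop, x_t))
    if log_a >= 0:                               # a >= 1: always accept
        return True
    return random.random() < math.exp(log_a)     # otherwise accept with probability a
```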
The Markov chain is started from an arbitrary initial value $x_0$, and the algorithm is run for many
iterations until this initial state is "forgotten". These samples, which are discarded, are known as
burn-in. The remaining set of accepted values of $x$ represent a sample from the distribution $P(x)$.
The algorithm works best if the proposal density matches the shape of the target distribution $P(x)$,
from which direct sampling is difficult, that is $g(x' \mid x_t) \approx P(x')$. If a Gaussian proposal density $g$ is
used, the variance parameter $\sigma^2$ has to be tuned during the burn-in period. This is usually done by
calculating the acceptance rate, which is the fraction of proposed samples that is accepted in a
window of the last $N$ samples. The desired acceptance rate depends on the target distribution;
however, it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian
distribution is about 50%, decreasing to about 23% for an $N$-dimensional Gaussian target
distribution.[14] These guidelines can work well when sampling from sufficiently regular Bayesian
posteriors, as they often follow a multivariate normal distribution, as can be established using the
Bernstein–von Mises theorem.[15]
If $\sigma^2$ is too small, the chain will mix slowly (i.e., the acceptance rate will be high, but successive
samples will move around the space slowly, and the chain will converge only slowly to $P(x)$). On the
other hand, if $\sigma^2$ is too large, the acceptance rate will be very low because the proposals are likely to
land in regions of much lower probability density, so $a_1$ will be very small, and again the chain will
converge very slowly. One typically tunes the proposal distribution so that the algorithm accepts on
the order of 30% of all samples – in line with the theoretical estimates mentioned in the previous
paragraph.
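One simple way to perform this tuning during burn-in is to adjust the proposal scale after each window of samples according to the observed acceptance rate. The sketch below illustrates the idea; the target rate of 0.3, the window size, and the multiplicative adjustment factors are heuristic choices, not part of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Unnormalized standard-normal target.
    return np.exp(-0.5 * x**2)

def tune_step(f, x0=0.0, step=5.0, target_rate=0.3, window=200, n_windows=50):
    """Adjust the random-walk step size toward a target acceptance rate."""
    x = x0
    for _ in range(n_windows):
        accepted = 0
        for _ in range(window):
            x_prop = x + rng.normal(scale=step)
            if rng.uniform() <= f(x_prop) / f(x):   # Metropolis acceptance
                x = x_prop
                accepted += 1
        rate = accepted / window
        if rate > target_rate:
            step *= 1.1     # accepting too often: take bigger jumps
        else:
            step *= 0.9     # rejecting too often: take smaller jumps
    return x, step

x_last, tuned_step = tune_step(f)
print(tuned_step)   # step size that yields roughly the target acceptance rate
```

After burn-in the step size is held fixed, so the subsequent chain remains a valid Metropolis sampler for the target.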
See also
Detailed balance
Genetic algorithms
Gibbs sampling
Mean-field particle methods
Metropolis-adjusted Langevin algorithm
Metropolis light transport
Multiple-try Metropolis
Parallel tempering
Preconditioned Crank–Nicolson algorithm
Sequential Monte Carlo
Simulated annealing
[Figure caption: The result of three Markov chains running on the 3D Rosenbrock function using the Metropolis–Hastings algorithm. The algorithm samples from regions where the posterior probability is high, and the chains begin to mix in these regions. The approximate position of the maximum has been illuminated. The red points are the ones that remain after the burn-in process; the earlier ones have been discarded.]
References
1. Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their
Applications". Biometrika. 57 (1): 97–109. Bibcode:1970Bimka..57...97H (https://ui.adsabs.harvar
d.edu/abs/1970Bimka..57...97H). doi:10.1093/biomet/57.1.97 (https://doi.org/10.1093%2Fbiome
t%2F57.1.97). JSTOR 2334940 (https://www.jstor.org/stable/2334940). Zbl 0219.65008 (https://zb
math.org/?format=complete&q=an:0219.65008).
2. M.N. Rosenbluth (2003). "Genesis of the Monte Carlo Algorithm for Statistical Mechanics". AIP
Conference Proceedings. 690: 22–30. Bibcode:2003AIPC..690...22R (https://ui.adsabs.harvard.ed
u/abs/2003AIPC..690...22R). doi:10.1063/1.1632112 (https://doi.org/10.1063%2F1.1632112).
3. J.E. Gubernatis (2005). "Marshall Rosenbluth and the Metropolis Algorithm" (https://zenodo.org/re
cord/1231899). Physics of Plasmas. 12 (5): 057303. Bibcode:2005PhPl...12e7303G (https://ui.ads
abs.harvard.edu/abs/2005PhPl...12e7303G). doi:10.1063/1.1887186 (https://doi.org/10.1063%2F
1.1887186).
4. Teller, Edward. Memoirs: A Twentieth-Century Journey in Science and Politics. Perseus
Publishing, 2001, p. 328
5. Rosenbluth, Marshall. "Oral History Transcript" (https://www.aip.org/history-programs/niels-bohr-lib
rary/oral-histories/28636-1). American Institute of Physics
6. F. Dyson (2006). "Marshall N. Rosenbluth". Proceedings of the American Philosophical Society.
250: 404.
7. Gilks, W. R.; Wild, P. (1992-01-01). "Adaptive Rejection Sampling for Gibbs Sampling". Journal of
the Royal Statistical Society. Series C (Applied Statistics). 41 (2): 337–348. doi:10.2307/2347565
(https://doi.org/10.2307%2F2347565). JSTOR 2347565 (https://www.jstor.org/stable/2347565).
8. Bayesian data analysis. Gelman, Andrew (2nd ed.). Boca Raton, Fla.: Chapman & Hall / CRC.
2004. ISBN 978-1584883883. OCLC 51991499 (https://www.worldcat.org/oclc/51991499).
9. Lee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-
theoretical review". Communications in Statistics - Theory and Methods. 51 (6): 1549–1568.
arXiv:2008.01006 (https://arxiv.org/abs/2008.01006). doi:10.1080/03610926.2021.1921214 (http
s://doi.org/10.1080%2F03610926.2021.1921214). S2CID 220935477 (https://api.semanticscholar.
org/CorpusID:220935477).
10. Gilks, W. R.; Best, N. G.; Tan, K. K. C. (1995-01-01). "Adaptive Rejection Metropolis Sampling
within Gibbs Sampling". Journal of the Royal Statistical Society. Series C (Applied Statistics). 44
(4): 455–472. doi:10.2307/2986138 (https://doi.org/10.2307%2F2986138). JSTOR 2986138 (http
s://www.jstor.org/stable/2986138).
11. Robert, Christian; Casella, George (2004). Monte Carlo Statistical Methods (https://archive.org/det
ails/springer_10.1007-978-1-4757-4145-2). Springer. ISBN 978-0387212395.
12. Raftery, Adrian E., and Steven Lewis. "How Many Iterations in the Gibbs Sampler?" In Bayesian
Statistics 4. 1992.
13. Newman, M. E. J.; Barkema, G. T. (1999). Monte Carlo Methods in Statistical Physics. USA:
Oxford University Press. ISBN 978-0198517979.
14. Roberts, G.O.; Gelman, A.; Gilks, W.R. (1997). "Weak convergence and optimal scaling of random
walk Metropolis algorithms" (http://www.stat.columbia.edu/~gelman/research/published/theory7.p
s). Ann. Appl. Probab. 7 (1): 110–120. CiteSeerX 10.1.1.717.2582 (https://citeseerx.ist.psu.edu/vie
wdoc/summary?doi=10.1.1.717.2582). doi:10.1214/aoap/1034625254 (https://doi.org/10.1214%2F
aoap%2F1034625254).
15. Schmon, Sebastian M.; Gagnon, Philippe (2022-04-15). "Optimal scaling of random walk
Metropolis algorithms using Bayesian large-sample asymptotics" (https://www.ncbi.nlm.nih.gov/pm
c/articles/PMC8924149). Statistics and Computing. 32 (2): 28. doi:10.1007/s11222-022-10080-8
(https://doi.org/10.1007%2Fs11222-022-10080-8). ISSN 0960-3174 (https://www.worldcat.org/iss
n/0960-3174). PMC 8924149 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8924149).
Notes
a. In the original paper by Metropolis et al. (1953), $f$ was taken to be the Boltzmann distribution as
the specific application considered was Monte Carlo integration of equations of state in physical
chemistry; the extension by Hastings generalized to an arbitrary distribution $f$.
b. In the original paper by Metropolis et al. (1953), $g(x \mid y)$ was suggested to be a random
translation with uniform density over some prescribed range.
c. In the original paper by Metropolis et al. (1953), $f$ was actually the Boltzmann distribution, as it
was applied to physical systems in the context of statistical mechanics (e.g., a maximal-entropy
distribution of microstates for a given temperature at thermal equilibrium). Consequently, the
acceptance ratio was itself an exponential of the difference in the parameters of the numerator
and denominator of this ratio.
Further reading
Bernd A. Berg. Markov Chain Monte Carlo Simulations and Their Statistical Analysis. Singapore,
World Scientific, 2004.
Siddhartha Chib and Edward Greenberg: "Understanding the Metropolis–Hastings Algorithm".
American Statistician, 49(4), 327–335, 1995
David D. L. Minh and Do Le Minh. "Understanding the Hastings Algorithm." Communications in
Statistics - Simulation and Computation, 44:2 332-349, 2015 (http://www.tandfonline.com/doi/abs/
10.1080/03610918.2013.777455#.VOk8J1PF9_c)
Bolstad, William M. (2010) Understanding Computational Bayesian Statistics, John Wiley & Sons
ISBN 0-470-04609-0