Vol. 22 no. 6 2006, pages 768–770


BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btk051

Genetics and population analysis

LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters
Mary K. Kuhner
Department of Genome Sciences, Box 357730, University of Washington, Seattle, WA 98195-7730, USA
Received on November 15, 2005; revised and accepted on January 9, 2006
Advance Access publication January 12, 2006
Associate Editor: Frank Dudbridge

Downloaded from http://bioinformatics.oxfordjournals.org/ at University of Brighton on June 16, 2014

ABSTRACT
Summary: We present a Markov chain Monte Carlo coalescent genealogy sampler, LAMARC 2.0, which estimates population genetic parameters from genetic data. LAMARC can co-estimate subpopulation Θ = 4Neμ, immigration rates, subpopulation exponential growth rates and overall recombination rate, or a user-specified subset of these parameters. It can perform either maximum-likelihood or Bayesian analysis, and accommodates nucleotide sequence, SNP, microsatellite or electrophoretic data, with resolved or unresolved haplotypes. It is available as portable source code and executables for all three major platforms.
Availability: LAMARC 2.0 is freely available at http://evolution.gs.washington.edu/lamarc
Contact: [email protected]

1 INTRODUCTION
Inference of population parameters (such as effective population size, growth rate or immigration rate) from sequence data is often done using summary statistics in order to avoid dealing with the unknown genealogy relating the sampled sequences. Such genealogies are difficult to infer accurately, and nearly impossible in cases with recombination.
The Lamarc package addresses this difficulty by approximate integration over the space of possible genealogies using Markov chain Monte Carlo (MCMC) sampling. This avoids both the loss of power from using summary statistics and the difficulty of inferring the true genealogy. Previous programs in the package include COALESCE (Kuhner et al., 1995), estimating Θ = 4Neμ, and several programs co-estimating Θ and one additional type of parameter: FLUCTUATE (exponential growth rate) (Kuhner et al., 1998), MIGRATE (immigration rates) (Beerli and Felsenstein, 1999, 2001) and RECOMBINE (recombination rate) (Kuhner and Felsenstein, 2000; Kuhner et al., 2000a).
When multiple evolutionary forces act on a population, analyzing them one at a time may lead to bias and loss of power. We have developed an integrated program, LAMARC 2.0, which can infer multiple forces simultaneously for greater accuracy.
Previous Lamarc package programs have performed maximum-likelihood analysis, using the sampled genealogies to construct a likelihood surface for the parameters of interest. LAMARC 2.0 retains this capability, but can also perform a Bayesian analysis in which the sampler searches among parameter values as well as genealogies.

2 ALGORITHM
2.1 Statistical approaches
LAMARC's maximum-likelihood estimation uses a set of driving values (working values of the population parameters) to construct an importance sampling function which guides the search among genealogies. This procedure can be inefficient for finding the maximum-likelihood estimates (MLEs) of the parameter values unless the driving values are close to the unknown true parameters, so the search is iterated using the previous estimates as new driving values.
Bayesian estimation searches simultaneously among genealogies (guided by the current working values of the population parameters) and among values of the population parameters (guided by the current genealogy). Most probable estimates (MPEs) and credibility intervals are produced by recording the parameter values visited by the search and doing one-dimensional curve-smoothing to obtain the posterior probability curve for each parameter.
For both forms of analysis, LAMARC estimates parameters for each unlinked genomic region separately, as well as a joint estimate over all regions.

2.2 Evolutionary models
LAMARC estimates Θ = 4Neμ, where Ne is the effective diploid population size and μ is the neutral mutation rate per site per generation. (The estimated Θ can also be interpreted in a haploid, mitochondrial or alternative ploidy context.) It can co-estimate the exponential growth rate g. In subdivided populations it estimates Θ and optionally g for each subpopulation, and immigration rate into each subpopulation from each of the others. Finally, it can optionally estimate the overall recombination rate r = c/μ, where c is the recombination chance per site per generation. Customized models where specific rates are omitted, held constant or forced to be equal to one another are possible for all parameters.
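The driving-value scheme of Section 2.1 can be illustrated with a small Python sketch. It is not LAMARC's code: it assumes a single constant-size population, represents each genealogy only by its coalescent interval lengths (in expected mutations per site), and uses the constant-size coalescent density P(G|Θ) = Π (2/Θ) exp(−k(k−1)t_k/Θ) as the importance-sampling target. The function names and the hand-written sample are hypothetical; in a real analysis the genealogies would come from an MCMC run conditioned on the data at the driving value Θ0.

```python
import math


def log_coalescent_density(intervals, theta):
    """Return log P(G | theta) for a constant-size population.

    intervals[i] is the time (in expected mutations per site) during which
    the genealogy has k = n - i lineages, for n = len(intervals) + 1 tips.
    """
    n = len(intervals) + 1
    total = 0.0
    for i, t in enumerate(intervals):
        k = n - i  # number of lineages present during this interval
        total += math.log(2.0 / theta) - k * (k - 1) * t / theta
    return total


def log_relative_likelihood(theta, theta0, genealogies):
    """Estimate log[L(theta) / L(theta0)] from genealogies sampled at theta0."""
    log_ratios = [
        log_coalescent_density(g, theta) - log_coalescent_density(g, theta0)
        for g in genealogies
    ]
    # Average the ratios on the log scale (log-sum-exp) to avoid underflow.
    peak = max(log_ratios)
    mean_scaled = sum(math.exp(x - peak) for x in log_ratios) / len(log_ratios)
    return peak + math.log(mean_scaled)


def best_theta(theta0, genealogies, grid):
    """Pick the grid value of theta with the highest relative likelihood."""
    return max(grid, key=lambda th: log_relative_likelihood(th, theta0, genealogies))


# Hypothetical sample: one three-tip genealogy, intervals for k = 3 and k = 2.
sample = [[0.001, 0.002]]
grid = [0.001, 0.002, 0.005, 0.01, 0.02]
estimate = best_theta(0.01, sample, grid)
```

In an iterated search, `estimate` would become the next driving value and a fresh set of genealogies would be sampled under it. The grid search stands in for a proper maximizer and is only meant to show where the driving value enters the calculation.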

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access
version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University
Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its
entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]

3 IMPLEMENTATION
3.1 Search strategy
LAMARC 2.0 provides several mechanisms to improve its search efficiency. Metropolis-Coupled MCMC or 'heating' allows auxiliary searches with more permissive acceptance criteria to act as 'scouts' for the main analysis (Geyer, 1991a). For likelihood analyses, multiple replicated searches can be combined using reverse logistic regression (Geyer, 1991b). For Bayesian analyses, the Bayesian priors and the ratio of parameter change steps to genealogy change steps can be set by the user.

3.2 Mutational models
LAMARC 2.0 offers the Felsenstein 84 (F84) and General Time-Reversible (GTR) models for DNA or RNA data, and for SNP data when information about the total sequence length surveyed is available. The SNP model used is correct only if all variable sites in the data have been captured; there will be an ascertainment bias if SNPs were surveyed based on their presence in an external panel. Multiple substitution rate categories (including an invariant category) and potential autocorrelation between rates at adjacent sites are accommodated using a hidden Markov model (Felsenstein and Churchill, 1996).
For microsatellites, four models are available: a stepwise mutation model (Ohta and Kimura, 1973); a Brownian-motion approximation to the stepwise model (Beerli and Felsenstein, 2001), which is much faster but may be inaccurate when polymorphism is low; a K-allele model; and a mixture model of the stepwise and K-allele models, with the mixture parameter potentially optimized based on the data. The K-allele model is also suitable for analyzing electrophoretic data.
Separate genetic regions with different forms of data (e.g. a DNA locus and an unlinked microsatellite locus) may be combined in a single analysis. The user must provide information on the expected relative μ and/or Ne of the various regions if they differ. For example, mitochondrial and nuclear DNA may be combined in one analysis, but the program must be informed of the expected 4× difference in Ne.

3.3 Haplotype uncertainty
Phase-unknown data may be used, although they are less powerful than phase-known data. The genealogy search is extended to search among haplotype resolutions as well, so that the estimate takes into account haplotype uncertainty as well as genealogy uncertainty (Kuhner and Felsenstein, 2000; Kuhner et al., 2000b).

3.4 Availability
LAMARC 2.0 is freely distributed as portable C++ source code and as executables for Windows, Mac OS X and Linux. It provides a utility to convert PHYLIP, RECOMBINE and MIGRATE input files. The file converter's graphical user interface uses a multi-platform windowing system which works on all three major platforms, but a pure text file converter is also available. The major requirements for the use of LAMARC are availability of memory and time. For example, estimation of recombination rate using 60 mtDNA sequences of 16 kb each required 2 GB of memory and 3–4 weeks of workstation time. Smaller analyses will often take 1–2 days.

4 DISCUSSION
4.1 Model assumptions
LAMARC 2.0 assumes that individuals are drawn from panmictic subpopulations and that the subpopulation structure has been constant throughout the lifespan of the underlying coalescent tree. It is not suitable for populations which have recently diverged from a common ancestor. It assumes that the rate at which a lineage immigrates into a population is independent of the size of both source and recipient populations. It also assumes that exponential growth rates and immigration rates have been constant throughout the lifespan of the coalescent tree and that recombination rate does not vary by position, subpopulation or with time. Finally, it assumes that the variation being observed is neutral, though purifying selection removing harmful mutations does not disrupt the analysis much.
Violation of these assumptions will potentially result in biased estimates and inaccurate confidence intervals.

4.2 Bayesian versus likelihood analysis
In most cases examined so far (Kuhner and Smith, manuscript submitted) LAMARC's Bayesian and likelihood methods produce similar point estimates and confidence intervals. The Bayesian method is vulnerable to a poor choice of priors, but with good priors it may search among genealogies more efficiently, especially in cases where one or more parameters are close to zero. Our current curve-smoothing method does not allow the Bayesian algorithm to assess correlation among parameters, whereas the likelihood algorithm can. Speed requirements of the two methods are similar; the Bayesian sampler must perform more search steps, but its curve-smoothing is faster than likelihood maximization.

4.3 Data requirements
LAMARC 2.0 assumes that individuals are sampled randomly within each subpopulation, but it does not require equal sample sizes among subpopulations.
If some subpopulations are not genetically differentiated (4Nem much greater than one, with m the immigration rate) results will be unsatisfactory. Such subpopulations are best pooled into a single subpopulation.
A sample of 20 individuals per subpopulation is fully adequate and results are often satisfactory with as few as eight, especially if multiple loci are available. For estimation of any parameter except recombination rate, adding unlinked loci will improve the estimate more than adding individuals or lengthening sequences. For estimating recombination rate, lengthening sequences or adding linked loci is preferable.

4.4 History
LAMARC 1.0 was released in 2001. LAMARC 2.0 corrects several deficiencies in the previous versions, particularly errors in likelihood maximization and handling of multi-locus data. LAMARC 2.0 adds Bayesian analysis, the ability to constrain parameters and new mutational models.

ACKNOWLEDGEMENTS
The author thanks the Lamarc team, including Peter Beerli, Joseph Felsenstein, Eric Rynes, Lucian Smith, Elizabeth Walkup, Jon Yamato and Wang Yi. Development of this program was supported by NIH grant 5R01GM51929-11 to M.K. Funding to pay the Open Access publication charges was provided by the National Institutes of Health grant 5R01GM51929-11 to M.K.

Conflict of Interest: none declared.

REFERENCES
Beerli,P. and Felsenstein,J. (1999) Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics, 152, 763–773.
Beerli,P. and Felsenstein,J. (2001) Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations using a coalescent approach. Proc. Natl Acad. Sci. USA, 98, 4563–4568.
Felsenstein,J. and Churchill,G.A. (1996) A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol., 13, 93–104.
Geyer,C.J. (1991a) Markov chain Monte Carlo maximum likelihood. In Keramidas (ed.), Computing Science and Statistics: Proceedings of 23rd Symposium on the Interface, Interface Foundation, Fairfax Station, pp. 156–163.
Geyer,C.J. (1991b) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report No. 568, School of Statistics, University of Minnesota, MN (revised 1994).
Kuhner,M.K. and Felsenstein,J. (2000) Sampling among haplotype resolutions in a coalescent-based genealogy sampler. Genet. Epidemiol., 19 (Suppl. 1), S15–S21.
Kuhner,M.K. et al. (1995) Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics, 140, 1421–1430.
Kuhner,M.K. et al. (1998) Maximum likelihood estimation of population growth rates based on the coalescent. Genetics, 149, 429–434.
Kuhner,M.K. et al. (2000a) Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics, 156, 439–447.
Kuhner,M.K. et al. (2000b) Maximum likelihood estimation of recombination rates from population data. Genetics, 156, 1393–1401.
Ohta,T. and Kimura,M. (1973) A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res., 22, 201–204.
