Arlequin 301
Computational and Molecular Population Genetics Lab (CMPG)
Institute of Zoology
University of Berne
Baltzerstrasse 6
3012 Bern
Switzerland
E-mail: [email protected]
URL: http://cmpg.unibe.ch/software/arlequin3
January 2006
Table of contents
ARLEQUIN ver 3.01 user manual
1 Introduction
1.1 Why Arlequin?
1.2 Arlequin philosophy
1.3 About this manual
1.4 Data types handled by Arlequin
1.4.1 DNA sequences
1.4.2 RFLP Data
1.4.3 Microsatellite data
1.4.4 Standard data
1.4.5 Allele frequency data
1.5 Methods implemented in Arlequin
1.6 System requirements
1.7 Installing and uninstalling Arlequin
1.7.1 Installation
1.7.1.1 Arlequin 3 installation
1.7.1.2 Arlequin 3 uninstallation
1.8 List of files included in the Arlequin package
1.9 Arlequin computing limitations
1.10 How to cite Arlequin
1.11 Acknowledgements
1.12 How to get the last version of the Arlequin software?
1.13 What's new in version 3.0 compared to version 2.0
1.14 What's new in version 3.01 compared to version 3.0
1.15 Forthcoming developments
1.16 Reporting bugs and comments
1.17 Remaining problems
2 Getting started
2.1 Arlequin configuration
2.2 Preparing input files
2.2.1 Defining the Genetic Structure to be tested
2.3 Loading project files into Arlequin
2.4 Selecting analyses to be performed on your data
2.5 Creating and using Setting Files
2.6 Performing the analyses
2.7 Interrupting the computations
2.8 Consulting the results
3 Input files
3.1 Format of Arlequin input files
3.2 Project file structure
3.2.1 Profile section
3.2.2 Data section
3.2.2.2 Distance matrix (optional)
3.2.2.3 Samples
3.2.2.4 Genetic structure
1.1.1.4 Mantel test settings
3.3 Example of an input file
3.4 Automatically creating the outline of a project file
3.5 Conversion of data files
3.6 Arlequin batch files
4 Output files
4.1 Result file
4.2 Arlequin log file
4.3 Linkage disequilibrium result file
4.4 View your results in HTML browser
5 Examples of input files
5.1 Example of allele frequency data
5.2 Example of standard data (Genotypic data, unknown gametic phase, recessive alleles)
5.3 Example of DNA sequence data (Haplotypic)
5.4 Example of microsatellite data (Genotypic)
5.5 Example of RFLP data (Haplotypic)
5.6 Example of standard data (Genotypic data, known gametic phase)
6 Arlequin interface
6.1 Menus
6.1.1 File Menu
6.1.2 View Menu
6.1.3 Options Menu
6.1.4 Help Menu
6.2 Toolbar
6.3 Tab dialogs
6.3.1 Open project
6.3.2 Handling of unphased genotypic data
6.3.3 Arlequin Configuration
6.3.4 Project Wizard
6.3.5 Import data
6.3.6 Loaded Project
6.3.7 Batch files
6.3.8 Calculation Settings
6.3.8.1 General Settings
6.3.8.2 Diversity indices
6.3.8.3 Mismatch distribution
6.3.8.4 Haplotype inference
6.3.8.4.1 Haplotypic data, or genotypic (diploid) data with known gametic phase
6.3.8.4.2 Genotypic data with unknown gametic phase
6.3.8.5 Linkage disequilibrium
6.3.8.5.1 Linkage disequilibrium between pairs of loci
6.3.8.5.2 Hardy-Weinberg equilibrium
6.3.8.6 Neutrality tests
6.3.8.7 Genetic structure
6.3.8.7.1 AMOVA
6.3.8.7.2 Population comparison
6.3.8.7.3 Population differentiation
6.3.8.8 Genotype assignment
6.3.8.9 Mantel test
7 Methodological outlines
7.1 Intra-population level methods
7.1.1 Standard diversity indices
7.1.1.1 Gene diversity
7.1.1.2 Number of usable loci
7.1.1.3 Number of polymorphic sites (S)
7.1.2 Molecular indices
7.1.2.1 Mean number of pairwise differences (π)
7.1.2.2 Nucleotide diversity or average gene diversity over L loci (RFLP and DNA data)
7.1.2.3 Theta estimators
7.1.2.3.1 Theta(Hom)
7.1.2.3.2 Theta(S)
7.1.2.3.3 Theta(k)
7.1.2.3.4 Theta(π)
7.1.2.4 Mismatch distribution
7.1.2.4.1 Pure demographic expansion
7.1.2.4.2 Spatial expansion
7.1.2.5 Estimation of genetic distances between DNA sequences
7.1.2.5.1 Pairwise difference
7.1.2.5.2 Percentage difference
7.1.2.5.3 Jukes and Cantor
7.1.2.5.4 Kimura 2-parameters
7.1.2.5.5 Tamura
7.1.2.5.6 Tajima and Nei
7.1.2.5.7 Tamura and Nei
7.1.2.6 Estimation of genetic distances between RFLP haplotypes
7.1.2.6.1 Number of pairwise difference
7.1.2.6.2 Proportion of difference
7.1.2.7 Estimation of distances between Microsatellite haplotypes
7.1.2.7.1 No. of different alleles
7.1.2.7.2 Sum of squared size difference
7.1.2.8 Estimation of distances between Standard haplotypes
7.1.2.8.1 Number of pairwise differences
7.1.2.9 Minimum Spanning Network among haplotypes
7.1.3 Haplotype inference
7.1.3.1 Haplotypic data or Genotypic data with known Gametic phase
7.1.3.2 Genotypic data with unknown Gametic phase
7.1.3.2.1 EM algorithm
7.1.4 Linkage disequilibrium between pairs of loci
7.1.4.1 Exact test of linkage disequilibrium (haplotypic data)
7.1.4.2 Likelihood ratio test of linkage disequilibrium (genotypic data, gametic phase unknown)
7.1.4.3 Measures of gametic disequilibrium (haplotypic data)
7.1.5 Hardy-Weinberg equilibrium
7.1.6 Neutrality tests
7.1.6.1 Ewens-Watterson homozygosity test
7.1.6.2 Ewens-Watterson-Slatkin exact test
7.1.6.3 Chakraborty's test of population amalgamation
7.1.6.4 Tajima's test of selective neutrality
7.1.6.5 Fu's FS test of selective neutrality
7.2 Inter-population level methods
7.2.1 Population genetic structure inferred by analysis of variance (AMOVA)
7.2.1.1 Haplotypic data, one group of populations
7.2.1.2 Haplotypic data, several groups of populations
7.2.1.3 Genotypic data, one group of populations, no within-individual level
7.2.1.5 Genotypic data, one population, within-individual level
7.2.1.6 Genotypic data, one group of populations, within-individual level
7.2.1.7 Genotypic data, several groups of populations, within-individual level
7.2.2 Minimum Spanning Network (MSN) among haplotypes
7.2.3 Locus-by-locus AMOVA
7.2.4 Population specific FST indices
7.2.5 Population pairwise genetic distances
7.2.5.1 Reynolds distance (Reynolds et al. 1983)
7.2.5.2 Slatkin's linearized FST's (Slatkin 1995)
7.2.5.3 M values (M = Nm for haploid populations, M = 2Nm for diploid populations)
7.2.5.4 Nei's average number of differences between populations
7.2.5.5 Relative population sizes - Divergence between populations of unequal sizes
7.2.6 Exact tests of population differentiation
7.2.7 Assignment of individual genotypes to populations
7.2.8 Mantel test
8 References
9 Appendix
9.1 Overview of input file keywords
Introduction
5) Methodological outlines describing which computations are actually performed by Arlequin. Even though this manual describes some theoretical aspects, it should not be considered a textbook in basic population genetics. We strongly recommend that you consult the original references provided with the description of a given method if you are in doubt about any aspect of the analysis.
By haplotypic form we mean that genetic data can be presented under the form of haplotypes (i.e. a combination of alleles at one or more loci). This haplotypic form can result from the analyses of haploid genomes (mtDNA, Y chromosome, prokaryotes), or from diploid genomes where the gametic phase could be inferred by one way or another. Note that allelic data are treated here as a single locus haplotype.
By genotypic form, we mean that genetic data is presented under the form of diploid genotypes (i.e. a combination of pairs of alleles at one or more loci). Each genotype is entered on two separate lines, with the two alleles of each locus being on a different line.
Ex 1: Genotypic DNA sequence data:
  ACGGCATTTAAGCATGACATACGGATTGACA
  ACGGGATTTTAGCATGACATTCGGATAGACA
Ex 2: Genotypic Microsatellite data:
  63 62 24 24 32 30
The gametic phase of a multi-locus genotype may be either known or unknown. If the gametic phase is known, the genotype can be considered as made up of two well-defined haplotypes. For genotypic data with unknown gametic phase, you can consider the two alleles present at each locus as codominant, or you can allow for the presence of a recessive allele. This finally gives four possible forms of genetic data:
  Haplotypic data
  Genotypic data with known gametic phase
  Genotypic data with unknown gametic phase (no recessive alleles)
  Genotypic data with unknown gametic phase (recessive alleles)
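The four forms can be related to the profile keywords GenotypicData, GameticPhase and RecessiveData described in the Profile section. The following Python sketch is only an illustration of that mapping: the function name `data_form` is hypothetical, and the choice to ignore RecessiveData when the gametic phase is known is an assumption, not documented Arlequin behaviour.

```python
def data_form(genotypic_data, gametic_phase=1, recessive_data=0):
    """Name the form of genetic data implied by three profile settings.

    Defaults follow the default values listed in the Profile section
    (GameticPhase = 1, RecessiveData = 0).
    """
    if genotypic_data == 0:
        return "haplotypic"
    if gametic_phase == 1:
        # Assumption: RecessiveData is not relevant when the phase is known.
        return "genotypic, known gametic phase"
    if recessive_data == 1:
        return "genotypic, unknown gametic phase, recessive alleles"
    return "genotypic, unknown gametic phase, no recessive alleles"


print(data_form(0))
print(data_form(1, gametic_phase=0, recessive_data=1))
```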
code the other alleles in terms of additional repeats as compared to this reference. If this strategy is impossible, then any other number could be used as an allelic code, but the stepwise mutation model could not be assumed for these data.
Mismatch distribution
Linkage disequilibrium
Hardy-Weinberg equilibrium
Tajima's neutrality test (infinite site model)
Fu's FS neutrality test (infinite site model)
Ewens-Watterson neutrality test (infinite allele model)
Chakraborty's amalgamation test (infinite allele model)
Minimum Spanning Network (MSN)
Short description: Comparison of population samples for their haplotypic content. All the results are then summarized in a table. Different hierarchical Analyses of Molecular Variance to evaluate the amount of population genetic structure.
1.7 Installing and uninstalling Arlequin
1.7.1 Installation
1.7.1.1 Arlequin 3 installation
1) Download Arlequin3.zip to any temporary directory.
2) Extract all files contained in Arlequin3.zip into the directory of your choice.
3) Start Arlequin by double-clicking on the file WinArl3.exe, which is the main executable file.
Arlequin files and their description:
WinArl3.exe: Arlequin main application file including graphical interface and computational routines.
Arlequin.ini: A file containing the description of the last custom settings defined by the user. (NOT TO BE MODIFIED BY HAND)
Arl_run.ars: A file containing all the computation settings selected by the user to perform some calculation with Arlequin. (NOT TO BE MODIFIED BY HAND)
Arl_run.txt: A file containing information about Arlequin working directory and path to working project file. (NOT TO BE MODIFIED BY HAND)
Arlecore3.exe: A console application that can perform all computations selected by the graphical interface (for advanced users wanting to write scripts to analyse many data sets). Arlecore3.exe needs the three files Arlequin.ini, arl_run.ars and arl_run.txt to perform correctly.
recent_pro.txt: A file containing the list of up to the last ten projects loaded into Arlequin. (NOT TO BE MODIFIED BY HAND)
ua.js and ftiens4.js: These files contain the JavaScript routines that allow the browsing of the result HTML files. These scripts need the gif files.
14 gif files: These gif files are used by the JavaScript routines for graphical display in the main result HTML file.
Qtinf.dll: A dynamic link library necessary for the display of graphical components of the application.
Arlequin3.pdf: Arlequin 3 user manual in PDF format.
Readme30.txt: A text file containing a short description of the main features of Arlequin.
Example files in subdirectory datafiles:
Amova\amovahap.arp
Amova\amovahap.ars
Amova\amovadis.arp
Amova\amovadis.ars
Amova\56hapdef.txt
Amova\amovadis.dis
Batch\batch_ex.arb
Batch\amova1.arp
Batch\amova1.ars
Batch\amova2.arp
Batch\amova2.ars
Batch\amova1mat.dis
Batch\genotsta.arp
Batch\genotsta.ars
Batch\microsat.arp
Batch\microsat.ars
Batch\missdata.arp
Batch\missdata.ars
Batch\phenohla.arp
Batch\phenohla.ars
Batch\relfreq.arp
Batch\relfreq.ars
Batch\indlevel.arp
Batch\indlevel.ars
Conversion\gene_pop1.gpp
Dna\mtdna_hv1.arp
Dna\mtdna_hv1.ars
Dna\nucl_div.arp
Dna\nucl_div.ars
Disequil\hwequil.arp
Disequil\hwequil.ars
Disequil\ld_gen0.arp
Disequil\ld_gen0.ars
Disequil\ld_gen1.arp
Disequil\ld_gen1.ars
Disequil\ld_hap.arp
Disequil\ld_hap.ars
Freqncy\cohen.arp
Freqncy\cohen.ars
Haplfreq\hla_7pop.arp
Haplfreq\hla_7pop.ars
Mantel\custom_corr3mat.arp
Mantel\custom_corr3mat.ars
Mantel\fst_corr.arp
Mantel\fst_corr.ars
Mantel\fst_partial_corr.arp
Mantel\fst_partial_corr.ars
Microsat\2popmic.arp
Microsat\2popmic.ars
Microsat\micdipl.arp
Microsat\micdipl.ars
Microsat\micdipl2.arp
Microsat\micdipl2.ars
Neutrtst\chak_tst.arp
Neutrtst\chak_tst.ars
Neutrtst\ew_watt.arp
Neutrtst\ew_watt.ars
Neutrtst\Fu_s_test.arp
Neutrtst\Fu_s_test.ars
Line length in input files is limited to 100,000 characters.
Interleaved format is not supported in Arlequin. This concerns haplotype definitions, multilocus genotypes, and distance matrices.
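Before loading a large project, the line-length limit can be checked with a few lines of script. The following Python sketch is only an illustration: the function name `find_overlong_lines` is hypothetical, and it assumes that the limit applies to the line content without its end-of-line characters.

```python
MAX_LINE_LENGTH = 100_000  # limit stated in this manual


def find_overlong_lines(lines, limit=MAX_LINE_LENGTH):
    """Return the 1-based numbers of the lines longer than `limit` characters."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        # Assumption: end-of-line characters do not count towards the limit.
        if len(line.rstrip("\r\n")) > limit
    ]


sample = ["[Profile]\n", "A" * 100_001 + "\n", "NbSamples=1\n"]
print(find_overlong_lines(sample))
```

To check a real project, the lines of the file can be passed directly, for instance `find_overlong_lines(open("my_project.arp"))`, where the file name is a placeholder.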
1.11 Acknowledgements
This program has been made possible by Swiss NSF grants No. 32-37821-93, 32.047053.96, and 31-56755.99. Many thanks to: David Roessli, Samuel Neuenschwander, Carlo Largiadèr, Pierre Berthier, Mathias Currat, Guillaume Laval, Nicolas Ray, Gerald Heckel, Sabine Fink, Pierre Berthier, Daniel Wegmann, Jean-Marc Kuffer, Yannis Michalakis, Thierry Pun, Montgomery Slatkin, David Balding, Peter Smouse, Oscar Gaggiotti, Alicia Sanchez-Mazas, Isabelle Dupanloup, Estella Poloni, Giorgio Bertorelle, Guido Barbujani, Michele Belledi, Evelyne Heyer, Erika Bucheli, Alex Widmer, Philippe Jarne, Frédérique Viard, Peter de Knijff, Peter Beerli, Matthew Hurles, Mark Stoneking, Rosalind Harding, Frank Struyf, A.J. Gharrett, Jennifer Ovenden, Steve Carr, Marc Allard, Omar Chassin, Alonso Santos, John Novembre, Nelson Fagundes, Eric Minch, Pierre Darlu, Jérôme Goudet, François Balloux, Eric Petit, Ettore Randi, Natacha Mesquita, David Foltz, Guoqing Lu, Tomas Hrbek, Corinne Zeroual, Rod Norman, Chew-Kiat Heng, Russell Pfau, April Harlin, S Kark, Jenny Ovenden, Jill Shanahan, and all the other users or beta-testers of Arlequin who have sent us their comments.
1) Correction of many small bugs (see below).
2) Incorporation of two new methods to estimate gametic phase and haplotype frequencies:
   a) EM zipper algorithm: an extension of the EM algorithm allowing one to handle a larger number of polymorphic sites than the plain EM algorithm.
   b) ELB algorithm: a pseudo-Bayesian approach to specifically estimate gametic phase in recombining sequences.
3) Incorporation of a least-square approach to estimate the parameters of an instantaneous spatial expansion from DNA sequence diversity within samples, and computation of bootstrap confidence intervals using coalescent simulations.
4) Estimation of confidence intervals for F-statistics, using a bootstrap approach when genetic data on more than 8 loci are available.
5) Update of the JavaScript routines in the output HTML files, making them fully compatible with modern web browsers like Firefox 1.0 and above.
6) A completely rewritten project reading and parsing procedure, giving more precise information on the location of potential syntax and format mistakes in the input files.
7) No need to define a web browser for consulting the results. Arlequin will automatically present the results in your default web browser (we recommend the use of Firefox, freely available on http://www.mozilla.org/products/firefox/central.html).
New editor of genetic structure allowing one to modify the current Genetic Structure on the fly (see section Defining the Genetic Structure to be tested 2.2.1).
Computation of population specific FST indices, when a single group is defined in the Genetic Structure. This may be useful to recognize populations contributing particularly to the global FST measure. This is also available in the locus-by-locus AMOVA section (see section Population specific FST indices 7.2.4).
2 GETTING STARTED
The first thing to do before running Arlequin for the first time is certainly to read the present manual. It will provide you with most of the information you are looking for. So, take some time to read it before you seriously start analyzing your data.
Before a first use of Arlequin, you need to specify which text editor will be used by Arlequin to edit project files or view the log file. We recommend the use of a powerful text editor like TextPad, freely available on http://www.textpad.com.
There are two ways to create Arlequin projects:
1) You can start from scratch and use a text editor to define your data using reserved keywords.
2) You can let Arlequin create the outline of a project by selecting the tab panel Project Wizard (see section Project Wizard 6.3.4).
The controls on this tab panel allow you to specify the type of project outline that should be built. Use the Browse button to choose a name and a hard disk location for the project. Once all the settings have been chosen, the project outline is created by pressing the "Create Project" button. Note that it is not automatically loaded into Arlequin. The name of the data file should have a "*.arp" extension (for ARlequin Project). You can then edit the project by pressing the Edit Project button. Note that this wizard only creates an outline and that you need to fill in the data and specify your genetic structure manually.
A new Genetic Structure Editor has been implemented in version 3.01. In the left pane, all population samples found in the opened project are listed in the right column, with a corresponding group identifier in the left column. If no Genetic Structure is defined, the "0" identifier will be listed. In the right pane, the resulting structure is shown. Population samples can be assigned to different groups by giving them a new group identifier, like:
When you press the Update Project button, this new Structure is added to the project file, a backup copy of the old project is created (with the extension *.arp.bak), and the revised project is reloaded into Arlequin.
A dialog box should open to allow the selection of an existing project you want to work on, like
The Arlequin project files must have the *.arp extension. If your project file is valid, its main properties will be shown in the Project tab.
You can navigate in the tree on the left side to select the different types of computations you wish to set up. Depending on your selection, the right part of the tab dialog will show you different parameters to set up.
If an error occurs during the execution, Arlequin will write diagnostic information in a log file. If the error is not too severe, Arlequin will open the web browser where you can consult the log file. If there is a memory error, Arlequin will shut itself down. In the latter case, you should consult the Arlequin log file before launching a new analysis in order to get some information on where or at which stage of the execution the problem occurred. To do that, just reopen your last project, and press the View Log File button on the toolbar above. In any case, the file Arlequin_log.txt is located in the project results directory.
Note that by pressing the Stop button you have no guarantee that the current computations give correct results. For very large project files, you may have to wait for a few seconds before the calculations are stopped.
Input files
  Possible values: Any integer number between 1 and 1000.
  Example: NbSamples =3

The type of data to be analyzed. Only one type of data is allowed per project.
  Notation: DataType =
  Possible values: DNA, RFLP, MICROSAT, STANDARD and FREQUENCY
  Example: DataType = DNA

If the current project deals with haplotypic or genotypic data
  Notation: GenotypicData =
  Possible values: 0 (haplotypic data), 1 (genotypic data)
  Example: GenotypicData = 0

One can also optionally specify:

The character used to separate the alleles at different loci (the locus separator)
  Notation: LocusSeparator =
  Possible values: WHITESPACE, TAB, NONE, or any character other than "#", or the character specifying missing data.
  Example: LocusSeparator = TAB
  Default value: WHITESPACE

If the gametic phase of genotypes is known
  Notation: GameticPhase =
  Possible values: 0 (gametic phase not known), 1 (known gametic phase)
  Example: GameticPhase = 1
  Default value: 1

If the genotypic data present a recessive allele
  Notation: RecessiveData =
  Possible values: 0 (co-dominant data), 1 (recessive data)
  Example: RecessiveData =1
  Default value: 0

The code for the recessive allele
  Notation: RecessiveAllele =
  Possible values: Any string of characters within double quotes. This string can be explicitly used in the input file to indicate the occurrence of a recessive homozygote at one or several loci.
  Example: RecessiveAllele ="xxx"
  Default value: "null"

The character used to code for missing data
  Possible values: A character used to specify the code for missing data, entered between single or double quotes.
  Example: MissingData ='$'
  Default value: '?'

If haplotype or phenotype frequencies are entered as absolute or relative values
  Notation: Frequency =
  Possible values: ABS (absolute values), REL (relative values: absolute values will be found by multiplying the relative frequencies by the sample sizes)
  Example: Frequency = ABS
  Default value: ABS

The number of significant digits for haplotype frequency outputs
  Notation: FrequencyThreshold =
  Possible values: A real number between 1e-2 and 1e-7
  Example: FrequencyThreshold = 0.00001
  Default value: 1e-5

The convergence criterion for the EM algorithm used to estimate haplotype frequencies and linkage disequilibrium from genotypic data
  Notation: EpsilonValue =
  Possible values: A real number between 1e-7 and 1e-12.
  Example: EpsilonValue = 1e-10
  Default value: 1e-7
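When many project files have to be produced, the profile section can be generated by a script. The following Python sketch is only an illustration, not part of Arlequin: the function name `build_profile` and its parameters are hypothetical, it writes only a subset of the keywords listed above, and the choice to write GameticPhase only for genotypic data is an assumption.

```python
VALID_DATA_TYPES = {"DNA", "RFLP", "MICROSAT", "STANDARD", "FREQUENCY"}


def build_profile(title, nb_samples, data_type, genotypic,
                  gametic_phase=1, locus_separator="WHITESPACE",
                  missing_data="?"):
    """Return the text of a [Profile] section after basic validation."""
    if not 1 <= nb_samples <= 1000:
        raise ValueError("NbSamples must be between 1 and 1000")
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"unknown DataType: {data_type}")
    if genotypic not in (0, 1) or gametic_phase not in (0, 1):
        raise ValueError("GenotypicData and GameticPhase must be 0 or 1")
    lines = [
        "[Profile]",
        f'Title="{title}"',
        f"NbSamples={nb_samples}",
        f"DataType={data_type}",
        f"GenotypicData={genotypic}",
        f"LocusSeparator={locus_separator}",
        f"MissingData='{missing_data}'",
    ]
    if genotypic == 1:
        # Assumption: the gametic phase is only meaningful for genotypes.
        lines.append(f"GameticPhase={gametic_phase}")
    return "\n".join(lines)


print(build_profile("Fake HLA data", 4, "STANDARD", 1, gametic_phase=0))
```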
then use this identifier in the sample data section. This way Arlequin will know exactly the DNA sequences associated to each haplotype. However, this section is optional. The haplotypes can be fully defined in the sample data section.
An identifier and a combination of alleles at different loci (one or more) describe a given haplotype. The locus separator defined in the profile section must separate adjacent alleles from each other.
It is also possible to have the definition of the haplotypes in an external file. Use the keyword EXTERN followed by the name of the file containing the definition of the haplotypes. Read Example 2 to see how to proceed. If the file "hapl_file.hap" contains exactly what is between the braces of Example 1, the two haplotype lists are equivalent.

Example 1:
  [[HaplotypeDefinition]]   #start the section of Haplotype definition
  HaplListName="list1"      #give any name you wish to this list
  HaplList={
  h1 A T   #on each line, the name of the haplotype is
  h2 G C   # followed by its definition.
  h3 A G
  h4 A A
  h5 G G
  }

Example 2:
  [[HaplotypeDefinition]]   #start the section of Haplotype definition
  HaplListName="list1"      #give any name you wish to this list
  HaplList = EXTERN "hapl_file.hap"
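The content between the braces of a haplotype list is simple enough to be read by a short script, for instance to check an external haplotype file before using it with EXTERN. The following Python sketch is only an illustration: the function name `parse_hapl_list` is hypothetical, it assumes the WHITESPACE locus separator, and rejecting repeated identifiers is a choice made for this sketch.

```python
def parse_hapl_list(block):
    """Map each haplotype identifier to its list of alleles."""
    haplotypes = {}
    for raw in block.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        identifier, *alleles = line.split()
        if identifier in haplotypes:
            raise ValueError(f"haplotype {identifier} defined twice")
        haplotypes[identifier] = alleles
    return haplotypes


block = """
h1 A T   #on each line, the name of the haplotype is
h2 G C   # followed by its definition.
h3 A G
h4 A A
h5 G G
"""
for name, alleles in parse_hapl_list(block).items():
    print(name, alleles)
```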
It is also possible to have the definition of the distance matrix given in an external file. Use the keyword EXTERN followed by the name of the file containing the definition of the matrix. Read Example 2 to see how to proceed.

Example 1:
  [[DistanceMatrix]]    #start the distance matrix definition section
  MatrixName= "none"    # name of the distance matrix
  MatrixSize= 4         # size = number of lines of the distance matrix
  MatrixData={
  h1 h2 h3 h4           # labels of the distance matrix (identifier of the haplotypes)
  0.00000
  2.00000 0.00000
  1.00000 2.00000 0.00000
  1.00000 2.00000 1.00000 0.00000
  }

Example 2:
  [[DistanceMatrix]]    #start the distance matrix definition section
  MatrixName= "none"    # name of the distance matrix
  MatrixSize= 4         # size = number of lines of the distance matrix
  MatrixData= EXTERN "mat_file.dis"
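As Example 1 shows, the matrix is entered in lower-triangular form, diagonal included. If your distances are stored as a full symmetric matrix, a script can write the section for you. The following Python sketch is only an illustration: the function name `format_distance_matrix` is hypothetical, and the five-decimal formatting simply mimics Example 1.

```python
def format_distance_matrix(name, labels, full):
    """Write a [[DistanceMatrix]] section from a full square matrix."""
    n = len(labels)
    if len(full) != n or any(len(row) != n for row in full):
        raise ValueError("matrix must be square and match the labels")
    lines = [
        "[[DistanceMatrix]]",
        f'MatrixName="{name}"',
        f"MatrixSize={n}",
        "MatrixData={",
        " ".join(labels),
    ]
    for i in range(n):
        # Keep only the lower triangle, diagonal included.
        lines.append(" ".join(f"{full[i][j]:.5f}" for j in range(i + 1)))
    lines.append("}")
    return "\n".join(lines)


full = [
    [0, 2, 1, 1],
    [2, 0, 2, 2],
    [1, 2, 0, 1],
    [1, 2, 1, 0],
]
print(format_distance_matrix("none", ["h1", "h2", "h3", "h4"], full))
```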
3.2.2.3 Samples
In this obligatory sub-section, one defines the haplotypic or genotypic content of the different samples to be analyzed. Each sample definition begins with the keyword SampleName and ends after a SampleData has been defined. One must specify:

A name for each sample
  Notation: SampleName =
  Possible values: Any string of characters within quotes.
  Example: SampleName= "A first example of a sample name"
  Note: This name will be used in the Structure sub-section to identify the different samples that are part of a given genetic structure to test.

The size of the sample
  Notation: SampleSize =
  Possible values: Any integer value.
  Example: SampleSize=732
  Note: For haplotypic data, the sample size is equal to the haploid sample size. For genotypic data, the sample size should be equal to the number of diploid individuals present in the sample. When absolute frequencies are entered, the size of each sample is checked against the sum of all haplotypic frequencies. If a discrepancy is found, a warning message is issued in the log file, and the sample size is set to the sum of haplotype frequencies. When relative frequencies are specified, no such check is possible, and the sample size is used to convert relative frequencies to absolute frequencies.

The data itself
  Notation: SampleData =
  Possible values: A list of haplotypes or genotypes and their frequencies as found in the sample, entered within braces.
  Example:
    SampleData={
    id1 1 ACGGTGTCGA
    id2 2 ACGGTGTCAG
    id3 8 ACGGTGCCAA
    id4 10 ACAGTGTCAA
    id5 1 GCGGTGTCAA
    }
  Note: The last closing brace marks the end of the sample definition. A new sample definition begins with another keyword SampleName.

FREQUENCY data type
If the data type is set to FREQUENCY, one must only specify for each haplotype its identifier (a string of characters without blanks) and its sample frequency (either relative or absolute). In this case the haplotype should not be defined.
  Example:
    SampleData={
    id1 1
    id2 2
    id3 8
    id4 10
    id5 1
    }

Haplotypic data
For all data types except FREQUENCY, one must specify for each haplotype its identifier and its sample frequency. If no haplotype list has been defined earlier, one must also define here the allelic content of the haplotype. The haplotype identifier is used to establish a link between the haplotype and its allelic content maintained in a local database. Once a haplotype has been defined, it need not be defined again. However, the allelic content of the same haplotype can also be defined several times. The different definitions of haplotypes with the same identifier are checked for equality. If they are found identical, a warning is issued in the log file. If they are found to be different at some loci, an error is issued and the program stops, asking you to correct the error.
For complex haplotypes like very long DNA sequences, one can simply assign different identifiers to all sequences (each having thus an absolute frequency of 1), even if some sequences turn out to be identical to each other. If the option Infer Haplotypes from Distance Matrix is checked in the General Settings dialog box, Arlequin will check whether haplotypes are effectively different or not. This is a good precaution when one tests the selective neutrality of the sample using Ewens-Watterson or Chakraborty's tests, because these tests are based on the observed number of effectively different haplotypes.
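The sample size check described above for absolute and relative frequencies can be sketched as follows. This Python code is only an illustration of the rule as stated in this manual, not Arlequin's actual implementation; the function name `reconcile_sample_size` and the wording of the warning are hypothetical.

```python
def reconcile_sample_size(declared_size, frequencies, mode="ABS"):
    """Return (sample size, absolute frequencies, warning or None)."""
    if mode == "ABS":
        total = sum(frequencies.values())
        if total != declared_size:
            warning = (f"SampleSize {declared_size} differs from the sum of "
                       f"frequencies {total}; using {total}")
            return total, dict(frequencies), warning
        return declared_size, dict(frequencies), None
    if mode == "REL":
        # No check is possible: the declared size converts the frequencies.
        absolute = {name: freq * declared_size
                    for name, freq in frequencies.items()}
        return declared_size, absolute, None
    raise ValueError("mode must be ABS or REL")


freqs = {"id1": 1, "id2": 2, "id3": 8, "id4": 10, "id5": 1}
size, _, warning = reconcile_sample_size(20, freqs)
print(size, warning)
```

With the frequencies of the example above, whose sum is 22, a declared SampleSize of 20 is replaced by 22 and a warning is produced.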
Genotypic data
For each genotype, one must specify its identifier, its sample frequency, and its allelic content. Genotypic data can be entered either as a list of individuals, all having an absolute frequency of 1, or as a list of genotypes with different sample frequencies. During the computations, Arlequin will compare all genotypes to all others and recompute the genotype frequencies. The allelic content of a genotype is entered on two separate lines in the form of two pseudo-haplotypes.
Examples:
1)
  Id1 2 ACTCGGGTTCGCGCGC   # the first pseudo-haplotype
        ACTCGGGCTCACGCGC   # the second pseudo-haplotype
2)
  my_id 4 0 0 1 1 0 1
          0 1 0 0 1 1
If the gametic phase is supposed to be known, the pseudo-haplotypes are treated as truly defined haplotypes. If the gametic phase is not supposed to be known, only the allelic content of each locus is supposed to be known. In this case an equivalent definition of the genotype in example 2 above would have been:
  my_id 4 0 1 1 0 0 1
          0 0 0 1 1 1
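The equivalence of the two definitions can be verified by comparing, locus by locus, the unordered pair of alleles. The following Python sketch is only an illustration of that idea; the function name `unphased_key` is hypothetical, and this is not necessarily how Arlequin compares genotypes internally.

```python
def unphased_key(first, second):
    """Summarize a genotype by its unordered allele pair at each locus."""
    if len(first) != len(second):
        raise ValueError("both pseudo-haplotypes need the same number of loci")
    return tuple(tuple(sorted(pair)) for pair in zip(first, second))


# The two definitions of genotype my_id given in the text.
a = unphased_key("0 0 1 1 0 1".split(), "0 1 0 0 1 1".split())
b = unphased_key("0 1 1 0 0 1".split(), "0 0 0 1 1 1".split())
print(a == b)  # True: same allelic content at every locus
```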
  Possible values: Any string of characters within quotes.
  Example: StructureName= "A first example of a genetic structure"
  Note: This name will be used to refer to the tested structure in the output files.

The number of groups defined in the structure
  Notation: NbGroups =
  Possible values: Any integer value.
  Example: NbGroups = 5
  Note: If this value does not correspond to the number of defined groups, then calculations will not be possible, and an error message will be displayed.

The group definitions
  Notation: Group =
  Possible values: A list containing the names of the samples belonging to the group, entered within braces. Repeat this for as many groups as you have in your structure. The same population may not be put in different groups. Also note that a comment sign (#) is not allowed after the opening brace and would lead to an error message. Comments about the group should therefore be placed before the definition of the group.
  Example (NbGroups=2):
    Group ={
    population1
    population2
    population3
    }
    Group ={
    population4
    population5
    }

A new genetic Structure Editor is now available to help you with the process of defining the genetic Structure to be tested (see section Defining the Genetic Structure to be tested 2.2.1).
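The constraints on a genetic structure stated above (NbGroups must match the number of groups, and a sample may not belong to two groups) can be checked before loading a project. The following Python sketch is only an illustration: the function name `validate_structure` and its error messages are hypothetical, and the check that every group member is a known sample name is an extra assumption.

```python
def validate_structure(nb_groups, groups, sample_names):
    """Return a list of problems found in a genetic structure definition."""
    errors = []
    if nb_groups != len(groups):
        errors.append(f"NbGroups={nb_groups} but {len(groups)} groups defined")
    seen = {}
    for index, group in enumerate(groups, start=1):
        for name in group:
            if name not in sample_names:
                errors.append(f"unknown sample in group {index}: {name}")
            elif name in seen:
                errors.append(
                    f"{name} is listed in groups {seen[name]} and {index}")
            else:
                seen[name] = index
    return errors


samples = ["population1", "population2", "population3",
           "population4", "population5"]
groups = [["population1", "population2", "population3"],
          ["population4", "population5"]]
print(validate_structure(2, groups, samples))  # [] means no problem found
```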
custom matrix entered into the project by the user. X1 (and X2) have to be defined in the project. This subsection starts with the keyword [[Mantel]]. The matrices, which are used to test correlation between genetic distances and one or two other distance matrices, are defined in this section. One must specify:

The size of the matrices used for the Mantel test.
  Notation: MatrixSize=
  Possible values: Any positive integer value.
  Example: MatrixSize= 5

The number of matrices among which we compute the correlations. If this number is 2, the correlation coefficient between the YMatrix (see next keyword) and the matrix defined after the DistMatMantel keyword is computed. If this number is 3, the partial correlations between the YMatrix (see next keyword) and the two other matrices are computed. In this case the Mantel section should contain two DistMatMantel keywords, each followed by the definition of a distance matrix.
  Notation: MatrixNumber=
  Example: MatrixNumber= 2

The matrix that is used as genetic distance. If the value is fst, then the correlation between the population pairwise FST matrix and another matrix is computed. If the value is custom, then the correlation between a project-defined matrix and another matrix is computed.
  Notation: YMatrix=
  Possible values and corresponding YMatrix:
    "fst"                    Y = Fst
    "log_fst"                Y = log(Fst)
    "slatkinlinearfst"       Y = Fst/(1-Fst)
    "log_slatkinlinearfst"   Y = log(Fst/(1-Fst))
    "nm"                     Y = (1-Fst)/(2 Fst)
    "custom"                 Y = user-specified in the project
  Example: YMatrix = fst

Labels that identify the columns of the YMatrix. In case of YMatrix = fst, the labels should be the names of the populations from which we use the pairwise FST distances. In case of YMatrix = custom, the labels can be chosen by the user. These labels will be used to select the sub-matrices on which the correlation (or partial correlation) is computed.
  Possible values: A list containing the label names, entered within braces.
  Example:
    YMatrixLabels = {
    "Population1 "
    "Population4"
    "Population2"
    "Population8"
    "Population5"
    }

A keyword that allows one to define a matrix with which the correlation with the YMatrix is computed.
  Notation: DistMatMantel =
  Example:
    DistMatMantel={
    0.00
    3.20 0.00
    0.47 0.76 0.00
    0.00 1.23 0.37 0.00
    0.22 0.37 0.21 0.38 0.00
    }

Labels defining the sub-matrix on which the correlation is computed.
  Notation: UsedYMatrixLabels=
  Possible values: A list containing the label names, entered within braces.
  Example:
    UsedYMatrixLabels={
    "Population1 "
    "Population5"
    "Population8"
    }
  Note: If you want to compute the correlation between entirely user-specified matrices, you need to list a dummy population sample in the [[Sample]] section, in order to allow for a proper reading of the Arlequin project. We hope to remove this limitation, but it is the way it works for now.

Two complete examples:
Example 1: We compute the partial correlation between the YMatrix and two other matrices X1 and X2. The YMatrix will be the pairwise FST matrix between the populations listed after YMatrixLabels. The partial correlations will be based on the 3 by 3 matrix whose labels are listed after UsedYMatrixLabels.
[[Mantel]]
#size of the distance matrix:
MatrixSize= 5
#number of declared matrixes:
MatrixNumber=3
#what to be taken as the YMatrix
YMatrix="Fst"
#Labels to identify matrix entry and Population
YMatrixLabels ={
"pop 1" "pop 2" "pop 3" "pop 4" "pop 5"
}
# distance matrix: X1
DistMatMantel={
0.00
1.20 0.00
0.17 0.84 0.00
0.00 1.23 0.23 0.00
0.12 0.44 0.21 0.12 0.00
}
# distance matrix: X2
DistMatMantel={
0.00
3.20 0.00
0.47 0.76 0.00
0.00 1.23 0.37 0.00
0.22 0.37 0.21 0.38 0.00
}
UsedYMatrixLabels ={
"pop 1" "pop 3" "pop 4"
}

Example 2: We compute the correlation between the YMatrix and another matrix X1. The YMatrix will be the first matrix defined in the section, as requested by YMatrix="Custom". The correlation will be based on the sub-matrix whose labels are listed after UsedYMatrixLabels (here, all five labels).

[[Mantel]]
#size of the distance matrix:
MatrixSize= 5
#number of declared matrixes: 1 or 2
MatrixNumber=2
#what to be taken as YMatrix
YMatrix="Custom"
#Labels to identify matrix entry and Population
YMatrixLabels ={
"1" "2" "3" "4" "5"
}
#This will be the Ymatrix
DistMatMantel={
0.00
1.20 0.00
1.17 0.84 0.00
1.00 1.23 0.23 0.00
2.12 0.44 0.21 0.12 0.00
}
#This will be X1
DistMatMantel={
0.00
3.20 0.00
2.23 1.73 0.00
2.55 2.23 0.35 0.00
2.23 1.62 1.54 2.32 0.00
}
UsedYMatrixLabels ={
"1" "2" "3" "4" "5"
}
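To make the role of the Mantel keywords more concrete, here is a minimal Python sketch of what they describe: expanding a lower-triangular DistMatMantel listing into a square matrix, selecting the sub-matrix named by UsedYMatrixLabels, and correlating the off-diagonal entries of two matrices. It is not Arlequin's own code. All function names are hypothetical, the use of natural logarithms in the YMatrix transformations is an assumption (the manual does not state the base), and the permutation step of the Mantel test is not shown.

```python
import math

# YMatrix keywords and the transformation of Fst they stand for.
# Assumption: natural logarithms; the manual does not state the base.
Y_TRANSFORMS = {
    "fst": lambda fst: fst,
    "log_fst": lambda fst: math.log(fst),
    "slatkinlinearfst": lambda fst: fst / (1.0 - fst),
    "log_slatkinlinearfst": lambda fst: math.log(fst / (1.0 - fst)),
    "nm": lambda fst: (1.0 - fst) / (2.0 * fst),
}


def lower_triangle_to_square(values, size):
    """Expand a lower-triangular listing (diagonal included) into a symmetric matrix."""
    if len(values) != size * (size + 1) // 2:
        raise ValueError("expected size*(size+1)/2 values")
    matrix = [[0.0] * size for _ in range(size)]
    k = 0
    for i in range(size):
        for j in range(i + 1):
            matrix[i][j] = matrix[j][i] = values[k]
            k += 1
    return matrix


def selected_pairs(matrix, labels, used_labels):
    """Off-diagonal entries of the sub-matrix restricted to used_labels."""
    idx = [labels.index(name) for name in used_labels]
    return [matrix[a][b] for pos, a in enumerate(idx) for b in idx[:pos]]


def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Data copied from Example 2 above.
labels = ["1", "2", "3", "4", "5"]
y_matrix = lower_triangle_to_square(
    [0.00, 1.20, 0.00, 1.17, 0.84, 0.00, 1.00, 1.23, 0.23, 0.00,
     2.12, 0.44, 0.21, 0.12, 0.00], 5)
x1_matrix = lower_triangle_to_square(
    [0.00, 3.20, 0.00, 2.23, 1.73, 0.00, 2.55, 2.23, 0.35, 0.00,
     2.23, 1.62, 1.54, 2.32, 0.00], 5)
used = ["1", "2", "3", "4", "5"]
r = pearson(selected_pairs(y_matrix, labels, used),
            selected_pairs(x1_matrix, labels, used))
print(f"r(Y, X1) = {r:.3f}")
```

Passing a shorter list such as ["1", "3", "4"] as the used labels restricts the computation to the corresponding 3 by 3 sub-matrix, which is the purpose of UsedYMatrixLabels.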
3.3 Example of an input file

[Profile]
Title="Fake HLA data"
NbSamples=4
GenotypicData=1
GameticPhase=0
DataType=STANDARD
LocusSeparator=WHITESPACE
MissingData='?'
[Data]
[[Samples]]
SampleName="A sample of 6 Algerians"
SampleSize=6
SampleData={
1 1 1104 0200
    0700 0301
3 3 0302 0200
    1310 0402
4 2 0402 0602
    1502 0602
}
SampleName="A sample of 11 Bulgarians"
SampleSize=11
SampleData={
1 1 1103 0301
    0301 0200
2 4 1101 0301
    0700 0200
}
SampleName="A sample of 12 Egyptians"
SampleSize=12
SampleData={
1 2 1104 0301
    1600 0502
3 1 1303 0301
    1101 0502
4 3 1502 0601
    1500 0602
6 1 1101 0301
    1101 0301
8 4 1302 0502
    1101 0609
9 1 1500 0302
    0402 0602
}
SampleName="A sample of 8 French"
SampleSize=8
SampleData={
219 1 0301 0200
      0101 0501
239 2 0301 0200
      0301 0200
249 1 1302 0604
      1500 0602
250 3 1401 0503
      1301 0603
254 1 1302 0604
}
[[Structure]]
StructureName="My population structure"
NbGroups=2
Group={
"A sample of 6 Algerians"
"A sample of 12 Egyptians"
}
Group={
"A sample of 11 Bulgarians"
"A sample of 8 French"
}
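The layout of a genotypic SampleData block (identifier, count and one allele per locus on a first line, the second allele per locus on the next line) can be illustrated with a small parser. This is only a sketch under stated assumptions: the function name is hypothetical, the locus separator is assumed to be whitespace as in the example above, and the number of loci must be supplied by the caller. It is not how Arlequin itself reads project files.

```python
def parse_genotypic_sample(sample_data, n_loci):
    """Parse the body of a genotypic SampleData block.

    Returns a list of (identifier, count, genotype) tuples, where genotype
    is a list with one (allele, allele) pair per locus.
    Assumes whitespace-separated tokens (LocusSeparator=WHITESPACE).
    """
    tokens = sample_data.split()
    per_individual = 2 + 2 * n_loci  # id + count + two alleles per locus
    if len(tokens) % per_individual != 0:
        raise ValueError("incomplete genotype in SampleData")
    individuals = []
    for start in range(0, len(tokens), per_individual):
        chunk = tokens[start:start + per_individual]
        identifier, count = chunk[0], int(chunk[1])
        first_line = chunk[2:2 + n_loci]   # alleles listed on the first line
        second_line = chunk[2 + n_loci:]   # alleles listed on the second line
        individuals.append((identifier, count, list(zip(first_line, second_line))))
    return individuals


# The Algerian sample from the example above (2 loci).
ALGERIANS = """
1 1 1104 0200
    0700 0301
3 3 0302 0200
    1310 0402
4 2 0402 0602
    1502 0602
"""
individuals = parse_genotypic_sample(ALGERIANS, n_loci=2)
total = sum(count for _, count, _ in individuals)
print(total)  # 6, which matches SampleSize=6
```

Comparing the summed counts with the declared SampleSize is a cheap consistency check to run before loading a hand-edited project.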
See section Project Wizard (6.3.4) for more information on how to set up the different parameters.
The translation procedure is more fully described in the Import data section (6.3.5). These conversion routines were written on the basis of the input file format descriptions found in the user manuals of each of the aforementioned programs. The tests done with the example files distributed with these programs worked fine. However, the original reading procedures of the other software packages may be more tolerant than our own, and some data may be impossible to convert. Thus, some small corrections may need to be done by hand, and we apologize for that.
On the left tree pane you can see the project files listed in the batch file.
Settings choice: You can either use the same options for all project files by selecting Use interface settings, or use the setting file associated with each project file by selecting Use associated settings. In the first case, the same analyses will be performed on all project files listed in the batch file. In the second case, you can perform different computations on each project file listed in the batch file, giving you much more flexibility on what should be done. However, it implies that setting files have been prepared previously, recording the analyses to be performed on the data, as well as the options of these analyses. If the associated setting file does not exist, the current settings are used.
Results to summarize: Some results can be collected from the analysis of each project file in the batch, and put into summary files. See section Batch files (6.3.7) for additional information.
Note that the batch file, the project files, and the setting files should all be in the same folder.
4 OUTPUT FILES
The result files are all output in a special sub-directory, having the same name as your project, but with the ".res" extension. This has been done to structure your result files according to different projects. For instance, if your project file is called my_file.arp, then the result files will be in a sub-directory called [my_file.res]
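The naming rule can be expressed in a few lines of Python. This is a hedged sketch: the helper name is hypothetical, and the location of the main result file ([project name].htm inside the result sub-directory) is taken from the description of the HTML output in this manual.

```python
from pathlib import Path


def result_locations(project_file):
    """Return (result directory, main HTML result file) for a project file."""
    project = Path(project_file)
    result_dir = project.with_suffix(".res")            # my_file.arp -> my_file.res
    main_result = result_dir / (project.stem + ".htm")  # my_file.res/my_file.htm
    return result_dir, main_result


result_dir, main_result = result_locations("my_file.arp")
print(result_dir)   # my_file.res
print(main_result.name)  # my_file.htm
```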
1) The left pane contains a tree where each first-level branch corresponds to a run. For each run we have several entries corresponding to the settings used for the calculation, the inter-population analyses (genetic structure, shared haplotypes, etc.) and finally all intra-population analyses, with one entry per population sample. The description of this tree is stored in [project name]_tree.html. Note that this tree uses the JavaScript files ftiens4.js and ua.js located in Arlequin's installation directory. If you move Arlequin to another location, or uninstall Arlequin, the left pane will not work anymore.
2) The right pane shows the results concerning the item selected in the left pane. The HTML code of this pane is in the main result file. This file is located in the result sub-directory of your project and is named [project name].htm.
The following figure illustrates how results are presented in your HTML browser.
5.2 Example of standard data (Genotypic data, unknown gametic phase, recessive alleles)
In this example, the individual genotypes for 5 HLA loci are entered on two separate lines. We specify that the gametic phase between loci is unknown, and that the data has a recessive allele. We explicitly define it to be "xxx". Note that with recessive data, all single-locus homozygotes are also considered as potential heterozygotes with a null allele. We also provide Arlequin with the minimum frequency for the estimated haplotypes to be listed (0.00001), and we define the minimum epsilon value (sum of haplotype frequency differences between two steps of the EM algorithm) to be reached for the EM algorithm to stop when estimating haplotype frequencies.

[Profile]
Title="Genotypic Data, Phase Unknown, 5 HLA loci"
NbSamples=1
GenotypicData=1
DataType=STANDARD
LocusSeparator=WHITESPACE
MissingData='?'
GameticPhase=0
RecessiveData=1
RecessiveAllele="xxx"
[Data]
[[Samples]]
SampleName="Population 1"
SampleSize=63
SampleData={
MAN0102 12 A33 Cw10
           A33 Cw10
MAN0103 22 A33 Cw10
           A33 Cw10
MAN0108 23 A23 Cw6
           A29 Cw7
MAN0109 6  A30 Cw4
           A68 Cw4
}
[Profile]
Title="A small example of microsatellite data"
NbSamples=4
GenotypicData=1
#Unknown gametic phase between the 2 loci
GameticPhase=0
DataType=MICROSAT
LocusSeparator=WHITESPACE
[Data]
[[Samples]]
SampleName="MICR1"
SampleSize=28
SampleData= {
Genot1 27 12 23 17
          13 22 16
Genot2 1  15 22 16
          13 22 16
}
SampleName="MICR2"
SampleSize=59
12 12 15 13 14 14
24 22 20 22 22 23
18 16 18 18 16 16
12 13 12 13 10 12
21 22 20 23 22 22
16 15 16 16 15 15
13 13 12 13
24 23 24 23
16 17 16 16
}
[[Structure]]
StructureName="Test microsat structure"
NbGroups=2
#The first group is made up of the first 2 samples
Group={
"MICR1"
"MICR2"
}
#The last 2 samples will be put into the second group
Group={
"MICR3"
"MICR4"
}
#We tell Arlequin to compute Euclidian square distances between
#the haplotypes listed below
MissingData='?'
[Data]
[[HaplotypeDefinition]]
HaplListName="A fictive list of RFLP haplotypes"
HaplList= {}
[[Samples]]
#1
SampleName="pop 1"
SampleSize=28
SampleData= {
1 27
40 1
}
StructureName="A single group of 3 samples"
NbGroups=1
Group={
"pop 1"
"pop 2"
"pop 3"
}
6 ARLEQUIN INTERFACE
The interface of Arlequin ver. 3.0 has been completely rewritten in C++ and looks like:
The graphical interface is made up of a series of tabbed dialog boxes, whose content varies dynamically depending on the type of data currently analyzed.
Project information: Open tab dialog with information on the current project.
Settings: Open specific tab dialogs to activate some computations and choose their associated settings.
View Project: View the current project in the text editor.
View Results: View computation results in the default web browser.
View Log file: View the log file in the text editor.
Show button text: Toggle presence/absence of text associated with toolbar buttons.

Append results: If checked, the results of a new analysis are added at the end of the current result file. Otherwise, previous results are deleted before adding the new results.
Use associated settings: Check this box if you want Arlequin to automatically load the settings associated with each project. If this box is unchecked, the same settings will be used for different projects (see section 6.3.3).
Keep Amova null distributions: If checked, the null distributions of the variance components are written in specific files (see section 6.3.3).
Prompt for handling unphased multi-locus data: If checked, you will have the option of estimating the gametic phase of unphased genotype data with the ELB algorithm (see section 6.3.8.4.2.1).
The menu to get access to the Help File System.
Arlequin PDF Help file: Open the Arlequin help file. Actually it tries to open the file "arlequin.pdf". You thus need to have installed the Adobe Acrobat extensions in your web browser.
Arlequin web site: Link to the Arlequin web site http://cmpg.unibe.ch/software/arlequin3
About Arlequin: Some information about Arlequin, its authors, contact address and the Swiss NSF grants that supported its development.
6.2 Toolbar
Arlequin's toolbar contains icons that are shortcuts to some commonly used menu items, as shown below. Clicking on one of these icons is equivalent to activating the corresponding menu item.
[i] : parameter to be set in the dialog box as an integer.
[b] : check box (two states: checked or unchecked).
[m] : multiple selection radio buttons.
[l] : list box, allowing the selection of an item in a downward scrolling list.
[r] : read only setting, cannot be changed by the user.
In this dialog box, you can locate an existing Arlequin project on your hard disk. Alternatively you can use the File | Recent Projects menu to reload one of the last 10 projects you worked on.
If the menu "Prompt for handling unphased multi-locus data" is checked in the Options menu (see section 6.1.3), this dialog box will appear when projects containing genotypic data with unknown phase are loaded. The two options appearing in the dialog box are self-explanatory, and the settings for the ELB algorithm are described in the Settings for the ELB algorithm and ELB algorithm sections (6.3.8.4.2.1 and 7.1.3.2.3).
If you choose to estimate the gametic phase with the ELB algorithm, then Arlequin project files (as many as the variable No. of files to generate in the distribution defined above) are written in a subdirectory of the result directory called PhaseDistribution. They have the name ELB_EstimatedPhase#<Sample number>.arp. Arlequin also outputs a file called ELB_Best_Phases.arp containing, for each individual, the gametic phases estimated with the ELB algorithm, as well as a batch file ELB_PhaseDistribution.arb listing all aforementioned project files.
The file ELB_Best_Phases.arp can then be analyzed as if gametic phases were known for the different samples. Keep in mind, however, that the estimated gametic phases are not necessarily correct, and that analyses assuming that the gametic phase is known will not take into account possible gametic phase estimation errors.
Different options can be specified in this tab dialog.
Use associated settings: By checking the Use associated settings checkbox, the settings and options last specified for your project will be used when opening a project file. When closing a project file, Arlequin automatically saves the current calculation settings for that particular project. Check this box if you want Arlequin to automatically load the settings associated with each project. If this box is unchecked, the same settings will be used for different projects.
Append results: If the option Append Results is checked, the results of the current computations are appended to those of previous analyses. Otherwise, only the results of the last analysis are written in the result file, and previous results are erased.
Keep AMOVA null distributions: If this option is checked, the null distributions of $\sigma_a^2$, $\sigma_b^2$, $\sigma_c^2$, and $\sigma_d^2$ are written to files having the same name as the project file, but with the extensions .va, .vb, .vc, and .vd, respectively.
Helper programs:
Text editor: press on the Browse button to locate the text editor you want to use to edit or view your project file and to view the Arlequin Log File.
In order to help you quickly set up a project file, Arlequin can create the outline of a project file for you. This tab dialog allows you to quickly define which type of data you have and some of its properties.
Browse button: Allows you to specify the name and the directory location of the new project file. Pressing that button opens a File dialog box. The project file should have the extension .arp.
Create project button: Press that button once you have specified all other properties of the project.
Edit project button: This button becomes active once you have created an outline, and allows you to begin editing the outline and fill in some data.
Data type:
  Specify which type of data you want to analyze (DNA, RFLP, Microsat, Standard, or Frequency).
  Specify if the data is under genotypic or haplotypic form.
  Specify if the gametic phase is known (for genotypic data only).
  Specify if there are recessive alleles (for genotypic data only).
Controls:
  Specify the number of population samples defined in the project.
  Choose a locus separator.
  Specify the character coding for missing data.
Optional sections:
  Specify if you want to include a global list of haplotypes.
  Specify if you want to include a predefined distance matrix.
  Specify if you want to include a group structure.
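As an illustration of what such an outline contains, the following Python sketch assembles a skeleton project from the properties listed above, using only keywords that appear in the example files of this manual. It is not the wizard's actual output: the function name, the placeholder sample names and the SampleSize=0 placeholders are assumptions made for the example, and the optional sections are left out.

```python
def project_outline(title, data_type, n_samples, genotypic,
                    known_phase=False, recessive=False, missing="?"):
    """Build the text of a skeleton Arlequin project (hypothetical helper)."""
    lines = [
        "[Profile]",
        f'Title="{title}"',
        f"NbSamples={n_samples}",
        f"GenotypicData={int(genotypic)}",
    ]
    if genotypic:
        # Gametic phase and recessive alleles only apply to genotypic data.
        lines.append(f"GameticPhase={int(known_phase)}")
        lines.append(f"RecessiveData={int(recessive)}")
    lines += [
        f"DataType={data_type}",
        "LocusSeparator=WHITESPACE",
        f"MissingData='{missing}'",
        "[Data]",
        "[[Samples]]",
    ]
    for i in range(1, n_samples + 1):
        # Placeholder samples, to be filled in by hand afterwards.
        lines += [f'SampleName="Sample {i}"', "SampleSize=0", "SampleData={", "}"]
    return "\n".join(lines) + "\n"


outline = project_outline("My outline", "STANDARD", n_samples=2, genotypic=True)
print(outline)
```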
With this dialog box you can quickly translate data into several other file formats often used in population genetics analyses. The currently supported formats are:
  Arlequin
  Mega ver. 1.0
  GenePop ver. 1.0
  Biosys ver. 1.0
  Phylip ver. 3.5
  Win Amova ver. 1.55
The translation procedure is as follows:
1) Select the source file with the upper left Browse button.
2) Select the format of the source data file, as well as that of the target file.
3) A default extension depending on the data format is automatically given to the target file.
4) The file conversion is launched by pressing the Translate button.
5) In some cases, you might be asked for some additional information, for instance if input data is split into several input files (like in WinAmova).
6) If you have selected the translation of a data file into the Arlequin file format, you will have the option to load the newly created project file into the Arlequin interface.
Once a project has been loaded, the Project tab dialog becomes active. It shows a brief outline of the project in an explorable tree pane, and some information on the data type. The project can be edited by pressing the View Project button on the Toolbar, which will launch the text editor currently specified in the Arlequin Configuration tab. All the information shown under the project profile section is read only. In order to modify it, you need to edit the project file with your text editor and reload the project with the File | Recent projects menu.
File name [r]: The location and the name of the current project.
Project title [r]: The title of the project as entered in the input file.
Ploidy [r]: Specifies whether input data consist of diploid genotypic data or haplotypic data. For genotypic data, the diploid information of each genotype is entered on separate lines in the input file.
Gametic phase [r]: Specifies whether the gametic phase is known or unknown when the input file is made up of genotypic data. If the gametic phase is known, then the treatment of the data will be essentially similar to that of haplotypic data.
Data type [r]: Data type specified in the input file.
Dominance [r]: Specifies if the data consists of only co-dominant data or if some recessive alleles can occur.
Recessive allele [r]: Specifies the identifier of the recessive allele.
Locus separator [r]: The character used to separate allelic information at adjacent loci.
Missing data [r]: The character used to represent missing data at any locus. By default, a question mark (?) is used for unknown alleles.
The project files found in the selected batch file are listed in the left pane window.
Use associated settings [b]: Use this button if you have prepared settings files associated with each project.
Use interface settings [b]: Use this button if you want to use the same predefined calculation settings for all project files.
Results to summarize: This option allows you to collect a summary of the results for each file found in the batch list. These results are written in different files, having the extension *.sum. These summary files will be placed into the same directory as the batch file.

List of summary files created by activating different checkboxes

Checkbox                  Summary file       Description
Gene diversity            gen_div.sum        Gene diversity of each sample
Nucleotide composition    nucl_comp.sum      Nucleotide composition of each sample
Molecular diversity       mold_div.sum       Molecular diversity indexes of each sample
Mismatch distribution     mismatch.sum       Mismatch distribution for each sample
Theta values              theta.sum          Different theta values for each sample
Linkage disequilibrium    l_d_pro.sum        Significance level of linkage disequilibrium for each pair of loci
                          link_dis.sum       Number of significantly linked loci per locus
Hardy Weinberg            hw.sum             Test of departure from Hardy-Weinberg equilibrium
Tajima's test             tajima.sum         Tajima's test of selective neutrality
Fu's Fs test              fu_fs.sum          Fu's FS test of selective neutrality
Ewens Watterson           ewens.sum          Ewens-Watterson tests of selective neutrality
Chakraborty's test        chakra.sum         Chakraborty's test of population amalgamation
Population comparisons    coanst_c.sum       Matrix of Reynolds genetic distances (in linear form)
                          NM_value.sum       Matrix of Nm values between pairs of populations (in linear form)
                          slatkin.sum        Matrix of Slatkin's genetic distance (in linear form)
                          tau_uneq.sum       Matrix of divergence times between populations, taking into account unequal population sizes (in linear form)
                          pairdiff.sum       Matrix of mean number of pairwise differences between pairs of samples (in linear form)
                          pairdist.sum       Different genetic distances for each pair of populations (only clearly readable if there are 2 samples in the project)
Allele frequencies        allele_freqs.sum   List of allele frequencies for all populations in turn. It becomes difficult to read when more than a single population is present in the project file.
The Settings tab is divided into two zones. On the left, a tree structure allows the user to quickly select which task to perform. The options for those tasks (settings) appear on the right pane of the tab dialog. If you select the first Arlequin settings node on the tree, a list of the different tasks that can be set up appears on the right pane. Clicking on these underlined blue links will lead you to the appropriate settings panes. If a particular task has been selected, it will be reflected by a red dot on the left side of the task in the tree structure.
Settings management: Three buttons are also shown on the upper left of the tab dialog:
Reset: Resets all settings to default values and unchecks all tasks.
Load: Loads a particular set of settings previously saved into a settings file (extension ".ars").
Save: Saves the current settings into a given settings file (extension ".ars").
Project file [r]: The name of the project file containing the data to be analyzed (it usually has the ".arp" extension). Result files: The html file containing the results of the analyses generated by Arlequin (it has the same name as the project file, but the ".htm" extension). Polymorphism control: Allowed missing level per site [f]: Specify the fraction of missing data allowed for any locus to be taken into account in the analyses. For instance, a level of 0.05 means that a locus with more than 5% of missing data will not be considered in any analysis. This option is especially useful when dealing with DNA data where different individuals have been sequenced for slightly different fragments. Setting a level of zero will force the analysis to consider only those sites that have been sequenced in all individuals. Alternatively, choosing a level of one means that all sites will be considered in the analyses, even if they have not been sequenced in any individual (not a very smart choice, however).
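The effect of the allowed missing level can be sketched as a simple filter over sites. This is an illustration only, not Arlequin's implementation: the function name is hypothetical, sequences are assumed to be aligned strings of equal length, and "?" is assumed to be the missing data character.

```python
def usable_sites(sequences, allowed_missing_level, missing_symbol="?"):
    """Return the indices of sites whose fraction of missing data does not
    exceed allowed_missing_level (hypothetical helper)."""
    n_individuals = len(sequences)
    kept = []
    for site in range(len(sequences[0])):
        n_missing = sum(1 for seq in sequences if seq[site] == missing_symbol)
        if n_missing / n_individuals <= allowed_missing_level:
            kept.append(site)
    return kept


sequences = ["ACGT", "A?GT", "A?G?", "ACG?"]
print(usable_sites(sequences, 0.0))  # [0, 2]: only sites sequenced in all individuals
print(usable_sites(sequences, 1.0))  # [0, 1, 2, 3]: every site is kept
```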
Transversion weight [f]: The weight given to transversions when comparing DNA sequences.
Transition weight [f]: The weight given to transitions when comparing DNA sequences.
Deletion weight [f]: The weight given to deletions when comparing DNA or RFLP sequences.
Haplotype definition
Use original definition [m]: Haplotypes are identified according to their original identifier, without considering the fact that their molecular definition could be identical.
Infer from distance matrix [m]: Similar haplotypes will be identified by computing a distance matrix based on the settings chosen above. When this option is activated, a search for shared haplotypes is automatically performed at the beginning of each run, and new haplotype definitions and frequencies are computed for each population.
Standard diversity indices [b]: Compute several common indices of diversity, like the number of alleles, the number of segregating loci, the heterozygosity level, etc. (see section 7.1.1).
Molecular diversity indices [b]: Check box for computing several indices of diversity at the molecular level.
Compute minimum spanning tree among haplotypes [b]: Computes a minimum spanning tree and a minimum spanning network among the haplotypes found in each population sample (see section 7.1.2.9). This option is only valid for haplotypic data.
Molecular distance [l]: Choose the type of distance used when comparing haplotypes (see section 7.1.2.5 and below).
  Gamma a value [f]: Set the value for the shape parameter of the gamma function, when selecting a distance allowing for unequal mutation rates among sites. This option is only valid for some distances computed between DNA sequences. Note that a value of zero deactivates the Gamma correction of these distances here, whereas in reality a value of infinity would deactivate the Gamma correction procedure.
Print distance matrix between haplotypes [b]: If checked, the inter-haplotypic distance matrix used to evaluate the molecular diversity is printed in the result file.
Theta(Hom) [b]: An estimation of θ
segregating site S (see section 7.1.2.3.2).
Theta(k) [b]: An estimation of θ obtained from the observed number of
Estimate parameters of demographic expansion [b]: The parameters of an instantaneous demographic expansion are estimated from the mismatch distribution (see section 7.1.2.4) using a generalized least-square approach, as described in Schneider and Excoffier (1999) (see section 7.1.2.4.1).
Estimate parameters of spatial expansion [b]: Estimate the specific parameters of a spatial expansion, following Excoffier (2004) (see section 7.1.2.4.2).
Molecular distance [l]: Here we only allow one genetic distance: the mere number of observed differences between haplotypes.
Number of bootstrap replicates [l]: The number of coalescent simulations performed using the estimated parameters of the demographic or spatial expansion. These parameters will be re-estimated for each simulation in order to obtain their empirical confidence intervals, and the empirical distribution of output statistics such as the sum of squared deviations between the observed and the expected mismatch, the raggedness index, or percentile values for each point of the expected mismatch (see section 7.1.2.4). Hundreds to thousands of simulations are necessary to obtain meaningful estimates.
Search for shared haplotypes [b]: Look for haplotypes that are effectively similar after computing pairwise genetic distances according to the distance calculation settings in the General Settings section. For each pair of populations, the shared haplotypes will be printed out. Then will follow a table that contains, for every group of identified haplotypes, its absolute and relative frequency in each population. This task is only possible for haplotypic data or genotypic data with known gametic phase. Haplotype definition: Use original definition [m]: Haplotypes are identified according to their original identifier, without considering the fact that their molecular definition could be identical.
Infer from distance matrix [m]: Similar haplotypes will be identified by computing a molecular distance matrix between haplotypes.
Haplotype frequency estimation:
Estimate haplotype frequencies by mere counting [b]: Estimate the maximum-likelihood haplotype frequencies from the observed data using a mere gene counting procedure.
Estimate allele frequencies at all loci: Estimate allele frequencies at all loci separately.

6.3.8.4.2 Genotypic data with unknown gametic phase
When the gametic phase is unknown, two methods can be used to infer haplotypes: the (maximum-likelihood) EM algorithm or the (Bayesian) ELB algorithm.
6.3.8.4.2.1 Settings for the ELB algorithm
The ELB algorithm has been described recently in Excoffier et al. (2003).
Use ELB algorithm to estimate gametic phase [b]: Check this box if you want to estimate the gametic phase of multi-locus genotypes with the ELB algorithm. See the methodological section on the ELB algorithm (7.1.3.2.3) for a description of the algorithm.
Dirichlet prior alpha value [f]: Value of the alpha parameter of the prior Dirichlet distribution of haplotype frequencies. Recommended value: a small value like 0.01 has been found to work well for all data types (Excoffier et al. 2003) (see section 7.1.3.2.3 for details).
Epsilon value [f]: Value of the parameter controlling how much haplotypes differing by a single mutation from potentially present haplotypes are weighted. Recommended values: 0.1 for microsatellite data, and 0.01 for other data types (see section 7.1.3.2.3 for details).
Heterozygote site influence zone [i]: Defines the number of sites adjacent to heterozygote sites that need to be taken into account when computing haplotype frequencies in the Gibbs chain. A value of zero implies that the gametic phase will be estimated only on the basis of heterozygote sites. A negative value indicates that all sites (homozygotes and heterozygotes) will be used. This parameter is mostly useful for inferring the gametic phase of DNA sequences where there are only a few heterozygote sites among long stretches of homozygous sites (see section 7.1.3.2.3 for details).
Gamma value [f]: This parameter prevents the adaptive windows in which the gametic phase is estimated from growing too much. It can be set to zero for microsatellite data, and to a small value, like 0.01, for other data sets (see section 7.1.3.2.3 for details).
Sampling interval [i]: The number of steps in the Gibbs chain between two consecutive samples of gametic phases.
Number of samples [i]: The number of samples of gametic phases one wants to draw in the Gibbs chain to get the posterior distribution of gametic phases (and haplotype frequencies) for each individual (see section 7.1.3.2.3 for details).
Burnin steps [i]: The number of steps to perform in the Gibbs chain before sampling gametic phases. The total number of steps in the chain will thus be: Burnin steps + (Number of samples × Sampling interval) (see section 7.1.3.2.3 for details).
Recombination steps [i]: The proportion of steps in the Gibbs chain that consist of a pseudo-recombination phase update instead of a simple phase switch (corresponding to a double recombination around a heterozygous site) (see section 7.1.3.2.3 for details).
Output phase distribution files [b]: Controls whether Arlequin files are output with the gametic phase of each sample in the Gibbs chain. The Arlequin files (as many as the variable Number of samples defined above) are written in a subdirectory of the result directory called PhaseDistribution. They have the name ELB_EstimatedPhase#<Sample number>.arp.
Arlequin also outputs a file called ELB_Best_Phases.arp containing for each individual the gametic phases estimated with the ELB algorithm, as well as batch file ELB_PhaseDistribution.arb listing all aforementioned project files.
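The relation between the three chain-length settings can be checked with a one-line computation. The function name and the numerical values below are hypothetical and only serve to illustrate the formula given under Burnin steps.

```python
def total_gibbs_steps(burnin_steps, number_of_samples, sampling_interval):
    """Total chain length: burnin + (number of samples x sampling interval)."""
    return burnin_steps + number_of_samples * sampling_interval


# Illustrative values only, not recommended settings.
print(total_gibbs_steps(burnin_steps=100_000, number_of_samples=2_000,
                        sampling_interval=500))  # 1100000
```

Since the chain length grows with the product of the last two settings, doubling either the number of samples or the sampling interval roughly doubles the computing time once the burnin is small by comparison.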
Use EM algorithm to estimate ML haplotype frequencies [b]: We estimate the maximum-likelihood (ML) haplotype frequencies from the observed data using an Expectation-Maximization (EM) algorithm for multi-locus genotypic data when the gametic phase is not known, or when recessive alleles are present (see section 7.1.3.2).
Perform EM algorithm at the:
Haplotype level [m]: Estimate haplotype frequencies for haplotypes defined by alleles at all loci.
Locus level [m]: Estimate allele frequencies for each locus.
Haplotype and locus levels [m]: The two previous options are performed one after the other.
Epsilon value [l]: Threshold for stopping the EM algorithm. After each iteration, Arlequin checks if the current haplotype frequencies are different from those at the previous iteration. If the sum of the differences is smaller than epsilon, the algorithm stops.
Significant digits for output [l]: Precision required for the output of haplotype frequencies. Haplotypes having a zero frequency given the required precision are not output in the result file.
Number of starting points for EM algorithm [i]: Set the number of random initial conditions from which the EM algorithm is started to repeatedly estimate haplotype frequencies. The haplotype frequencies globally maximizing the likelihood of the sample will be kept eventually. Figures of 50 or more are usually in order.
Maximum no. of iterations [i]: Set the maximum number of iterations allowed in the EM algorithm. The iterative process will have at most this number of iterations, but may stop earlier if convergence has been reached. Here, convergence is reached when the sum of the differences between haplotype frequencies at two successive iterations is smaller than the epsilon value defined above.
Use Zipper version of EM [b]: Use the zipper version of the EM algorithm, which consists in building haplotypes progressively by adding one locus at a time (see section 7.1.3.2.2).
No. of loci orders [l]: Defines how many random loci orders should be used in the zipper version of the EM algorithm. The haplotype frequencies obtained for the locus order leading to the best likelihood are shown in the result file.
Recessive data [b]: Specify whether a recessive allele is present. This option applies to all loci. The code for the recessive allele can be specified in the project file (see section 3.2.1).
Estimate standard deviation through bootstrap [b]: Uses a bootstrap approach to estimate the standard deviation of haplotype frequencies.
No. of bootstrap to perform [i]: Set the number of parametric bootstrap replicates of the EM estimation process on random samples generated from a fictive population having haplotype frequencies equal to the previously estimated ML frequencies. This procedure is used to generate the standard deviation of haplotype frequencies. When set to zero, the standard deviations are not estimated.
No. of starting points for s.d. estimation [i]: Set the number of initial conditions for the bootstrap procedure. It may be smaller than the number of initial conditions set when estimating the haplotype frequencies, because the bootstrap replicates are quite time-consuming. Setting this number to small values is conservative, in the sense that it usually inflates the standard deviations.
Linkage disequilibrium between all pairs of loci [b]: Test for the presence of significant association between pairs of loci, based on an exact test of linkage disequilibrium. This test can be done with all data types except the FREQUENCY data type. The number of loci can be arbitrary, but if there are fewer than two polymorphic loci, there is no point in performing this test. The test procedure is analogous to Fisher's exact test on a two-by-two contingency table but extended to a contingency table of arbitrary size (see section 7.1.4.1).
No. of steps in Markov chain [i]: The maximum number of alternative tables to explore. Figures of 100,000 or more are in order. Larger values will lead to a better precision of the P-value as well as its estimated standard deviation.
No. of dememorization steps [i]: The number of steps to perform before beginning to compare the alternative table probabilities to that of the observed table. It corresponds to a burn-in. A few thousand steps are necessary to reach a random starting point corresponding to a table independent from the observed table.
LD coefficients between pairs of alleles at different loci
Compute D, D' and r2 coefficients [b] (between all pairs of alleles at different loci): See section 7.1.4.3.
1) D: The classical linkage disequilibrium coefficient measuring deviation from random association between alleles at different loci (Lewontin and Kojima, 1960), expressed as $D = p_{ij} - p_i p_j$.
2) D': The linkage disequilibrium coefficient D standardized by the maximum value it can take ($D' = D / D_{max}$).
3) r2: $r^2 = \frac{D^2}{p_i (1 - p_i)\, p_j (1 - p_j)}$.
o Generate histogram and table [b]: Generates a histogram of the number of loci with which each locus is in disequilibrium, and an s by s table (s being the number of polymorphic loci) summarizing the significant associations between pairs of loci. This table is generated for different levels of polymorphism, controlled by the value y: a locus is declared polymorphic if there are at least 2 alleles with y copies in the sample (Slatkin, 1994a). This is done because the exact test is more powerful at detecting departure from equilibrium for higher values of y (Slatkin, 1994a). The results are output in a file called ld_dis.xl.
Significance level [f]: The level at which the test of linkage disequilibrium is considered significant for the output table.
6.3.8.5.1.2 Gametic phase unknown
When the gametic phase is not known, we use a different procedure for testing the significance of the association between pairs of loci (see section 7.1.4.2). It is based on a likelihood ratio test, where the likelihood of the sample evaluated under the hypothesis of no association between loci (linkage equilibrium) is compared to the likelihood of the sample when association is allowed (see Slatkin and Excoffier, 1996). The significance of the observed likelihood ratio is found by computing the null distribution of this ratio under the hypothesis of linkage equilibrium, using a permutation procedure.
Linkage disequilibrium between all pairs of loci [b]: Perform the likelihood-ratio test (see section 7.1.4.2).
No. of permutations [i]: Number of random permuted samples to generate. Figures of several thousands are in order, and 16,000 permutations guarantee less than 1% difference from the exact probability in 99% of the cases (Guo and Thomson, 1992). A standard error for the estimated P-value is obtained using a system of batches (Guo and Thomson, 1992).
No. of initial conditions for EM [i]: Sets the number of random initial conditions from which the EM is started to repeatedly estimate the sample likelihood. The haplotype frequencies globally maximizing the sample likelihood are the ones eventually kept. Figures of 100 or more are in order.
Generate histogram and table [b]: Generates a histogram of the number of loci with which each locus is in disequilibrium, and an s by s table (s being the number of polymorphic loci) summarizing the significant associations between pairs of loci. This table is generated for different levels of polymorphism, controlled by the value y: a locus is declared polymorphic if there are at least 2 alleles with y copies in the sample (Slatkin, 1994a). This is done because the exact test is more powerful at detecting departure from equilibrium for higher values of y (see Slatkin, 1994a). The results are output in a file called ld_dis.xl.
Significance level [f]: The level at which the test of linkage disequilibrium is considered significant for the output table.
6.3.8.5.2 Hardy-Weinberg equilibrium
Perform exact test of Hardy-Weinberg equilibrium [b]: Test of the hypothesis that the observed diploid genotypes are the product of a random union of gametes. This test is only possible for genotypic data. Separate tests are carried out at each locus. This test is analogous to Fisher's exact test on a two-by-two contingency table but extended to a contingency table of arbitrary size (see section 7.1.5). If the gametic phase is unknown, the test is only possible locus by locus. For data with known gametic phase, it is also possible to test the association at the haplotypic level within individuals.
No. of steps in Markov chain [i]: The maximum number of alternative tables to explore. Figures of 100,000 or more are in order.
No. of dememorisation steps [i]: The number of steps to perform before beginning to compare the alternative table probabilities to that of the observed table. A few thousand steps are necessary to reach a random starting point corresponding to a table independent from the observed table.
HWE test type
o Locus by locus [m]: Perform a separate HWE test for each locus.
o Whole haplotype [m]: Perform a HWE test at the haplotype level (if gametic phase is available).
o Locus by locus and whole haplotype [m]: Perform both kinds of tests (if gametic phase is available).
Tests of selective neutrality, based either on the infinite-allele model or on the infinite-site model (see section 7.1.6).
Infinite allele model
Ewens-Watterson neutrality tests [b]: Performs tests of selective neutrality based on Ewens' sampling theory in a population at equilibrium (Ewens, 1972).
These tests are currently limited to sample sizes of 2000 genes or less and 1000 different alleles (haplotypes) or less.
Ewens-Watterson homozygosity test: This test, devised by Watterson (1978, 1986), is based on Ewens' sampling theory, but uses as a statistic the quantity F equal to the sum of squared allele frequencies, equivalent to the sample homozygosity in diploids (see section 7.1.6.1).
Exact test based on Ewens' sampling theory: In this test, devised by Slatkin (1994b, 1996), the probability of the observed sample is compared to that of a random neutral sample with the same number of alleles and identical size. The probability of the sample under selective neutrality is obtained as the proportion of random samples that are less probable than, or as probable as, the observed sample.
No. of simulated samples [i]: Number of random samples to be generated for the two neutrality tests mentioned above. Values of several thousands are in order, and 16,000 permutations guarantee less than 1% difference from the exact probability in 99% of the cases (see Guo and Thomson, 1992).
Chakraborty's test of population amalgamation [b]: A test of selective neutrality and population homogeneity and equilibrium (Chakraborty, 1990). This test can be used when sample heterogeneity is suspected. It uses the observed homozygosity to estimate the population mutation parameter $\theta_{Hom}$. The estimated value of this parameter is then used to compute the probability of observing k alleles or more in a neutral sample drawn from a stationary population. This test is based on Chakraborty's observation that the observed homozygosity is not very sensitive to population amalgamation or sample heterogeneity, whereas the number of observed (low frequency) alleles is more affected by this phenomenon.
Infinite site model
Tajima's D [b]: This test described by Tajima (1989a, 1989b, 1993) compares two estimators of the population parameter $\theta$, one being based on the number of segregating sites in the sample, and the other being based on the mean number of pairwise differences between haplotypes. Under the infinite-site model, both estimators should estimate the same quantity, but differences can arise under selection, population non-stationarity, or heterogeneity of mutation rates among sites (see section 7.1.6.4).
Fu's FS [b]: This test described by Fu (1997) is based on the probability of observing k or more alleles in a sample of a given size, conditioned on the observed average number of pairwise differences. The distribution of the statistic is obtained
pairwise differences. This test has been shown to be especially sensitive to departure from population equilibrium as in case of a population expansion (see section 7.1.6.4).
Haplotype definition
The way haplotypes are defined is important here since some tests are based on the number of alleles in the samples, and therefore it is better to re-evaluate this quantity before doing these tests (Chakraborty's test, Ewens-Watterson, and Fu's FS).
Use original definition [m]: Haplotypes are identified according to their original identifier, without considering the fact that their molecular definition could be identical.
Infer from distance matrix [m]: Similar haplotypes will be identified by computing a distance matrix based on the settings chosen above. When this option is activated, a search for shared haplotypes is automatically performed at the beginning of each run, and new haplotype definitions and frequencies are computed for each population.
AMOVA computation [b]: Analysis of MOlecular VAriance framework and computation of a Minimum Spanning Network among haplotypes. Estimate genetic structure indices using information on the allelic content of haplotypes, as well as their frequencies (Excoffier et al. 1992). The information on the differences in allelic content between haplotypes is entered as a matrix of Euclidean squared distances. The significance of the covariance components associated with the different possible levels of genetic structure (within individuals, within populations, within groups of populations, among groups) is tested using non-parametric permutation procedures (Excoffier et al. 1992). The type of permutations is different for each covariance component (see section 7.2). The minimum spanning tree and network are computed among all haplotypes defined in the samples included in the genetic structure to test (see section 7.2.2).
The number of hierarchical levels of the variance analysis and the kind of permutations that are done depend on the kind of data, the genetic structure that is tested, and the options the user might choose. All details are given in section 7.2.
Locus by locus AMOVA [b]: A separate AMOVA can be performed for each locus. For this purpose, we use the same number of permutations as in the global AMOVA. This procedure should be favored when there is some missing data.
Compute Population Specific FST's [b]: Population specific FST indices will be computed (as defined in section 7.2.4) for all loci, and for each locus separately if the Locus by locus AMOVA option is checked. Note that this option is only available if a single group is defined in the [[Structure]] section. No test of these coefficients is performed, as they are only provided for exploratory purposes.
No. of permutations [i]: Enter the number of permutations used to test the significance of covariance components and fixation indices. A value of zero will not lead to any testing procedure. Values of several thousands are in order for a proper testing scheme, and 16,000 permutations guarantee less than 1% difference from the exact probability in 99% of the cases (Guo and Thomson, 1992). The number of permutations used by the program might be slightly larger. This is the consequence of the subdivision of the total number of permutations into batches for estimating the standard error of the P-value. Note that if several covariance components need to be tested, the probability of each covariance component will be estimated with this number of permutations. The distribution of the covariance components is output into a tabulated text file called amo_hist.xl, which can be directly read into MS Excel.
Compute Minimum Spanning Network (MSN) among haplotypes: A Minimum Spanning Tree and a Minimum Spanning Network are computed from the distance matrix used to perform the AMOVA calculations.
Choice of Euclidean square distances [m]:
o Use project distance matrix [m]: Use the distance matrix defined in the project file (if available).
o Compute distance matrix [m]: Compute a given distance matrix based on a method defined below. With this setting selected, the distance matrix potentially defined in the project file will be ignored. This matrix can be generated either for haplotypic data or genotypic data (Michalakis and Excoffier, 1996).
o Use conventional F-statistics [m]: With this setting activated, we will use a lower diagonal distance matrix, with zeroes on the diagonal and ones as off-diagonal elements. It means that all distances between non-identical haplotypes will be considered as identical, implying that one will base the analysis of genetic structure only on allele frequencies.
Distance between haplotypes [m]: Select a distance method to compute the distances between haplotypes. Different square Euclidean distances can be used depending on the type of data analyzed.
o Gamma a value [f]: Set the value for the shape parameter a of the gamma function, when selecting a distance allowing for unequal mutation rates among sites. See the Molecular diversity section 7.1.2.5.
6.3.8.7.1.2 AMOVA with genotypic data
Compared to haplotypic data, it becomes possible to compute the average inbreeding coefficient FIS with diploid genotypic data.
Include individual level for genotype data [b]: Include the intra-individual covariance component of genetic diversity, and its associated inbreeding coefficients (FIS and FIT). It thus takes into account the differences between genes found within individuals. This is another way to test for global departure from Hardy-Weinberg equilibrium. The selection of this option is only possible for genotypic data with known gametic phase.
6.3.8.7.2 Population comparison
Population comparisons [b]: Computes different indexes of dissimilarities (genetic distances) between pairs of populations, like FST statistics and transformed pairwise FST's that can be used as short term genetic distances between populations (Reynolds et al. 1983; Slatkin, 1995), but also Nei's mean number of pairwise differences within and between pairs of populations. The significance of the genetic distances is tested by permuting the haplotypes or individuals between the populations. See section 7.2.3 for more details on the output results (genetic distances and migration rates estimates between populations).
Compute pairwise FST [b]: Computes pairwise FST's for all pairs of populations.
Slatkin's distances [b]: Computes Slatkin's (1995) genetic distance derived from pairwise FST (see section 7.2.5.2).
Reynolds' distance [b]: Computes Reynolds et al. (1983) linearized FST for short divergence time (see section 7.2.5.1).
Compute pairwise differences [b]: Computes Nei's average number of pairwise differences within and between populations (Nei and Li, 1979) (see section 7.2.5.4).
o Estimate relative population sizes [b]: Computes relative population sizes for all pairs of populations, as well as divergence times between populations taking into account these potential differences between population sizes (Gaggiotti and Excoffier, 2000) (see section 7.2.5.5).
No. of permutations [i]: Enter the required number of permutations to test the significance of the derived genetic distances. If this number is set to zero, no testing procedure will be performed. Note that this procedure is quite time consuming when the number of populations is large.
Significance level [f]: The level at which the test of differentiation is considered significant for the output table. If the P-value is smaller than the Significance level, then the two populations are considered as significantly different.
Choice of Euclidean distance [m]: Select a distance method to compute the distances between haplotypes. Different square Euclidean distances can be used depending on the type of data analyzed.
o Use project distance matrix [m]: Use the distance matrix defined in the project file (if available).
o Compute distance matrix [m]: Compute a given distance matrix based on a method defined below. With this setting selected, the distance matrix potentially defined in the project file will be ignored. This matrix can be generated either for haplotypic data or genotypic data (Michalakis and Excoffier, 1996).
o Gamma a value [f]: Set the value for the shape parameter a of the gamma function, when selecting a distance allowing for unequal mutation rates among sites. See the Molecular diversity section 7.1.2.5. This parameter only applies to DNA data.
o Use conventional F-statistics [m]: With this setting activated, we will use a lower diagonal distance matrix, with zeroes on the diagonal and ones as off-diagonal elements. It means that all distances between non-identical haplotypes will be considered as identical, implying that one will base the analysis of genetic structure only on allele frequencies.
6.3.8.7.3 Population differentiation
Exact test of population differentiation [b]: We test the hypothesis of random distribution of the individuals between pairs of populations as described in Raymond and Rousset (1995) and Goudet et al. (1996). This test is analogous to Fisher's exact test on a two-by-two contingency table, but extended to a contingency table of size two by (no. of haplotypes). We also perform an exact differentiation test for all populations defined in the project by constructing a table of size (no. of populations) by (no. of haplotypes) (Raymond and Rousset, 1995).
No. of steps in Markov chain [i]: The maximum number of alternative tables to explore. Figures of 100,000 or more are in order. Larger values of the step number increase the precision of the P-value as well as its estimated standard deviation.
No. of dememorisation steps [i]: The number of steps to perform before beginning to compare the alternative table probabilities to that of the observed table. It corresponds to a burn-in. A few thousand steps are necessary to reach a random starting point corresponding to a table independent from the observed table.
Generate histogram and table [b]: Generates a histogram of the number of populations which are significantly different from a given population, and a P x P table (P being the number of populations) summarizing the significant associations between pairs of populations. An association between two populations is considered as significant or not depending on the significance level specified below.
Significance level [f]: The level at which the test of differentiation is considered significant for the output table. If the P-value is smaller than the Significance level, then the two populations are considered as significantly different.
Perform genotype assignment for all pairs of populations: Computes the log likelihood of the genotype of each individual in every sample, as if it were drawn from a population sample having allele frequencies equal to those estimated for each sample (Paetkau et al. 1997; Waser and Strobeck, 1998). Multi-locus genotype likelihoods are computed as the product of each locus likelihood, thus assuming that the loci are independent. The output result file lists, for each population, a table of the log-likelihood of each individual genotype in all populations (see section 7.2.7).
Compute correlation between distance matrices: Test the correlation or the partial correlations between 2 or 3 matrices by a permutation procedure (Mantel, 1967; Smouse et al. 1986).
Number of permutations: Sets the number of permutations for the Mantel test.
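The permutation procedure behind a two-matrix Mantel test can be illustrated with a short Python sketch. This is not Arlequin's implementation: the function names, the use of the Pearson correlation on the upper triangles, and the one-tailed P-value convention (proportion of permuted correlations at least as large as the observed one) are assumptions made for illustration. Partial correlations among three matrices are not covered.

```python
import random
from math import sqrt


def _upper_triangle(matrix, order):
    """Upper-triangle elements of a symmetric matrix after reordering rows/columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel_test(m1, m2, n_perm=1000, seed=None):
    """Return (observed correlation, one-tailed permutation P-value).

    m1 and m2 are symmetric distance matrices (lists of lists) of equal size.
    Rows and columns of m2 are permuted jointly, as in a Mantel test.
    """
    rng = random.Random(seed)
    identity = list(range(len(m1)))
    x = _upper_triangle(m1, identity)
    r_obs = _pearson(x, _upper_triangle(m2, identity))
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)
        # Small tolerance so that ties with the observed value are counted.
        if _pearson(x, _upper_triangle(m2, order)) >= r_obs - 1e-12:
            hits += 1
    return r_obs, hits / n_perm
```

With a larger number of permutations the P-value estimate becomes more precise, at a proportional cost in computing time.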
7 METHODOLOGICAL OUTLINES
The following table gives a rapid overview of the methods implemented in Arlequin. A mark in a table entry indicates that the corresponding task is possible. Some tasks are only possible or meaningful if there is no recessive data, and those cases are marked with an "r".
Data types (table columns): DNA & RFLP (G+, G-, H), Microsat (G+, G-, H), Standard (G+, G-, H), Frequency
Types of computations (table rows):
Standard indices
Molecular diversity
Mismatch distribution
Haplotype (or allele) frequency estimation
Linkage disequilibrium
Hardy-Weinberg equilibrium
Tajima's neutrality test
Fu's neutrality test
Ewens-Watterson neutrality tests
Chakraborty's amalgamation test
Search for shared haplotypes between samples
AMOVA
Minimum Spanning Network (1)
Pairwise genetic distances
Exact test of population differentiation
Individual assignment test
Mantel test
G+: Genotypic data, gametic phase known
G-: Genotypic data, gametic phase unknown
H: Haplotypic data
(1) Computation of minimum spanning network between haplotypes is only possible if a distance matrix is provided or if it can be computed from the data.
7.1 Intra-population level methods
7.1.1 Standard diversity indices
7.1.1.1 Gene diversity
Equivalent to the expected heterozygosity for diploid data. It is defined as the probability that two randomly chosen haplotypes are different in the sample. Gene diversity and its sampling variance are estimated as
$$H = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^2\right)$$
$$V(H) = \frac{2}{n(n-1)}\left\{2(n-2)\left[\sum_{i=1}^{k} p_i^3 - \left(\sum_{i=1}^{k} p_i^2\right)^2\right] + \sum_{i=1}^{k} p_i^2 - \left(\sum_{i=1}^{k} p_i^2\right)^2\right\}$$
where $n$ is the number of gene copies in the sample, $k$ is the number of haplotypes, and $p_i$ is the sample frequency of the i-th haplotype. Note that Arlequin outputs the standard deviation of the heterozygosity, computed as
$$s.d.(H) = \sqrt{V(H)}.$$
Reference: Nei, 1987, p.180.
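As a worked illustration of the gene diversity estimate, the following Python sketch computes H and its standard deviation from a list of haplotype labels (one label per gene copy). It is not Arlequin's code: the function name is hypothetical, and the sampling variance is assumed to be the expression of Nei (1987), V(H) = 2/(n(n-1)) {2(n-2)[Σp³ - (Σp²)²] + Σp² - (Σp²)²}.

```python
from collections import Counter
from math import sqrt


def gene_diversity(haplotypes):
    """Return (H, s.d.(H)) for a list of haplotype labels, one per gene copy."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two gene copies are required")
    # Sample frequency of each distinct haplotype.
    freqs = [count / n for count in Counter(haplotypes).values()]
    sum_p2 = sum(p ** 2 for p in freqs)
    sum_p3 = sum(p ** 3 for p in freqs)
    h = n / (n - 1) * (1.0 - sum_p2)
    # Sampling variance, assumed here to follow Nei (1987).
    var = 2.0 / (n * (n - 1)) * (
        2.0 * (n - 2) * (sum_p3 - sum_p2 ** 2) + sum_p2 - sum_p2 ** 2
    )
    return h, sqrt(max(var, 0.0))
```

For example, a sample of four gene copies with two haplotypes at equal frequency gives H = 2/3.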
$$\pi = \sum_{i=1}^{k}\sum_{j=1}^{k} p_i p_j d_{ij},$$
$$V(\pi) = \frac{3n(n+1)\pi + 2(n^2+n+3)\pi^2}{11(n^2-7n+6)}$$
(Tajima, 1993)
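The double sum over haplotype frequencies and pairwise distances can be evaluated directly. The Python sketch below is only an illustration under stated assumptions: the function name is hypothetical, the distance between two haplotypes is taken to be the plain number of differing positions, and the sum runs over all ordered pairs of haplotypes, following the index ranges i = 1 and j = 1 visible in the formula.

```python
from collections import Counter


def mean_pairwise_differences(sequences):
    """Evaluate sum_i sum_j p_i * p_j * d_ij for equal-length sequences.

    `sequences` holds one sequence (string) per gene copy; identical
    strings are treated as the same haplotype.
    """
    n = len(sequences)
    counts = Counter(sequences)
    freqs = {hap: c / n for hap, c in counts.items()}

    def distance(a, b):
        # Assumed distance: number of positions at which a and b differ.
        return sum(x != y for x, y in zip(a, b))

    return sum(
        freqs[a] * freqs[b] * distance(a, b) for a in freqs for b in freqs
    )
```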
Note that similar formulas are also used for Microsat and Standard data, even though the underlying assumptions of the model may be violated. Note also that Arlequin
7.1.2.2 Nucleotide diversity or average gene diversity over L loci (RFLP and DNA data)
It is the probability that two randomly chosen homologous nucleotides are different. It is equivalent to the gene diversity at the nucleotide level.
$$\pi_n = \frac{\sum_{i=1}^{k}\sum_{j<i} p_i p_j d_{ij}}{L}$$
V( n ) =
Note that similar formulas are used for computing the average gene diversity over L loci for Microsat and Standard data, assuming no recombination and selective neutrality. As above, one should be aware that these assumptions may not hold for these data types. Note also that Arlequin outputs the standard deviation of the
$\theta = 2Mu$, where $M$ is equal to $2N$ for diploid populations of size $N$, or equal to $N$ for haploid populations, and $u$ is the overall mutation rate at the haplotype level.
7.1.2.3.1 Theta(Hom)
The expected homozygosity in a population at equilibrium between drift and mutation is usually given by
$$H = \frac{1}{\theta + 1}.$$
However, Zouros (1979) has shown that this estimator was an overestimate when estimated from a single or a few loci. Although he gave no closed form solution, Chakraborty and Weiss (1991) proposed to iteratively solve the following relationship between the expectation of
(Zouros, 1979)
of
Chakraborty and Weiss (1991) give an approximate formula for the standard error of
as
s.d.( H )
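where s.d.(H) is the standard error of H given in section 7.1.1.1.
7.1.2.3.2 Theta(S)
$\theta_S$ is estimated from the infinite-site equilibrium relationship between the number of segregating sites (S), the sample size (n) and $\theta$ for a sample of non-recombining DNA:
$$\theta = \frac{S}{a_1}$$
where
$$a_1 = \sum_{i=1}^{n-1} \frac{1}{i}.$$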
(Tajima, 1989)
$$a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2}$$
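The Theta(S) relationship is simple enough to check by hand. The following Python sketch (the function name is hypothetical, and this is not Arlequin's code) evaluates theta = S / a1, with a1 the sum of 1/i for i from 1 to n-1.

```python
def theta_s(num_segregating_sites, sample_size):
    """Estimate theta from the number of segregating sites S and sample size n."""
    if sample_size < 2:
        raise ValueError("sample size must be at least 2")
    # a1 = 1 + 1/2 + ... + 1/(n-1)
    a1 = sum(1.0 / i for i in range(1, sample_size))
    return num_segregating_sites / a1
```

For a sample of n = 4 sequences, a1 = 1 + 1/2 + 1/3 = 11/6, so that S = 11 segregating sites would give an estimate of 6.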
7.1.2.3.3 Theta(k)
$$E(k) = \theta \sum_{i=0}^{n-1} \frac{1}{\theta + i}$$
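Because the relationship E(k) = θ Σ 1/(θ + i) has no closed-form inverse, θ has to be found numerically from the observed number of alleles. The Python sketch below uses simple bisection; the function names and the choice of bisection are assumptions made for illustration, not a description of how Arlequin solves the equation.

```python
def expected_alleles(theta, n):
    """E(k) = theta * sum_{i=0}^{n-1} 1 / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))


def theta_k(k, n, tol=1e-10):
    """Solve E(k) = k for theta by bisection; requires 1 < k < n."""
    if not 1 < k < n:
        raise ValueError("a finite positive solution requires 1 < k < n")
    lo, hi = 1e-12, 1.0
    # E(k) increases with theta from 1 towards n, so the root can be bracketed.
    while expected_alleles(hi, n) < k:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if expected_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For n = 3 and k = 2 the equation reduces to θ² = 2, which gives a quick check of the solver.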
Instead of the variance of $\theta_k$, we give the limits ($\theta_0$ and $\theta_1$) of a 95% confidence interval around $\theta_k$, obtained from Ewens (1972)
$$\Pr(\text{less than } k \text{ alleles} \mid \theta = \theta_0) = 0.025$$
$$\Pr(\text{more than } k \text{ alleles} \mid \theta = \theta_1) = 0.025,$$
These probabilities are obtained by summing up the probabilities of observing k' alleles (k' = 0,...,k), obtained as (Ewens, 1972)
$$\Pr(K = k \mid \theta) = \frac{S_n^k\, \theta^k}{S_n(\theta)}$$
where $S_n^k$ is a Stirling number of the first kind (see Abramovitz and Stegun, 1970), and $S_n(\theta)$ is defined as $\theta(\theta + 1)(\theta + 2)\cdots(\theta + n - 1)$.
7.1.2.3.4 Theta($\pi$)
$\theta_\pi$ is estimated from the infinite-site equilibrium relationship between the mean number of pairwise differences ($\pi$) and theta ($\theta$): (Tajima, 1983)
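$$F_S(\tau, \theta_0, \theta_1) = F_S(\theta_1) + \exp\left(-\tau\,\frac{\theta_1 + 1}{\theta_1}\right) \sum_{j=0}^{S} \frac{\tau^j}{j!}\left[F_{S-j}(\theta_0) - F_{S-j}(\theta_1)\right],$$
(Li, 1977)
where
$$F_S(\theta) = \frac{\theta^S}{(\theta + 1)^{S+1}}$$
$\tau = 2ut$, and $u$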
there are no coalescent events after the expansion, which is only reasonable if the expansion size is large. With this simplifying assumption, it is possible to derive the moment estimators of the time to the expansion ($\tau$) and the mutation parameter $\theta_0$ as
$$\theta_0 = \sqrt{v - m}$$
$$\tau = m - \theta_0$$
(Rogers, 1995)
where m and v are the mean and the variance of the observed mismatch distribution, respectively. These estimators can then be used to plot $F_S(\tau, \theta_0, \infty)$ values. Note, however, that this estimation cannot be done if the variance of the mismatch distribution is smaller than the mean.
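The moment estimators are easy to evaluate once the mean and variance of the mismatch distribution are known. The Python sketch below (hypothetical function name, for illustration only) also makes the restriction explicit: no estimate is returned when the variance is smaller than the mean.

```python
from math import sqrt


def rogers_moment_estimators(m, v):
    """Return (theta0, tau) from the mean m and variance v of the mismatch distribution."""
    if v < m:
        # The square root of (v - m) is undefined in this case.
        raise ValueError("variance smaller than mean: estimation not possible")
    theta0 = sqrt(v - m)
    tau = m - theta0
    return theta0, tau
```

For instance, a mismatch distribution with mean 5 and variance 9 gives θ0 = 2 and τ = 3.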
However, Schneider and Excoffier (1999) find that this moment estimator often leads to an underestimation of the age of the expansion ($\tau$). They rather propose to estimate the parameters of the demographic expansion by a generalized non-linear least-square approach. This is the method we now use to estimate the parameters of the demographic expansion $\tau$, $\theta_0$, and $\theta_1$. Approximate confidence intervals for those parameters are obtained by a parametric bootstrap approach. The principle is the following: We computed approximate
$\theta_0^*$, $\theta_1^*$ and $\tau^*$.
For a given confidence level $\alpha$, the approximate limits of the confidence interval were obtained as the $\alpha/2$ and $1-\alpha/2$ percentile values (Efron, 1993, p. 168). It is important to underline that this form of parametric bootstrap assumes that the data are distributed according to the sudden expansion model. In Schneider and Excoffier (1999), we showed by simulation that only the confidence interval (CI) for $\tau$ has a good coverage (i.e. that the true value of the parameter is included in a $100 \times (1-\alpha)$% CI with a probability very close to $1-\alpha$). The CIs of the other two parameters are overly large (the true value of the parameter was almost always included in the CI), and thus too conservative. The validity of the estimated stepwise expansion model is tested using the same parametric bootstrap approach as described above. We used here the sum of square deviations (SSD) between the observed and the expected mismatch as a test statistic. We obtained its distribution under the hypothesis that the estimated parameters are the true ones, by simulating B samples around the estimated parameters. As before, we reestimated each time new parameters
P=
For convenience, we also compute the raggedness index of the observed distribution defined by Harpending (1994) as
$$r = \sum_{i=1}^{d+1} (x_i - x_{i-1})^2,$$
where d is the maximum number of observed differences between haplotypes, and the x's are the observed relative frequencies of the mismatch classes. This index takes larger values for multimodal distributions commonly found in a stationary population than for unimodal and smoother distributions typical of expanding populations. Its significance is tested similarly to that of SSD.
References:
Rogers and Harpending (1991)
Rogers (1995)
Schneider and Excoffier (1999)
Excoffier (2004)
7.1.2.4.2 Spatial expansion
A population spatial expansion generally occurs if the range of a population is initially restricted to a very small area, and then the range of the population increases over time and over space. The resulting population generally becomes subdivided, in the sense that individuals will tend to mate with geographically close individuals rather than remote individuals. Based on simulations, Ray et al. (2003) have shown that a large spatial expansion can lead to the same signal in the mismatch distribution as a pure demographic expansion in a panmictic population, but only if neighboring sub-populations (demes) exchange many migrants (50 or more). The simulations in Ray et al. (2003) were performed in a two-dimensional stepping-stone model. T generations ago, a haploid population restricted to a single deme of size N began to send migrants to neighboring demes at rate m, progressively colonizing the whole world. During the expansion, the size of each deme followed a logistic regulation with carrying capacity K and intrinsic rate of growth r. During the whole process, neighboring demes continue to exchange a fraction m of migrants.
While this model is difficult to describe analytically, Excoffier (2004) derived the expected mismatch distribution under a simpler model of spatial expansion. He assumed that one has sampled genes from a single deme belonging to a population subdivided into an infinite number of demes, each of size N, which would exchange a fraction m of migrants with other demes. This infinite-island model is actually equivalent to a continent-island model, where the sampled deme would exchange migrants at rate m with a unique population of infinite size. Some T generations in the past, the continent-island system would be reduced to a single deme of size N0, like:
[Figure: Continent-island model. A sampled deme of size N exchanges migrants at rate m with a population of infinite size; T generations ago, the system was reduced to a single deme of size N0.]
Under this simple model, the probability that two genes currently sampled in the small deme of size N differ at S sites is given by
F0 ( S ; M , 0 ; 1 , ) =
where
AS +1
1S
we assume that N=N0), and M=2Nm, using the same least-square method as described in the case of the estimation of the parameters of a demographic expansion (see section 7.1.2.4.1). As for the demographic expansion, we also provide the expected mismatch distribution and test the fit to the model by coalescent simulations of an instantaneous expansion under the continent-island model defined above.
References:
Ray et al. (2003)
Excoffier (2004)
L: Number of loci
Gamma correction: This correction is proposed when the mutation rates cannot be assumed as uniform for all sites. It was originally proposed for mutation rates among amino acids (Uzell and Corbin, 1971), but it also seems to apply to the control region of human mtDNA (Wakeley, 1993). In such a case, a Gamma distribution of mutation rates is often assumed. The shape of this distribution (the unevenness of the mutation rates) is mainly controlled by a parameter a, which is the inverse of the squared coefficient of variation of the mutation rate. The smaller the a coefficient, the more uneven the mutation rates. A uniform mutation rate corresponds to the case where a is equal to infinity.
nd: Number of observed substitutions between two DNA sequences
ns: Number of observed transitions between two DNA sequences
nv: Number of observed transversions between two DNA sequences
ω: G+C ratio, computed on all the DNA sequences of a given sample
7.1.2.5.1 Pairwise difference
Outputs the number of loci for which two haplotypes are different
$$d = n_d$$
$$V(d) = d(L - d)/L$$
7.1.2.5.2 Percentage difference
Outputs the percentage of loci for which two haplotypes are different
$$d = n_d / L$$
$$V(d) = d(1 - d)/L$$
7.1.2.5.3 Jukes and Cantor
Outputs a corrected percentage of nucleotides for which two haplotypes are different. The correction allows for multiple substitutions per site since the most recent common ancestor of the two DNA sequences. The correction also assumes that the rate of nucleotide substitution is identical for all 4 nucleotides A, C, G and T.
$$\hat{p} = n_d / L$$
$$\hat{d} = -\frac{3}{4}\log\left(1 - \frac{4}{3}\hat{p}\right)$$
$$V(\hat{d}) = \frac{\hat{p}\,(1 - \hat{p})}{\left(1 - \frac{4}{3}\hat{p}\right)^{2} L}$$
Gamma correction:
$$\hat{d} = \frac{3}{4}\,a\left[\left(1 - \frac{4}{3}\hat{p}\right)^{-1/a} - 1\right]$$
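The Jukes and Cantor correction can be illustrated with a short Python sketch. It is only an illustration of the formulas of this section, not Arlequin's own code: the function name and the `gamma_a` argument are hypothetical, and the two sequences are assumed to be aligned and of equal length.

```python
import math

def jukes_cantor_distance(seq_x, seq_y, gamma_a=None):
    """Return (p, d) for two aligned sequences of equal length.

    p is the observed proportion of differing sites (n_d / L) and d the
    Jukes-Cantor corrected distance. If gamma_a is given, the Gamma
    corrected form with shape parameter a is used instead.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned and of equal length")
    L = len(seq_x)
    n_d = sum(1 for x, y in zip(seq_x, seq_y) if x != y)
    p = n_d / L
    w = 1.0 - 4.0 * p / 3.0
    if w <= 0.0:
        raise ValueError("p >= 3/4: the corrected distance is undefined")
    if gamma_a is None:
        d = -0.75 * math.log(w)
    else:
        d = 0.75 * gamma_a * (w ** (-1.0 / gamma_a) - 1.0)
    return p, d

# Two hypothetical 10-bp sequences differing at the last two sites.
p, d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTTT")
p_g, d_g = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTTT", gamma_a=0.5)
```

The corrected distance is always at least as large as the observed proportion of differences, and the Gamma corrected value tends towards the uncorrected Jukes and Cantor value as a grows.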
Gamma correction:
References:
Outputs a corrected percentage of nucleotides for which two haplotypes are different. The correction is an extension of Kimura 2-parameters method, allowing for unequal nucleotide frequencies. The transition-transversion ratios, as well as the overall nucleotide frequencies are computed from the original data.
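$$\hat{P} = \frac{n_s}{L}, \quad \hat{Q} = \frac{n_v}{L}$$
$$c_1 = \frac{1}{1 - \frac{\hat{P}}{2\omega(1-\omega)} - \hat{Q}}, \quad c_2 = \frac{1}{1 - 2\hat{Q}}, \quad c_3 = 2\omega(1-\omega)(c_1 - c_2) + c_2$$
$$\hat{d} = -2\omega(1-\omega)\log\left(1 - \frac{\hat{P}}{2\omega(1-\omega)} - \hat{Q}\right) - \frac{1}{2}\left(1 - 2\omega(1-\omega)\right)\log(1 - 2\hat{Q})$$
where $\omega$ denotes the G+C ratio of the sample.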
References: Tamura, 1992, Kumar et al. 1993
7.1.2.5.6 Tajima and Nei
Outputs a corrected percentage of nucleotides for which two haplotypes are different. The correction is an extension of the Jukes and Cantor method, allowing for unequal nucleotide frequencies. The overall nucleotide frequencies are computed from the data.
$$\hat{p} = \frac{n_d}{L}, \quad c = \sum_{i=1}^{3}\sum_{j=i+1}^{4}\frac{x_{ij}^{2}}{2 g_i g_j}, \quad b = \frac{1}{2}\left(1 - \sum_{i=1}^{4} g_i^{2} + \frac{\hat{p}^{2}}{c}\right)$$
$$\hat{d} = -b\,\log\left(1 - \frac{\hat{p}}{b}\right)$$
where the g's are the four nucleotide frequencies, and $x_{ij}$ is the relative frequency of the nucleotide pair i and j.
Outputs a corrected percentage of nucleotides for which two haplotypes are different. Like Kimura 2-parameters, and Tajima and Nei distances, the correction allows for different transversion and transition rates, but a distinction is also made between transition rates between purines and between pyrimidines.
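$$c_1 = \frac{2 g_A g_G}{g_R}, \quad c_2 = \frac{2 g_C g_T}{g_Y}$$
$$c_3 = \frac{2 g_A g_G g_R}{2 g_A g_G g_R - g_R^{2}\hat{P}_1 - g_A g_G \hat{Q}}, \quad c_4 = \frac{2 g_T g_C g_Y}{2 g_T g_C g_Y - g_Y^{2}\hat{P}_2 - g_T g_C \hat{Q}}$$
$$c_5 = \frac{2 g_A^{2} g_G^{2}}{g_R\,(2 g_A g_G g_R - g_R^{2}\hat{P}_1 - g_A g_G \hat{Q})} + \frac{2 g_T^{2} g_C^{2}}{g_Y\,(2 g_T g_C g_Y - g_Y^{2}\hat{P}_2 - g_T g_C \hat{Q})} + \frac{g_R^{2}(g_T^{2} + g_C^{2}) + g_Y^{2}(g_A^{2} + g_G^{2})}{2 g_R^{2} g_Y^{2} - g_R g_Y \hat{Q}}$$
$$\hat{P}_1 = \frac{n_s(A \leftrightarrow G)}{L}, \quad \hat{P}_2 = \frac{n_s(C \leftrightarrow T)}{L}, \quad \hat{Q} = \frac{n_v}{L}$$
where $g_R = g_A + g_G$ and $g_Y = g_C + g_T$ are the purine and pyrimidine frequencies.
$$\hat{d} = -c_1\log\left(1 - \frac{\hat{P}_1}{c_1} - \frac{\hat{Q}}{2 g_R}\right) - c_2\log\left(1 - \frac{\hat{P}_2}{c_2} - \frac{\hat{Q}}{2 g_Y}\right) - (2 g_R g_Y - c_1 g_Y - c_2 g_R)\log\left(1 - \frac{\hat{Q}}{2 g_R g_Y}\right)$$
$$V(\hat{d}) = \frac{c_3^{2}\hat{P}_1 + c_4^{2}\hat{P}_2 + c_5^{2}\hat{Q} - (c_3\hat{P}_1 + c_4\hat{P}_2 + c_5\hat{Q})^{2}}{L}$$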
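Gamma correction:
$$\hat{d} = a\left[c_1\left(1 - \frac{\hat{P}_1}{c_1} - \frac{\hat{Q}}{2 g_R}\right)^{-1/a} + c_2\left(1 - \frac{\hat{P}_2}{c_2} - \frac{\hat{Q}}{2 g_Y}\right)^{-1/a} + (2 g_R g_Y - c_1 g_Y - c_2 g_R)\left(1 - \frac{\hat{Q}}{2 g_R g_Y}\right)^{-1/a} - 2 g_A g_G - 2 g_T g_C - 2 g_R g_Y\right]$$
$$V(\hat{d}) = \frac{c_3^{2}\hat{P}_1 + c_4^{2}\hat{P}_2 + c_5^{2}\hat{Q} - (c_3\hat{P}_1 + c_4\hat{P}_2 + c_5\hat{Q})^{2}}{L}$$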
References: Tamura and Nei, 1994, Kumar et al. 1993
$$\hat{d}_{xy} = \sum_{i=1}^{L}\delta_{xy}(i)$$
where $\delta_{xy}(i)$ is equal to 1 if the alleles of the i-th locus differ between the two haplotypes, and equal to 0 otherwise. When estimating genetic structure indices, this choice amounts to estimating weighted FST statistics over all loci (Weir and Cockerham, 1984; Michalakis and Excoffier, 1996).
7.1.2.6.2 Proportion of difference
We simply count the proportion of loci that are different between two RFLP haplotypes.
$$\hat{d}_{xy} = \frac{1}{L}\sum_{i=1}^{L}\delta_{xy}(i)$$
where $\delta_{xy}(i)$ is equal to 1 if the alleles of the i-th locus differ between the two haplotypes, and equal to 0 otherwise. When estimating genetic structure indices, this choice will lead to exactly the same results as the number of pairwise differences.
$$\hat{d}_{xy} = \sum_{i=1}^{L}\delta_{xy}(i)$$
where $\delta_{xy}(i)$ is equal to 1 if the alleles of the i-th locus differ between the two haplotypes, and equal to 0 otherwise. When estimating genetic structure indices, this choice amounts to estimating weighted FST statistics over all loci (Weir and Cockerham, 1984; Michalakis and Excoffier, 1996).
7.1.2.7.2 Sum of squared size difference
Counts the sum of the squared differences in repeat number between two haplotypes (Slatkin, 1995).
$$\hat{d}_{xy} = \sum_{i=1}^{L}(a_{xi} - a_{yi})^{2}$$
where $a_{xi}$ is the number of repeats at the i-th locus of haplotype x. When estimating genetic structure indices, this choice amounts to estimating an analog of Slatkin's RST (1995) (see Michalakis and Excoffier, 1996, as well as Rousset, 1996, for details on the relationship between FST and RST).
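The two microsatellite distances of this section can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and each haplotype is assumed to be coded as a list of repeat numbers, one per locus, in the same locus order.

```python
def n_different_alleles(hap_x, hap_y):
    """Number of loci at which the two haplotypes carry different alleles."""
    return sum(1 for a, b in zip(hap_x, hap_y) if a != b)

def sum_squared_size_difference(hap_x, hap_y):
    """Sum over loci of the squared difference in repeat number."""
    return sum((a - b) ** 2 for a, b in zip(hap_x, hap_y))

# Two hypothetical 4-locus haplotypes coded as repeat numbers.
hap_x = [12, 15, 9, 20]
hap_y = [12, 17, 8, 20]
d_count = n_different_alleles(hap_x, hap_y)
d_size = sum_squared_size_difference(hap_x, hap_y)
```

The first distance ignores by how many repeats two alleles differ, whereas the second one weights each locus by the squared size difference.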
$$\hat{d}_{xy} = \sum_{i=1}^{L}\delta_{xy}(i)$$
where $\delta_{xy}(i)$ is equal to 1 if the alleles of the i-th locus differ between the two haplotypes, and equal to 0 otherwise. When estimating genetic structure indices, this choice amounts to estimating weighted FST statistics over all loci (Weir and Cockerham, 1984; Michalakis and Excoffier, 1996).
7.1.2.9 Minimum Spanning Network among haplotypes
We have implemented the computation of a Minimum Spanning Tree (MST) (Kruskal, 1956; Prim, 1957) between OTUs (Operational Taxonomic Units). The MST is computed from the matrix of pairwise distances calculated between all pairs of haplotypes using a modification of the algorithm described in Rohlf (1973). The Minimum Spanning Network embedding all MSTs (see Excoffier and Smouse 1994) is also provided. This implementation is the translation of a standalone program written in Pascal called MINSPNET.EXE running under DOS, formerly available on http://anthropologie.unige.ch/LGB/software/win/min-span-net/.
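As an illustration of how a Minimum Spanning Tree can be obtained from a matrix of pairwise distances, here is a minimal Python sketch based on Prim's algorithm. It is not the modified Rohlf (1973) algorithm used by Arlequin, it returns a single MST rather than the network embedding all MSTs, and the function names and example haplotypes are hypothetical.

```python
def pairwise_differences(hap_x, hap_y):
    """Number of positions at which two haplotypes differ."""
    return sum(1 for a, b in zip(hap_x, hap_y) if a != b)

def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of edges (i, j, distance) forming one MST.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link to the growing tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the OTU outside the tree with the cheapest link to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

haplotypes = ["AAAA", "AAAT", "AATT", "TTTT"]
dist = [[pairwise_differences(x, y) for y in haplotypes] for x in haplotypes]
mst = minimum_spanning_tree(dist)
```

When several links have the same length, alternative MSTs exist; the Minimum Spanning Network is the union of all of them, which this sketch does not attempt to build.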
7.1.3 Haplotype inference
7.1.3.1 Haplotypic data or Genotypic data with known Gametic phase
If haplotype i is observed $x_i$ times in a sample containing n gene copies, then its estimated frequency is
$$\hat{p}_i = \frac{x_i}{n},$$
whereas an unbiased estimate of its sampling variance is given by
$$V(\hat{p}_i) = \frac{\hat{p}_i\,(1 - \hat{p}_i)}{n - 1}.$$
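A minimal Python sketch of this gene counting estimate and of its sampling variance is given here. The function name and the example sample are hypothetical.

```python
from collections import Counter

def haplotype_frequencies(haplotypes):
    """Map each haplotype to (estimated frequency, sampling variance).

    `haplotypes` lists one entry per sampled gene copy.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    result = {}
    for hap, x in counts.items():
        p = x / n
        result[hap] = (p, p * (1.0 - p) / (n - 1))
    return result

# Hypothetical sample of n = 10 gene copies.
sample = ["A"] * 6 + ["B"] * 3 + ["C"]
freqs = haplotype_frequencies(sample)
```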
7.1.3.2 Genotypic data with unknown Gametic phase
7.1.3.2.1 EM algorithm
Maximum-likelihood haplotype frequencies can be estimated using an Expectation-Maximization (EM) algorithm (see e.g. Dempster et al. 1977; Excoffier and Slatkin, 1995; Lange, 1997; Weir, 1996). This procedure is an iterative process aiming at obtaining maximum-likelihood estimates of haplotype frequencies from multi-locus genotype data when the gametic phase is unknown (phenotypic data). In this case, simple gene counting is not possible because several genotypes are possible for individuals that are heterozygous at more than one locus. Therefore, a slightly more elaborate procedure is needed.
The likelihood of the sample (the probability of the observed data D, given the haplotype frequencies p) is given by
$$L(D \mid \mathbf{p}) = \prod_{i=1}^{n}\sum_{j=1}^{g_i} G_{ij},$$
where the product is over all n individuals of the sample, and the sum is over the $g_i$ possible genotypes of the i-th individual. $G_{ij}$ is the Hardy-Weinberg frequency of the j-th of these genotypes: for a genotype made up of haplotypes k and l, $G_{ij} = 2 p_k p_l$ if $k \neq l$, or $G_{ij} = p_k^{2}$ if $k = l$.
The principle of the EM algorithm is the following:
1) Start with arbitrary (random) estimates of haplotype frequencies.
2) Use these estimates to compute expected genotype frequencies for each phenotype, assuming Hardy-Weinberg equilibrium (the E-step).
3) The relative genotype frequencies are used as weights for their two constituting haplotypes in a gene counting procedure leading to new estimates of haplotype frequencies (the M-step).
4) Repeat steps 2-3 until the haplotype frequencies reach equilibrium (do not change by more than a predefined epsilon value).
Dempster et al. (1977) have shown that the likelihood of the sample can only increase after each step of the EM algorithm. However, there is no guarantee that the resulting haplotype frequencies are maximum-likelihood estimates. They may be only locally optimal values. In fact, there is no obvious way to be sure that the resulting frequencies are those that globally maximize the likelihood of the data. This would require a complete evaluation of the likelihood for all possible genotype configurations of the sample. In order to check that the final frequencies are putative maximum-likelihood estimates, one generally has to repeat the EM algorithm from many different starting points (many different initial haplotype frequencies). Several runs may give different final frequencies, suggesting the presence of several "peaks" in the likelihood surface, and one then has to choose the solution that has the largest likelihood. It may also happen that several distinct peaks have the same likelihood, meaning that different haplotypic compositions explain the observed data equally well. In that case, there is no way to choose among the alternative solutions from a likelihood point of view, and some external information should be provided to make a decision.
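The four steps above can be sketched in Python as follows. This is a minimal illustration and not Arlequin's implementation: the function names, the genotype encoding (one pair of alleles per locus) and the example data are hypothetical, and the iteration is started from uniform frequencies rather than random ones so that the example is reproducible. A uniform start can stall on symmetric data sets, which is one reason why several random starting points are recommended.

```python
from itertools import product
from collections import defaultdict

def compatible_pairs(genotype):
    """List the unordered haplotype pairs compatible with an unphased genotype.

    `genotype` is a tuple of (allele, allele) pairs, one per locus.
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # The phase of the first heterozygous locus is fixed, so that each
    # haplotype pair is generated only once.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip = dict(zip(het[1:], flips))
        h1 = tuple(b if flip.get(i, False) else a for i, (a, b) in enumerate(genotype))
        h2 = tuple(a if flip.get(i, False) else b for i, (a, b) in enumerate(genotype))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, max_iter=1000, eps=1e-8):
    """Estimate haplotype frequencies from unphased multi-locus genotypes."""
    options = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    p = {h: 1.0 / len(haps) for h in haps}   # uniform start (an assumption)
    n_genes = 2 * len(genotypes)
    for _ in range(max_iter):
        counts = defaultdict(float)
        for opts in options:
            # E-step: Hardy-Weinberg weight of each compatible genotype.
            w = [p[h1] * p[h2] * (1 if h1 == h2 else 2) for h1, h2 in opts]
            total = sum(w)
            # M-step (gene counting): each genotype contributes its two
            # haplotypes in proportion to its relative weight.
            for (h1, h2), wi in zip(opts, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        new_p = {h: counts[h] / n_genes for h in haps}
        delta = max(abs(new_p[h] - p[h]) for h in haps)
        p = new_p
        if delta < eps:
            break
    return p

# Hypothetical two-locus sample: 4 AB/AB, 4 ab/ab and 2 double heterozygotes.
genotypes = ([(("A", "A"), ("B", "B"))] * 4
             + [(("a", "a"), ("b", "b"))] * 4
             + [(("A", "a"), ("B", "b"))] * 2)
freqs = em_haplotype_frequencies(genotypes)
```

In this example the double heterozygotes are resolved almost entirely as AB/ab, because the homozygous individuals show that the AB and ab haplotypes are common in the sample.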
Standard deviations of the haplotype frequencies are estimated by a parametric bootstrap procedure (see e.g. Rice, 1995), generating random samples from a population assumed to have haplotype frequencies equal to their maximum-likelihood values. For each bootstrap replicate, we apply the EM algorithm to get new maximum-likelihood haplotype frequencies. The standard deviation of each haplotype frequency is then estimated from the resulting distribution of haplotype frequencies. Note however that this procedure is quite computer intensive.
Reference: Excoffier and Slatkin (1995)
7.1.3.2.2 EM zipper algorithm
The EM zipper is a simple extension of the EM algorithm, aiming at speeding up the estimation process and allowing the handling of a much larger number of heterozygous sites per individual. The EM algorithm indeed becomes extremely slow when there are more than 20 heterozygous sites per individual, and it is therefore not suited for the analysis of long stretches of DNA with hundreds of polymorphic sites. The EM zipper therefore begins by estimating frequencies of two-locus haplotypes, then adds another locus to estimate 3-locus haplotype frequencies, then adds another locus to get 4-locus haplotype frequencies, and so on until all loci have been added. At each stage, any n-locus genotype which incorporates an n-locus haplotype with an estimated frequency equal to zero is prevented from being extended to n+1 loci, because it is likely that the frequency of an extended (n+1)-locus haplotype would also have been equal to zero. With this method, Arlequin does not need to build all possible genotypes for each individual, but only considers the genotypes whose sub-haplotypes have non-null frequencies, and one can thus handle a much larger number of polymorphic sites than with the conventional EM algorithm. In Arlequin's tab dialog (see section 6.3.8.4.2.2), one can specify if the loci should be added in random order or not, and how many random orders to implement. After multiple trials, Arlequin outputs the locus order having led to the largest likelihood. This version of the EM algorithm is equivalent to that implemented in the SNPHAP program (http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt) by David Clayton.
7.1.3.2.3 ELB algorithm
Contrary to the EM algorithm, which aims at estimating haplotype frequencies, the ELB algorithm attempts to reconstruct the (unknown) gametic phase of multi-locus genotypes. Phase updates are made on the basis of a window of neighboring loci, and the window size varies according to the local level of linkage disequilibrium.
Suppose that we have a sample of n individuals drawn from some population and genotyped at S loci whose chromosomal order is assumed known. Adjacent pairs of loci are assumed to be tightly linked, but S may be large so that the two external loci are effectively unlinked. In this case, reconstructing the gametic phase in one step can be inefficient, because recombination may have created too many distinct haplotypes for their frequencies to be well estimated. Locally, however, recombination may be rare, and to exploit this situation the updates in ELB of the phase at a heterozygous locus are based on windows of neighboring loci. The algorithm adjusts the window sizes and locations in order to maximize the information for the phase updates.
ELB starts with an arbitrary phase assignment for all individuals in the sample. Associated with each heterozygous locus is a window containing the locus itself and neighboring loci. At each iteration of the algorithm, an individual is chosen at random and its heterozygous loci are successively visited in random order. At each locus visit, two attempts are then made to update that window, by proposing, and then accepting or rejecting, (i) the addition of a locus at one end of the window, and (ii) the removal of a locus at the other end. The locus being visited is never removed from the window, and each window always includes at least one other heterozygous locus. The two update proposals are made sequentially so that the window can either grow by one locus, shrink by one locus, or, if both changes are accepted, slide by one locus either to the right or to the left. If both proposals are rejected, the window remains unchanged. Next, the phase at the locus being visited is updated based on the current haplotype pairs, within the chosen window, of the other individuals in the sample.
7.1.3.2.3.1 Phase updates
Let h11 and h22 denote the two haplotypes within the window given the current phase assignment, and let h12 and h21 denote the haplotypes which would result from the alternative phase assignment at the locus being visited. Ideally, we would wish to choose between the two haplotype assignments, h11/h22 and h12/h21, with probabilities proportional to their (joint) population frequencies. These are unknown, and in practice they are too small for direct estimation to be feasible.
To overcome the latter problem we assume HWE, so that we now seek to choose between h11/h22 and h12/h21 with probabilities proportional to p11p22 and p12p21, where pij, i,j=1,2, denotes the population frequency of hij. Although the pij are also unknown, we can estimate them using the nij, the haplotype counts among the other n-1 individuals in the sample, given their current phase assignments within the window. Adopting a Bayesian posterior mean estimate of the pij, based on a symmetric Dirichlet prior distribution for the pij with parameter $\alpha$, leads to choosing h11/h22 with probability
$$\Pr(h_{11}/h_{22} \mid \{n_{ij}\}) = \frac{(n_{11}+\alpha)(n_{22}+\alpha)}{(n_{11}+\alpha)(n_{22}+\alpha) + (n_{12}+\alpha)(n_{21}+\alpha)} \qquad (1)$$
Current phase in selected window:
ACCTCGCCT
GCTATCTAG
Switch phase update:
ACCTTGCCT
GCTACCTAG
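The choice between keeping and switching the phase can be sketched as follows. This is only an illustration under stated assumptions: it takes the posterior mean form in which the Dirichlet parameter (called `alpha` here) is added to each haplotype count, and both the function name and the default value of `alpha` are hypothetical.

```python
def prob_keep_phase(n11, n22, n12, n21, alpha=0.01):
    """Probability of choosing h11/h22 rather than h12/h21.

    n11, n22, n12, n21 are the counts of the four candidate haplotypes
    among the other individuals, within the current window. alpha is the
    parameter of the symmetric Dirichlet prior (default is illustrative).
    """
    keep = (n11 + alpha) * (n22 + alpha)
    switch = (n12 + alpha) * (n21 + alpha)
    return keep / (keep + switch)

# No information in the rest of the sample: both phases are equally likely.
p_uninformative = prob_keep_phase(0, 0, 0, 0)
# Current haplotypes seen 5 and 3 times, alternative ones never seen.
p_supported = prob_keep_phase(5, 3, 0, 0)
```

With a small prior parameter, haplotypes already observed in other individuals are strongly favored over haplotypes that have never been seen.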
7.1.3.2.3.2 Recombination update
Instead of performing a switch update as before, we can also update the phase using a recombination update, like:
Current phase in selected window:
ACCTCGCCT
GCTATCTAG
Right recombination phase update:
ACCTTCTAG
GCTACGCCT
In that case, we choose to change the phase of all sites located either to the right or to the left of the focal site. The proportion of updates being recombination steps can be set in the ELB tab dialog as shown in section 6.3.8.4.2.1. A small value is in order (less than 5%) since a recombination update implies a large change, which may often be rejected and cause the chain not to mix properly. The rationale for this kind of update (initially not described in Excoffier et al. (2003)) is to explore the set of possible gametic phases more widely by provoking a radical change from time to time.
7.1.3.2.3.3 Handling mutations
Increasing the parameter $\alpha$ of the Dirichlet prior thus allows more flexibility to choose new haplotypes, but this is a noisy solution: all unobserved haplotypes are treated the same. However, a recent mutation event can create haplotypes that are rare, but similar to a more common haplotype, whereas haplotypes that are very dissimilar to all observed haplotypes are highly implausible. This phenomenon is particularly prevalent for STR loci, with their relatively high mutation rates. To encapsulate the effect of mutation, when making a phase assignment we give additional weight to an unobserved haplotype for each observed haplotype that is close to it. Here, we define "close to" to mean "differs at one locus", and in the phase update we choose h11/h22 rather than h12/h21 with probability
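$$\Pr(h_{11}/h_{22} \mid \{n_{ij}\}, \{n_{ij\_1}\}) = \frac{(n_{11}+\alpha+\varepsilon n_{11\_1})(n_{22}+\alpha+\varepsilon n_{22\_1})}{(n_{11}+\alpha+\varepsilon n_{11\_1})(n_{22}+\alpha+\varepsilon n_{22\_1}) + (n_{12}+\alpha+\varepsilon n_{12\_1})(n_{21}+\alpha+\varepsilon n_{21\_1})} \qquad (2)$$
where $n_{ij\_1}$ is the sample count of haplotypes that are close to hij within the current window. The parameter $\varepsilon$ should be larger for STR than for SNP or DNA data. By simulation we have found that a value of $\varepsilon$=0.1 gave good results for STR (microsatellite) data, and that a value of $\varepsilon$=0.01 worked well for other data types.
7.1.3.2.3.4 Sliding window size updates
The value of R = max{r, 1/r}, where r = p11p22/p12p21, gives a measure of linkage disequilibrium (LD) within the window. Broadly speaking, at each choice between two windows, we would generally prefer the window that gives the largest value to R. Based on (2), a natural estimate of r is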
r=
(3)
Thus, at each attempt to update the length of a window in step 3) above, we choose
R2 R1 + R 2
(4)
Even a large value for can fail to prevent a window from growing too large when two consecutive heterozygous loci in an individual are separated by many homozygous loci. The window must then be large in order to contain the necessary minimum of two heterozygous loci. To circumvent the problem of small haplotype counts which may then
result, when updating an individual's phase allocation, we can ignore homozygous loci that are separated from the nearest heterozygous locus by more than a given number of intervening homozygous loci. This is the parameter called "Heterozygous site influence zone" to be chosen in the ELB tab dialog in section 6.3.8.4.2.1.
7.1.3.2.3.5 Handling missing data
In handling missing data, the philosophy underpinning ELB is to ignore the affected loci rather than to impute missing data or to augment the space of possible genotypes. In the presence of missing data, the haplotype counts $n_{ij}$ and $n_{ij\_1}$ are not necessarily integers: individuals with missing data at m loci within a current window of length L contribute 1-m/L to $n_{ij}$ (or $n_{ij\_1}$) for each haplotype at which the remaining L-m loci match hij exactly (or with one mismatch).
Reference: Excoffier et al. (2003)
$$L_0 = \frac{\prod_{i=1}^{k_1} n_{i*}!\;\prod_{j=1}^{k_2} n_{*j}!}{n!\;\prod_{i=1}^{k_1}\prod_{j=1}^{k_2} n_{ij}!}$$
where the $n_{ij}$'s denote the counts of the haplotypes that have the i-th allele at the first locus and the j-th allele at the second locus, $n_{i*}$ is the overall count of the i-th allele at the first locus (i=1,... k1) and $n_{*i}$ is the count of the i-th allele at the second locus (i=1,... k2). Instead of enumerating all possible contingency tables, a Markov chain is used to efficiently explore the space of all possible tables. This Markov chain consists of a random walk in the space of all contingency tables. It is done in such a way that the probability of visiting a particular table corresponds to its actual probability under the null hypothesis of linkage equilibrium. A particular table is modified according to the following rules (see also Guo and Thompson, 1992; or Raymond and Rousset, 1995):
1) We select in the table two distinct lines i1, i2 and two distinct columns j1, j2 at random.
2) The new table is obtained by decreasing the counts of the cells (i1, j1) (i2, j2) and increasing the counts of the cells (i1, j2) (i2, j1) by one unit. This leaves the marginal allele counts ni unchanged.
3) The switch to the new table is accepted with a probability equal to
$$R = \frac{L_1}{L_0} = \frac{n_{i_1,j_1}\,n_{i_2,j_2}}{(n_{i_1,j_2}+1)(n_{i_2,j_1}+1)}$$
where R is just the ratio of the probabilities of the two tables. The steps 1-3 are done a large number of times to explore a large amount of the space of all possible contingency tables having identical marginal counts. In order to start from a random initial position in the Markov chain, the chain is explored for a pre-defined number of steps (the dememorization phase) before the probabilities of the switched tables are compared to that of the initial table. The number of dememorization steps should be large enough (some thousands) to allow the Markov chain to "forget" its initial state, and make it independent from its starting point. The P-value of the test is then taken as the proportion of the visited tables having a probability smaller than or equal to that of the observed contingency table.
A standard error on P is estimated by subdividing the total amount of required steps into B batches (see Guo and Thompson, 1992, p. 367). A P-value is calculated separately for each batch. Let us denote it by Pi (i=1,...,B). The estimated standard error is then calculated as
$$s.d.(P) = \sqrt{\frac{\sum_{i=1}^{B}(P - P_i)^{2}}{B\,(B-1)}}$$
The process is stopped as soon as the estimated standard deviation is smaller than a pre-defined value specified by the user.
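The Markov chain described above can be sketched in Python as follows. This is a simplified illustration and not Arlequin's code: the function names, the default numbers of steps and the fixed random seed are hypothetical, the chain is run for a fixed number of steps instead of being stopped on the standard deviation criterion, and the batch estimate of the standard error is left out. A proposed switch is accepted with probability min(1, R), which is the usual Metropolis rule.

```python
import math
import random

def log_table_prob(table):
    """Log-probability of a contingency table given its marginal counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    lg = math.lgamma   # lgamma(x + 1) = log(x!)
    return (sum(lg(x + 1) for x in rows) + sum(lg(x + 1) for x in cols)
            - lg(n + 1) - sum(lg(x + 1) for r in table for x in r))

def exact_ld_test(table, steps=20000, dememorization=2000, seed=1):
    """Estimate the P-value of the exact test by a random walk on tables."""
    rng = random.Random(seed)
    t = [list(r) for r in table]
    k1, k2 = len(t), len(t[0])
    log_obs = log_table_prob(t)
    log_cur = log_obs
    hits = 0
    for step in range(dememorization + steps):
        i1, i2 = rng.sample(range(k1), 2)
        j1, j2 = rng.sample(range(k2), 2)
        if t[i1][j1] > 0 and t[i2][j2] > 0:
            r = (t[i1][j1] * t[i2][j2]) / ((t[i1][j2] + 1) * (t[i2][j1] + 1))
            if rng.random() < r:   # accepted with probability min(1, R)
                t[i1][j1] -= 1
                t[i2][j2] -= 1
                t[i1][j2] += 1
                t[i2][j1] += 1
                log_cur += math.log(r)
        # After dememorization, count tables no more probable than the
        # observed one (small tolerance for rounding errors).
        if step >= dememorization and log_cur <= log_obs + 1e-9:
            hits += 1
    return hits / steps

# Hypothetical two-locus haplotype counts (2 alleles per locus).
p_strong_ld = exact_ld_test([[10, 0], [0, 10]])
p_no_ld = exact_ld_test([[5, 5], [5, 5]])
```

The first table shows complete association between the alleles of the two loci and gives a P-value close to zero, whereas the second one is the most probable table for its marginal counts and gives a P-value of one.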
7.1.4.2 Likelihood ratio test of linkage disequilibrium (genotypic data, gametic phase unknown)
For genotypic data where the haplotypic phase is unknown, the test based on the Markov chain described above is not possible because the haplotypic composition of the sample is unknown, and is just estimated. Therefore, linkage disequilibrium between a pair of loci is tested for genotypic data using a likelihood-ratio test, whose empirical distribution is obtained by a permutation procedure (Slatkin and Excoffier, 1996). The likelihood of the data assuming linkage equilibrium ($L_{H^*}$) is computed by using the fact that, under this hypothesis, the haplotype frequencies are obtained as the product of the allele frequencies. The likelihood of the data not assuming linkage equilibrium ($L_H$) is obtained by applying the EM algorithm to estimate haplotype frequencies. The likelihood-ratio statistic given by
$$S = -2\log\left(\frac{L_{H^*}}{L_H}\right)$$
should in principle follow a Chi-square distribution, with (k1-1)(k2-1) degrees of freedom, but this is not always the case in small samples with a large number of alleles per locus. In order to better approximate the underlying distribution of the likelihood-ratio statistic under the null hypothesis of linkage equilibrium, we use the following permutation procedure:
1) Permute the alleles between individuals at one locus only.
2) Re-estimate the likelihood of the data $L_H$ with the EM algorithm. Note that $L_{H^*}$ is unaffected by the permutation procedure.
3) Repeat steps 1-2 a large number of times to get the null distribution of $L_H$, and therefore the null distribution of S.
Note that this test of linkage disequilibrium assumes Hardy-Weinberg proportions of genotypes, and the rejection of the test could also be due to a departure from Hardy-Weinberg equilibrium (see Excoffier and Slatkin, 1998).
Reference: Excoffier and Slatkin (1998)
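D, D', and r2 coefficients:
Note that these coefficients are computed between all pairs of alleles at different loci, and that their computation assumes that the gametic phase between alleles at different loci is known.
1) $$D_{ij} = p_{ij} - p_i p_j,$$
where $p_{ij}$ is the frequency of the haplotype having allele i at the first locus and allele j at the second locus, and $p_i$ and $p_j$ are the frequencies of alleles i and j, respectively.
2) $$D'_{ij} = \frac{D_{ij}}{D_{ij,max}},$$
where $D_{ij,max}$ takes one of the following values:
$\min\left(p_i p_j,\ (1-p_i)(1-p_j)\right)$ if $D_{ij} < 0$
$\min\left((1-p_i)\,p_j,\ p_i\,(1-p_j)\right)$ if $D_{ij} > 0$
3) $$r^{2} = \frac{D^{2}}{p_i(1-p_i)\,p_j(1-p_j)}.$$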
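The three coefficients can be computed with a few lines of Python. This is a sketch only: the function name is hypothetical, the maximum value of D is taken in its classical form (Lewontin, 1964), and both alleles are assumed to be polymorphic (frequencies strictly between 0 and 1).

```python
def ld_coefficients(p_ij, p_i, p_j):
    """Return (D, D', r2) for allele i at locus 1 and allele j at locus 2.

    p_ij is the frequency of the haplotype carrying both alleles, p_i and
    p_j are the allele frequencies.
    """
    d = p_ij - p_i * p_j
    if d < 0:
        d_max = min(p_i * p_j, (1 - p_i) * (1 - p_j))
    else:
        d_max = min((1 - p_i) * p_j, p_i * (1 - p_j))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_i * (1 - p_i) * p_j * (1 - p_j))
    return d, d_prime, r2

# Complete association: the two alleles always occur together.
d, d_prime, r2 = ld_coefficients(0.5, 0.5, 0.5)
# Linkage equilibrium: the haplotype frequency is the product p_i * p_j.
d0, d_prime0, r2_0 = ld_coefficients(0.25, 0.5, 0.5)
```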
$$L_0 = \frac{n!\;\prod_{i=1}^{k} n_{i*}!}{(2n)!\;\prod_{i=1}^{k}\prod_{j=1}^{i} n_{ij}!}\;2^{H},$$
where H is the number of heterozygous individuals. Much as for the test of linkage disequilibrium, we explore alternative contingency tables having the same marginal counts. In order to create a new contingency table from an existing one, we select two distinct lines i1, i2 and two distinct columns j1, j2 at random. The new table is obtained by decreasing the counts of the cells (i1, j1) (i2, j2) and increasing the counts of the cells (i1, j2) (i2, j1) by one unit. This leaves the allele counts ni unchanged. The switch to the new table is accepted with a probability R equal to:
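1) $$R = \frac{L_{n+1}}{L_n} = \frac{n_{i_1 j_1}\,n_{i_2 j_2}}{(n_{i_1 j_2}+1)(n_{i_2 j_1}+1)}\cdot\frac{(1+\delta_{i_1 j_1})(1+\delta_{i_2 j_2})}{(1+\delta_{i_1 j_2})(1+\delta_{i_2 j_1})}, \quad \text{if } i_1 \neq j_1 \text{ or } i_2 \neq j_2$$
2) $$R = \frac{L_{n+1}}{L_n} = \frac{n_{i_1 j_1}\,n_{i_2 j_2}}{(n_{i_1 j_2}+1)(n_{i_1 j_2}+2)}\cdot\frac{4}{1}, \quad \text{if } i_1 = j_1 \text{ and } i_2 = j_2$$
3) $$R = \frac{L_{n+1}}{L_n} = \frac{n_{i_1 j_1}\,(n_{i_2 j_2}-1)}{(n_{i_1 j_2}+1)(n_{i_2 j_1}+1)}\cdot\frac{1}{4}, \quad \text{if } i_1 = j_2 \text{ and } i_2 = j_1.$$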
As usual, $\delta$ denotes the Kronecker function. R is just the ratio of the probabilities of the two tables. The switch to the new table is accepted if R is larger than 1. The P-value of the test is the proportion of the visited tables having a probability smaller than or equal to that of the observed (initial) contingency table. The standard error on the P-value is estimated as in the case of linkage disequilibrium, using a system of batches (see section 7.1.4.1).
Reference: Guo and Thompson (1992)
($\hat{\theta}_{Hom}$) (see section 7.1.2.3.1) to compute the probability of observing a random neutral sample with a number of alleles similar to or larger than the observed value ($\Pr(K \geq k_{obs})$) (see section 7.1.2.3.3 to see how this probability can be computed). It is an approximation of the conditional probability of observing some number of alleles given the observed homozygosity.
References: Ewens (1972), Chakraborty (1990)
$$D = \frac{\hat{\theta}_{\pi} - \hat{\theta}_{S}}{\sqrt{Var(\hat{\theta}_{\pi} - \hat{\theta}_{S})}},$$
where $\hat{\theta}_{\pi} = \hat{\pi}$ and $\hat{\theta}_{S} = S / \sum_{i=1}^{n-1}(1/i)$, and S is the number of segregating sites in the sample. The limits of confidence intervals around D may be found in Table 2 of Tajima's paper (Tajima 1989a) for different sample sizes. The significance of the D statistic is tested by generating random samples under the hypothesis of selective neutrality and population equilibrium, using a coalescent simulation algorithm adapted from Hudson (1990). The P-value of the D statistic is then obtained as the proportion of random D statistics less than or equal to the observation. We also provide a parametric approximation of the P-value assuming a beta-distribution limited by the minimum and maximum possible D values (see Tajima 1989a, p.589). Note that significant D values can be due to factors other than selective effects, like population expansion, bottleneck, or heterogeneity of mutation rates (see Tajima, 1993; Aris-Brosou and Excoffier, 1996; or Tajima 1996, for further details).
References: Tajima (1993), Aris-Brosou and Excoffier (1996), Tajima (1996)
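The computation of the D statistic itself can be sketched in Python as follows. The variance term is not spelled out in this section; the sketch uses the constants of Tajima (1989a), and the function name and the example values are hypothetical. The significance test by coalescent simulations is not shown.

```python
import math

def tajimas_d(n, S, pi):
    """Tajima's D from the sample size n, the number of segregating
    sites S and the mean number of pairwise differences pi."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_s = S / a1                              # estimate based on S
    variance = e1 * S + e2 * S * (S - 1)          # Var(pi - theta_s)
    return (pi - theta_s) / math.sqrt(variance)

# Hypothetical sample: 10 sequences, 16 segregating sites,
# 3.888889 pairwise differences on average.
d_stat = tajimas_d(10, 16, 3.888889)
```

A negative value, as in this example, indicates an excess of low-frequency variants relative to the neutral equilibrium expectation.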
$$S' = \Pr(K \geq k_{obs} \mid \theta = \hat{\theta}_{\pi})$$
and defines the FS statistic as the logit of S'
$$F_S = \ln\left(\frac{S'}{1 - S'}\right) \qquad \text{(Fu, 1997)}$$
Fu (1997) has noticed that the FS statistic was very sensitive to population demographic expansion, which generally leads to large negative FS values.
The significance of the FS statistic is tested by generating random samples under the hypothesis of selective neutrality and population equilibrium, using a coalescent simulation algorithm adapted from Hudson (1990). The P-value of the FS statistic is then obtained as the proportion of random FS statistics less than or equal to the observation. Using simulations, Fu noticed that the 2% percentile of the distribution corresponded to the 5% cutoff value (i.e. the critical value of the test at the 5% significance level). We indeed confirmed this behavior by our own simulations. Even though this property is not fully understood, it means that an FS statistic should be considered as significant at the 5% level if its P-value is below 0.02, and not below 0.05.
Reference: Fu (1997)
7.2 Inter-population level methods
7.2.1 Population genetic structure inferred by analysis of variance (AMOVA)
The genetic structure of populations is investigated here by an analysis of variance framework, as initially defined by Cockerham (1969, 1973), and extended by others (see e.g. Weir and Cockerham, 1984; Long 1986). The Analysis of Molecular Variance approach used in Arlequin (AMOVA, Excoffier et al. 1992) is essentially similar to other approaches based on analyses of variance of gene frequencies, but it takes into account the number of mutations between molecular haplotypes (which first need to be evaluated). By defining groups of populations, the user defines a particular genetic structure that will be tested (see the input file notations for more details). A hierarchical analysis of variance partitions the total variance into covariance components due to intra-individual differences, inter-individual differences, and/or inter-population differences. See also Weir (1996) for detailed treatments of hierarchical analyses, and Excoffier (2000) as well as Rousset (2000) for an explanation of why these are covariance components rather than variance components. The covariance components ($\sigma_i^2$'s) are used to compute fixation indices, as originally defined by Wright (1951, 1965) in terms of inbreeding coefficients, or later in terms of coalescent times by Slatkin (1991). Formally, in the haploid case, we assume that the i-th haplotype frequency vector from the j-th population in the k-th group is a linear equation of the form
$$\mathbf{x}_{ijk} = \mathbf{x} + \mathbf{a}_k + \mathbf{b}_{jk} + \mathbf{c}_{ijk}.$$
The vector x is the unknown expectation of xijk, averaged over the whole study. The effects are a for group, b for population, and c for haplotypes within a population within a group, assumed to be additive, random, independent, and to have the associated covariance components $\sigma_a^2$, $\sigma_b^2$, and $\sigma_c^2$, respectively.
The total molecular variance ($\sigma_T^2$) is the sum of the covariance component due to differences among haplotypes within a population ($\sigma_c^2$), the covariance component due to differences among haplotypes in different populations within a group ($\sigma_b^2$), and the covariance component due to differences among the G groups ($\sigma_a^2$). The same framework could be extended to additional hierarchical levels, such as to accommodate, for instance, the covariance component due to differences between haplotypes within diploid individuals. Note that in the case of a simple hierarchical genetic structure consisting of haploid individuals in populations, the implemented form of the algorithm leads to a fixation index FST which is absolutely identical to the weighted average F-statistic over loci, $\hat{\theta}_w$, defined by Weir and Cockerham (1984) (see Michalakis and Excoffier 1996 for a formal proof). In terms of inbreeding coefficients and coalescence times, this FST can be expressed as
$$F_{ST} = \frac{f_0 - f_1}{1 - f_1} = \frac{t_1 - t_0}{t_1}, \qquad \text{(Slatkin, 1991)}$$
where $f_0$ is the probability of identity by descent of two different genes drawn from the same population, $f_1$ is the probability of identity by descent of two genes drawn from two different populations, $t_1$ is the mean coalescence time of two genes drawn from two different populations, and $t_0$ is the mean coalescence time of two genes drawn from the same population.
The significance of the fixation indices is tested using a non-parametric permutation approach described in Excoffier et al. (1992), consisting of permuting haplotypes, individuals, or populations, among individuals, populations, or groups of populations. After each permutation round, we recompute all statistics to get their null distribution. Depending on the tested statistic and the given hierarchical design, different types of permutations are performed. Under this procedure, the normality assumption usual in analysis of variance tests is no longer necessary, nor is it necessary to assume equality of variance among populations or groups of populations. A large number of permutations (1,000 or more) is necessary to obtain some accuracy on the final probability. A system of batches similar to that used in the exact test of linkage disequilibrium (see end of section 7.1.4.1) has been implemented to get an idea of the standard deviation of the P-values.
We have implemented here 6 different types of hierarchical AMOVA. The number of hierarchical levels varies from two to four. In each of the situations, we describe the way the total sum of squares is partitioned, how the covariance components and the associated F-statistics are obtained, and which permutation schemes are used for the significance test.
Before enumerating all the possible situations, we introduce some notations:
SSD(T): Total sum of squared deviations.
SSD(AG): Sum of squared deviations Among Groups of populations.
SSD(AP): Sum of squared deviations Among Populations.
SSD(AI): Sum of squared deviations Among Individuals.
SSD(WP): Sum of squared deviations Within Populations.
SSD(WI): Sum of squared deviations Within Individuals.
SSD(AP/WG): Sum of squared deviations Among Populations, Within Groups.
SSD(AI/WP): Sum of squared deviations Among Individuals, Within Populations.
G: Number of groups in the structure.
P: Total number of populations.
N: Total number of individuals for genotypic data or total number of gene copies for haplotypic data.
$N_p$: Number of individuals in population p for genotypic data or total number of gene copies in population p for haplotypic data.
$N_g$: Number of individuals in group g for genotypic data or total number of gene copies in group g for haplotypic data.
$$n = \frac{N - \sum_{p \in P}\frac{N_p^{2}}{N}}{P - 1}, \qquad F_{ST} = \frac{\sigma_a^{2}}{\sigma_T^{2}}.$$
We test $\sigma_a^2$ and FST by permuting haplotypes among populations.
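The simplest design (haplotypic data, one group of populations) can be sketched in Python as follows. This is an illustration under stated assumptions and not Arlequin's implementation: the sums of squared deviations are computed from squared pairwise distances as in Excoffier et al. (1992), the within-population component is estimated by the within-population mean square, and the function names, the example data, the number of permutations and the random seed are all hypothetical.

```python
import random

def amova_fst(dist2, pops):
    """FST for haplotypic data in a single group of populations.

    dist2[i][j] is the squared distance between gene copies i and j,
    pops[i] is the population label of gene copy i.
    """
    N = len(pops)
    labels = sorted(set(pops))
    P = len(labels)
    ssd_total = sum(dist2[i][j] for i in range(N) for j in range(i + 1, N)) / N
    ssd_wp = 0.0
    sizes = []
    for lab in labels:
        idx = [i for i in range(N) if pops[i] == lab]
        sizes.append(len(idx))
        ssd_wp += sum(dist2[i][j] for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)
    ssd_ap = ssd_total - ssd_wp
    n = (N - sum(s * s for s in sizes) / N) / (P - 1)
    sigma2_within = ssd_wp / (N - P)
    sigma2_a = (ssd_ap / (P - 1) - sigma2_within) / n
    return sigma2_a / (sigma2_a + sigma2_within)

def permutation_test(dist2, pops, n_perm=1000, seed=1):
    """Observed FST and the proportion of permuted FST values >= observed."""
    rng = random.Random(seed)
    observed = amova_fst(dist2, pops)
    shuffled = list(pops)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # permute haplotypes among populations
        if amova_fst(dist2, shuffled) >= observed:
            count += 1
    return observed, count / n_perm

# Two hypothetical populations of 4 gene copies, fixed for different haplotypes.
haplotypes = ["AAAA"] * 4 + ["TTTT"] * 4
pops = ["pop1"] * 4 + ["pop2"] * 4
dist2 = [[sum(a != b for a, b in zip(x, y)) for y in haplotypes] for x in haplotypes]
fst, p_value = permutation_test(dist2, pops)
```

Here the number of pairwise differences is used as the squared distance between haplotypes. The two populations share no haplotype, so that FST is equal to one and only a small fraction of the permutations reach a value as large as the observed one.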
[AMOVA table (partial): expected mean squares $n\sigma_b^2 + \sigma_c^2$, $\sigma_c^2$, $\sigma_T^2$]
SG =
g G p g
N2 p
g
, n=
N SG P G
SG n' =
N2 p
N pP , n' ' = G 1
g G
2 Ng
G 1
2 2 a +b 2 T
FCT =
We test We test We test
2 a 2 T
FSC =
2 b 2 2 b +c
and FST =
2 c 2 b 2 a
and FST by permuting haplotypes among populations among groups. and FSC by permuting haplotypes among populations within groups. and FCT by permuting populations among groups.
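As an illustration of how the coefficients of the several-groups design depend on the sample sizes, the sketch below computes $S_G$, $n$, $n'$ and $n''$ for haplotypic data, assuming the definitions $S_G = \sum_g \sum_{p \in g} N_p^2 / N_g$, $n = (N - S_G)/(P - G)$, $n' = (S_G - \sum_p N_p^2/N)/(G - 1)$ and $n'' = (N - \sum_g N_g^2/N)/(G - 1)$. The function name and the input layout (one list of population sizes per group) are assumptions made for this example.

```python
def hierarchical_coefficients(groups):
    """groups: list of groups, each a list of population sizes N_p (gene copies).

    Returns (n, n', n'') for the haplotypic design with several groups.
    """
    G = len(groups)
    P = sum(len(group) for group in groups)
    N = sum(sum(group) for group in groups)
    # S_G: within each group, sum of squared population sizes over the group size
    s_g = sum(sum(size ** 2 for size in group) / sum(group) for group in groups)
    sum_np2 = sum(size ** 2 for group in groups for size in group)
    sum_ng2 = sum(sum(group) ** 2 for group in groups)
    n = (N - s_g) / (P - G)
    n_prime = (s_g - sum_np2 / N) / (G - 1)
    n_second = (N - sum_ng2 / N) / (G - 1)
    return n, n_prime, n_second
```

In a balanced design the coefficients reduce to simple counts: with two groups of two populations and five gene copies per population, $n$ and $n'$ equal the population size and $n''$ equals the group size.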
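$$n = \frac{2N - \sum_{p \in P} \frac{2N_p^2}{N}}{P - 1}, \qquad F_{ST} = \frac{\sigma_a^2}{\sigma_T^2}.$$

We test $\sigma_a^2$ and $F_{ST}$ by a permutation test.

Expected mean squares: $\sigma_c^2$ within populations; the total variance is $\sigma_T^2$.

$$S_G = \sum_{g \in G} \sum_{p \in g} \frac{2N_p^2}{N_g}, \qquad n = \frac{2N - S_G}{P - G},$$

$$n' = \frac{S_G - \sum_{p \in P} \frac{2N_p^2}{N}}{G - 1}, \qquad n'' = \frac{2N - \sum_{g \in G} \frac{2N_g^2}{N}}{G - 1},$$

$$F_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}, \qquad F_{CT} = \frac{\sigma_a^2}{\sigma_T^2}, \qquad \text{and} \qquad F_{ST} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_T^2}.$$

We test $\sigma_c^2$ and $F_{ST}$ by permuting haplotypes among populations and among groups.
We test $\sigma_b^2$ and $F_{SC}$ by permuting haplotypes among populations but within groups.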
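We test $\sigma_a^2$ and $F_{CT}$ by permuting populations among groups.

$$F_{IS} = \frac{\sigma_a^2}{\sigma_T^2}.$$

We test $\sigma_a^2$ and $F_{IS}$ by permuting haplotypes among individuals.

Expected mean squares: $\sigma_c^2$ within individuals; the total variance is $\sigma_T^2$.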
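$$n = \frac{2N - \sum_{p \in P} \frac{2N_p^2}{N}}{P - 1},$$

$$F_{ST} = \frac{\sigma_a^2}{\sigma_T^2}, \qquad F_{IT} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_T^2}, \qquad \text{and} \qquad F_{IS} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}.$$

We test $\sigma_c^2$ and $F_{IT}$ by permuting haplotypes among individuals among populations.
We test $\sigma_b^2$ and $F_{IS}$ by permuting haplotypes among individuals within populations.
We test $\sigma_a^2$ and $F_{ST}$ by permuting individual genotypes among populations.

Expected mean squares: $n\sigma_b^2 + 2\sigma_c^2 + \sigma_d^2$ among populations within groups, $2\sigma_c^2 + \sigma_d^2$ among individuals within populations, and $\sigma_d^2$ within individuals; the total variance is $\sigma_T^2$.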
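$$n = \frac{2N - \sum_{g \in G} \sum_{p \in g} \frac{2N_p^2}{N_g}}{P - G}, \qquad n' = \frac{\sum_{g \in G} \frac{(N - N_g)}{N_g} \sum_{p \in g} 2N_p^2}{N(G - 1)}, \qquad n'' = \frac{2N - \sum_{g \in G} \frac{2N_g^2}{N}}{G - 1},$$

$$F_{CT} = \frac{\sigma_a^2}{\sigma_T^2}, \qquad F_{IT} = \frac{\sigma_a^2 + \sigma_b^2 + \sigma_c^2}{\sigma_T^2}, \qquad F_{IS} = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_d^2}, \qquad \text{and} \qquad F_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2 + \sigma_d^2}.$$

We test $\sigma_d^2$ and $F_{IT}$ by permuting haplotypes among populations and among groups.
We test $\sigma_c^2$ and $F_{IS}$ by permuting haplotypes among individuals within populations.
We test $\sigma_b^2$ and $F_{SC}$ by permuting individual genotypes among populations but within groups.
We test $\sigma_a^2$ and $F_{CT}$ by permuting populations among groups.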
$$F_{ST} = \frac{\sum_{i=1}^{P} n_i F_{ST_i}}{\sum_{i=1}^{P} n_i},$$

where $n_i$ is the number of gene copies sampled in the $i$-th population. Following on that, we propose to use as population-specific value for the $i$-th population the quantity:

$$F_{ST_i} = \frac{\frac{1}{n}\left[\frac{1}{P - 1}\, SSD(AP) - \frac{1}{n_i}\, \frac{N}{N - P}\, SSD(WP_i)\right]}{\sigma_T^2}$$
which satisfies the above equation. We assume here that there is a single hierarchical level, with genes within populations. We therefore follow the notations found in section 7.2.1.1. The option to compute these population-specific FST indices is offered when a single group of population samples is defined for haplotypic or genotypic data. Intuitively, these population-specific coefficients would represent the degree of evolution of particular populations from a common ancestral population which would have split into all the demes considered in the Genetic Structure. These coefficients are provided here mainly to see if some populations do contribute differently than others to the average FST, which could be indicative of special evolutionary constraints in these populations (selection, bottleneck, etc). Note that in locus-by-locus analyses, we have noticed that populations with two alleles and one being a singleton will show large negative population-specific FST indices (which can even be smaller than -1), which is clearly an artifact because SSD(AP) will be very small while SSD(WPi) will still be substantial.
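The weighted-average property can be checked numerically with a short sketch. It assumes the population-specific index is $F_{ST_i} = \frac{1}{n}\left[\frac{SSD(AP)}{P-1} - \frac{N}{n_i (N-P)} SSD(WP_i)\right] / \sigma_T^2$, with $n$, $\sigma_a^2$ and $\sigma_b^2$ defined as in the single-group haplotypic design; the function name and argument layout are hypothetical.

```python
def population_specific_fst(ssd_ap, ssd_wp_i, n_i):
    """ssd_ap: SSD(AP); ssd_wp_i: list of SSD(WP_i); n_i: gene copies per population.

    Returns (global F_ST, list of population-specific F_ST_i).
    """
    P, N = len(n_i), sum(n_i)
    n = (N - sum(size ** 2 for size in n_i) / N) / (P - 1)
    var_b = sum(ssd_wp_i) / (N - P)                 # sigma_b^2 from SSD(WP)
    var_a = (ssd_ap / (P - 1) - var_b) / n          # sigma_a^2
    var_t = var_a + var_b                           # sigma_T^2
    fst_i = [
        (ssd_ap / (P - 1) - N * wp / (size * (N - P))) / (n * var_t)
        for wp, size in zip(ssd_wp_i, n_i)
    ]
    return var_a / var_t, fst_i
```

Weighting each $F_{ST_i}$ by $n_i$ and dividing by $N$ recovers the global $F_{ST}$, because the $SSD(WP_i)$ terms sum to $SSD(WP)$. A population with a small SSD(AP) and a comparatively large $SSD(WP_i)$ gets a strongly negative index, which is the artifact described above.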
$$F_{ST} = 1 - \left(1 - \frac{1}{N}\right)^{t} \approx 1 - e^{-t/N}$$

The genetic distance $D = -\log(1 - F_{ST})$ is thus approximately proportional to $t/N$ for short divergence times.
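A quick numerical check of this approximation, as a sketch with hypothetical function names: it evaluates the expected $F_{ST}$ after $t$ generations in a population of size $N$ and converts it to the distance $D = -\log(1 - F_{ST})$.

```python
import math


def expected_fst(t, N):
    """Expected F_ST after t generations of drift in a population of size N."""
    return 1.0 - (1.0 - 1.0 / N) ** t


def log_distance(fst):
    """Distance D = -log(1 - F_ST), approximately t/N for short divergence times."""
    return -math.log(1.0 - fst)
```

For example, `log_distance(expected_fst(50, 1000))` is close to `50 / 1000`.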
The populations have remained isolated ever since, without exchanging any migrants. Under such conditions, $F_{ST}$ can be expressed in terms of the coalescence times $\bar{t}_1$, which is the mean coalescence time of two genes drawn from two different populations, and $\bar{t}_0$, which is the mean coalescence time of two genes drawn from the same population. Using the analysis of variance approach, the $F_{ST}$'s are expressed as

$$F_{ST} = \frac{\bar{t}_1 - \bar{t}_0}{\bar{t}_1}$$

Because $\bar{t}_0$ is equal to $N$ generations (see e.g. Hudson, 1990), and $\bar{t}_1$ is equal to $\tau + N$ generations, the above expression reduces to

$$F_{ST} = \frac{\tau}{\tau + N}$$

Therefore, the ratio $D = F_{ST}/(1 - F_{ST})$ is equal to $\tau/N$, and is thus proportional to the divergence time between the two populations.
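The linearization can be verified with a few lines of code. This is a sketch with hypothetical function names, assuming $F_{ST} = \tau / (\tau + N)$ as the starting point.

```python
def fst_from_divergence(tau, N):
    """Expected F_ST for two isolated populations of size N that split tau generations ago."""
    return tau / (tau + N)


def linearized_distance(fst):
    """Ratio D = F_ST / (1 - F_ST), which grows linearly with divergence time."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("F_ST must lie in [0, 1) for the ratio to be defined")
    return fst / (1.0 - fst)
```

With `tau = 200` and `N = 1000`, `linearized_distance(fst_from_divergence(200, 1000))` returns a value equal to `tau / N` up to rounding.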
$$F_{ST} = \frac{1}{2M + 1}$$

Therefore, M, which is the absolute number of migrants exchanged between the two populations, can be estimated by

$$M = \frac{1 - F_{ST}}{2 F_{ST}}$$
If one was to consider that the two populations only exchange with each other and with no other populations, then one should divide the quantity M by a factor 2 to obtain an estimator M' = Nm for haploid populations, or M'= 2Nm for diploid populations. This is because the expectation of FST is indeed given by
$$F_{ST} = \frac{1}{\dfrac{4Nmd}{d - 1} + 1}$$
where d is the number of demes exchanging genes. When d is large this tends towards the classical value 1/(4Nm +1), but when d=2, then the expectation of FST is 1/(8Nm+1).
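The two relations above can be sketched in code. The function names are hypothetical, and the expectation for $d$ demes is assumed to read $F_{ST} = 1 / (4Nm \cdot d/(d-1) + 1)$, which gives $1/(8Nm+1)$ for $d = 2$ and tends to $1/(4Nm+1)$ for large $d$.

```python
def migrants_from_fst(fst):
    """Estimate M from F_ST using M = (1 - F_ST) / (2 F_ST)."""
    if not 0.0 < fst < 1.0:
        raise ValueError("F_ST must lie strictly between 0 and 1")
    return (1.0 - fst) / (2.0 * fst)


def expected_fst_d_demes(nm, d):
    """Expected F_ST when d demes exchange genes, with Nm migrants per generation."""
    if d < 2:
        raise ValueError("at least two demes are required")
    return 1.0 / (4.0 * nm * d / (d - 1) + 1.0)
```

For instance, an $F_{ST}$ of 0.2 corresponds to $M = 2$, which would be halved under the assumption that the two populations exchange migrants only with each other.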
$$D = \hat{\pi}_{12} = \sum_{i=1}^{k} \sum_{j=1}^{k'} x_{1i}\, x_{2j}\, \delta_{ij}, \qquad \text{and} \qquad D_A = \hat{\pi}_{12} - \frac{\hat{\pi}_1 + \hat{\pi}_2}{2},$$

where $k$ and $k'$ are the number of distinct haplotypes in populations 1 and 2 respectively, $x_{1i}$ is the frequency of the $i$-th haplotype in population 1, and $\delta_{ij}$ is the number of differences between haplotype $i$ and haplotype $j$. Under the same notation concerning coalescence times as described above, the expectation of $D_A$ is

$$D_A = 2\mu(\bar{t}_1 - \bar{t}_0) = 2\mu\tau,$$

where $\mu$ is the average mutation rate per nucleotide and $\tau$ is the divergence time between the two populations. Thus $D_A$ is also expected to increase linearly with divergence times between the populations.
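A small sketch of the raw and net number of pairwise differences between two populations follows. It assumes that both frequency vectors are indexed over one common list of haplotypes (absent haplotypes have frequency 0), and that the within-population diversities are computed from the same frequency-weighted sum without any sample-size correction; both are simplifications made for this example, and the function names are hypothetical.

```python
def mean_pairwise_differences(freq_a, freq_b, delta):
    """Frequency-weighted mean number of differences between two sets of haplotypes."""
    return sum(
        xa * xb * delta[i][j]
        for i, xa in enumerate(freq_a)
        for j, xb in enumerate(freq_b)
    )


def pairwise_difference_distances(x1, x2, delta):
    """Return (D, D_A): raw and net mean numbers of differences between populations."""
    pi_12 = mean_pairwise_differences(x1, x2, delta)
    pi_1 = mean_pairwise_differences(x1, x1, delta)
    pi_2 = mean_pairwise_differences(x2, x2, delta)
    return pi_12, pi_12 - (pi_1 + pi_2) / 2.0
```

Subtracting the average within-population diversity removes the part of the between-population differences that was already present in the ancestral population.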
$\theta_0 = 4N_0u$ for diploid populations), as well as the relative sizes ($k$ and $[1-k]$) of the two daughter populations. The estimated parameters result from the numerical resolution of a system of three nonlinear equations with three unknowns, based on the Broyden method (Press et al. 1992, p. 389). The significance of the parameters is tested by a permutation procedure similar to that used in AMOVA. Under the hypothesis that the two populations are undifferentiated, we permute individuals between samples and re-estimate the three parameters, in order to obtain their empirical null distribution. The percentile value of each of the three statistics is the proportion of permuted cases that produce statistics larger than or equal to those observed, and thus gives its position under the null hypothesis of no differentiation. The values of the estimated parameters should be interpreted with caution. The procedure we have implemented is based on the comparison of intra- and inter-population diversities ($\pi$'s), which have a large variance; for short divergence times, the average diversity found within populations could therefore be larger than that observed between populations. This situation could lead to negative divergence times and to daughter population relative sizes larger than one or smaller than zero (negative values). Large departures from the assumed pure-fission model could also lead to observed diversities that yield aberrant estimators of divergence time and relative population sizes. One should thus make those computations only if the assumptions of a pure-fission model are met and if the divergence time is relatively old. Simulation results have shown that this procedure leads to better results than other methods that do not take unequal population sizes into account when the relative sizes of daughter populations are indeed unequal. According to our simulations (Table 4 in Gaggiotti and Excoffier 2000), conventional methods such as those described above lead to better results for equal population
size ($k = 0.5$) and short divergence times ($T/N_0 < 0.5$). However, the fact that the present method leads to clearly aberrant results in some cases is not necessarily a drawback: it has the advantage of drawing the user's attention to the fact that some care has to be taken with the interpretation of the results. Other estimators that would be grossly biased, but whose values would be kept within reasonable bounds, would often lead to misinterpretations. Note that the numerical method we have used to solve the system of equations may sometimes fail to converge. An asterisk in the result file indicates those cases, which should be discarded because of convergence failure.
output tables can be used to represent log-log plots of genotype likelihoods for pairs of populations (see Paetkau et al. 1997 and Waser and Strobeck 1998), to identify those genotypes that seem better explained by belonging to a population other than the one from which they were sampled. For instance, we have plotted on this graph the log-likelihood of individuals sampled in Algeria (white circles) for two HLA class II loci versus those of Senegalese Mandenka individuals (black diamonds). The overlap of the two distributions suggests that two loci are not enough to provide a clear-cut separation between these two populations. One also sees that there is at least one Mandenka individual whose genotype would be much better explained if it came from the Algerian population than if it came from Eastern Senegal. Note that interpreting these results in terms of gene flow is difficult and hazardous.

[Figure: log-log plot of genotype likelihoods for Algerian (white circles) and Senegalese Mandenka (black diamonds) individuals; both axes range from -30 to 0.]
$$r_{XY} = \frac{SP(X, Y)}{\sqrt{SS(X) \cdot SS(Y)}},$$
the ratio of the cross product of X and Y over the square root of the product of sums of squares. We note that the denominator of the above equation is insensitive to
permutation, such that only the numerator will change upon permutation of rows and columns. Upon closer examination, it can be shown that the only quantity that will actually change between permutations is the Hadamard product of the two matrices noted as
$$Z_{XY} = X * Y = \sum_{i} \sum_{j} x_{ij}\, y_{ij}$$
which is the only variable term involved in the computation of the cross-product. The Mantel testing procedure applied to two matrices will then consist in computing the quantity $Z_{XY}$ from the original matrices, permuting the rows and columns of one matrix while keeping the other constant, and each time recomputing the quantity $Z^*_{XY}$ and comparing it to the original $Z_{XY}$ value (Smouse et al. 1986). In the case of three matrices, say Y, X1 and X2, the procedure is very similar. The partial correlation coefficients are obtained from the pairwise correlations as

$$r_{YX_1 \cdot X_2} = \frac{r_{YX_1} - r_{X_1X_2}\, r_{YX_2}}{\sqrt{(1 - r_{X_1X_2}^2)(1 - r_{YX_2}^2)}}$$
The other relevant partial correlations can be obtained similarly (see e.g. Sokal and Rohlf 1981). The significance of the partial correlations is tested by keeping one matrix constant and permuting the rows and columns of the other two matrices, recomputing each time the new partial correlations and comparing them to the observed values (Smouse et al. 1986). Applications of the Mantel test in anthropology and genetics can be found in Smouse and Long (1992).
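The two-matrix procedure and the partial correlation formula can be sketched as follows. This is an illustrative implementation under stated assumptions, not Arlequin's code: matrices are square symmetric lists of lists, only the lower triangle is used, the test is one-sided (large $Z_{XY}$), and all function names are hypothetical.

```python
import math
import random


def lower_triangle(m):
    """Flatten the strictly lower-diagonal part of a square matrix."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]


def pearson(a, b):
    """Correlation: cross product over the square root of the product of sums of squares."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sp = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return sp / math.sqrt(ss_a * ss_b)


def permute_matrix(m, order):
    """Apply the same permutation to the rows and the columns of a matrix."""
    size = len(m)
    return [[m[order[i]][order[j]] for j in range(size)] for i in range(size)]


def mantel_test(x, y, n_perm=1000, seed=0):
    """Return (r_XY, P value) where P is the share of permutations with Z* >= Z."""
    rng = random.Random(seed)
    x_low, y_low = lower_triangle(x), lower_triangle(y)
    z_obs = sum(a * b for a, b in zip(x_low, y_low))
    order = list(range(len(x)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)                      # permute rows and columns of Y only
        y_perm = lower_triangle(permute_matrix(y, order))
        z_star = sum(a * b for a, b in zip(x_low, y_perm))
        if z_star >= z_obs - 1e-12:             # small tolerance for rounding
            hits += 1
    return pearson(x_low, y_low), hits / n_perm


def partial_correlation(r_yx1, r_yx2, r_x1x2):
    """Partial correlation of Y and X1, holding X2 constant."""
    return (r_yx1 - r_x1x2 * r_yx2) / math.sqrt((1 - r_x1x2 ** 2) * (1 - r_yx2 ** 2))
```

Only $Z_{XY}$ is recomputed inside the loop, since the sums of squares in the denominator of the correlation do not change when rows and columns are permuted.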
8 REFERENCES
Abramowitz, M., and I. A. Stegun, 1970 Handbook of Mathematical Functions. Dover, New York.
Aris-Brosou, S., and L. Excoffier, 1996 The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism. Mol. Biol. Evol. 13: 494-504.
Cavalli-Sforza, L. L., and W. F. Bodmer, 1971 The Genetics of Human Populations. W.H. Freeman and Co., San Francisco, CA.
Chakraborty, R. 1990 Mitochondrial DNA polymorphism reveals hidden heterogeneity within some Asian populations. Am. J. Hum. Genet. 47: 87-94.
Chakraborty, R., and K. M. Weiss, 1991 Genetic variation of the mitochondrial DNA genome in American Indians is at mutation-drift equilibrium. Am. J. Hum. Genet. 86: 497-506.
Cockerham, C. C., 1969 Variance of gene frequencies. Evolution 23: 72-83.
Cockerham, C. C., 1973 Analysis of gene frequencies. Genetics 74: 679-700.
Davies N, Villablanca FX and Roderick GK, 1999. Determining the source of individuals: multilocus genotyping in nonequilibrium population genetics. TREE 14: 17-21.
Dempster, A., N. Laird and D. Rubin, 1977 Maximum likelihood estimation from incomplete data via the EM algorithm. J Roy Statist Soc 39: 1-38.
Efron, B. 1982 The Jackknife, the Bootstrap and other Resampling Plans. Regional Conference Series in Applied Mathematics, Philadelphia.
Efron, B., and R. J. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Ewens, W.J. 1972 The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87-112.
Ewens, W.J. 1977. Population genetics theory in relation to the neutralist-selectionist controversy. In: Advances in human genetics, edited by Harris, H. and Hirschhorn, K. New York: Plenum Press, p. 67-134.
Excoffier L. 2003. Analysis of Population Subdivision. In: Balding D, Bishop M, Cannings C, editors. Handbook of Statistical Genetics, 2nd Edition. New York: John Wiley & Sons, Ltd. pp. 713-750.
Excoffier L. 2004. Patterns of DNA sequence diversity and genetic structure after a range expansion: lessons from the infinite-island model. Mol Ecol 13(4): 853-864.
Excoffier, L., Smouse, P., and Quattro, J. 1992 Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131: 479-491.
Excoffier, L., and P. Smouse, 1994. Using allele frequencies and geographic subdivision to reconstruct gene genealogies within a species. Molecular variance parsimony. Genetics 136: 343-59.
Excoffier, L. and M. Slatkin. 1995 Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921-927.
Excoffier, L., and M. Slatkin, 1998 Incorporating genotypes of relatives into a test of linkage disequilibrium. Am. J. Hum. Genet. 171-180.
Excoffier L, Laval G, Balding D. 2003. Gametic phase estimation over large genomic regions using an adaptive window approach. Human Genomics 1: 7-19.
Fu, Y.-X. 1997 Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915-925.
Gaggiotti, O., and L. Excoffier, 2000. A simple method of removing the effect of a bottleneck and unequal population sizes on pairwise genetic distances. Proceedings of the Royal Society London B 267: 81-87.
Goudet, J., M. Raymond, T. de Mees and F. Rousset, 1996 Testing differentiation in diploid populations. Genetics 144: 1933-1940.
Guo, S. and Thompson, E. 1992 Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48: 361-372.
Harpending, H. C., 1994 Signature of ancient population growth in a low-resolution mitochondrial DNA mismatch distribution. Hum. Biol. 66: 591-600.
Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D. J. Futuyma and J. D. Antonovics. Oxford University Press, New York.
Jin, L., and Nei M. 1990 Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7: 82-102.
Jukes, T. and Cantor, C. 1969 Evolution of protein molecules. In: Mammalian Protein Metabolism, edited by Munro HN, New York: Academic Press, p. 21-132.
Kimura, M. 1980 A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
Kruskal, J. B., 1956. On the shortest spanning subtree of a graph and the travelling salesman problem. Proc. Amer. Math. Soc. 7: 48-50.
Kumar, S., Tamura, K., and M. Nei. 1993 MEGA, Molecular Evolutionary Genetic Analysis ver 1.0. The Pennsylvania State University, University Park, PA 16802.
Lange, K., 1997 Mathematical and Statistical Methods for Genetic Analysis. Springer, New York.
Levene H. 1949. On a matching problem arising in genetics. Annals of Mathematical Statistics 20: 91-94.
Lewontin, R. C. 1964 The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 49-67.
Lewontin, R. C., and K. Kojima. 1960 The evolutionary dynamics of complex polymorphisms. Evolution 14: 450-472.
Li, W.H. 1977 Distribution of nucleotide differences between two randomly chosen cistrons in a finite population. Genetics 85: 331-337.
Long, J. C., 1986 The allelic correlation structure of Gainj and Kalam speaking people. I. The estimation and interpretation of Wright's F-statistics. Genetics 112: 629-647.
Mantel, N. 1967. The detection of disease clustering and a generalized regression approach. Cancer Res 27: 209-220.
Michalakis, Y. and Excoffier, L., 1996 A generic estimation of population subdivision using distances between alleles with special reference to microsatellite loci. Genetics 142: 1061-1064.
Nei, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.
Nei, M., and W. H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76: 5269-5273.
Paetkau D, Calvert W, Stirling I and Strobeck C, 1995. Microsatellite analysis of population structure in Canadian polar bears. Mol Ecol 4: 347-54.
Paetkau D, Waits LP, Clarkson PL, Craighead L and Strobeck C, 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147: 1943-1957.
Prim, R. C., 1957. Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36: 1389-1401.
Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, 1992. Numerical Recipes in C: The Art of Scientific Computing. Cambridge: Cambridge University Press.
Rannala B, and Mountain JL, 1997. Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94: 9197-9201.
Ray N, Currat M, Excoffier L. 2003. Intra-Deme Molecular Diversity in Spatially Expanding Populations. Mol Biol Evol 20(1): 76-86.
Raymond M. and F. Rousset. 1994 GenePop. ver 3.0. Institut des Sciences de l'Evolution. Université de Montpellier, France.
Raymond M. and F. Rousset. 1995 An exact test for population differentiation. Evolution 49: 1280-1283.
Reynolds, J., Weir, B.S., and Cockerham, C.C. 1983 Estimation for the coancestry coefficient: basis for a short-term genetic distance. Genetics 105: 767-779.
Rice, J.A. 1995 Mathematical Statistics and Data Analysis. 2nd ed. Duxbury Press: Belmont, CA.
Rogers, A., 1995 Genetic evidence for a Pleistocene population explosion. Evolution 49: 608-615.
Rogers, A. R., and H. Harpending, 1992 Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552-569.
Rohlf, F. J., 1973. Algorithm 76. Hierarchical clustering using the minimum spanning tree. The Computer Journal 16: 93-95.
Rousset, F., 1996 Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics 142: 1357-1362.
Rousset, F., 2000. Inferences from spatial population genetics, in Handbook of Statistical Genetics, D. Balding, M. Bishop and C. Cannings (eds.). Wiley & Sons, Ltd.
Schneider, S., and L. Excoffier. 1999. Estimation of demographic parameters from the distribution of pairwise differences when the mutation rates vary among sites: Application to human mitochondrial DNA. Genetics 152: 1079-1089.
Slatkin, M., 1991 Inbreeding coefficients and coalescence times. Genet. Res. Camb. 58: 167-175.
Slatkin, M. 1994a Linkage disequilibrium in growing and stable populations. Genetics 137: 331-336.
Slatkin, M. 1994b An exact test for neutrality based on the Ewens sampling distribution. Genet. Res. 64(1): 71-74.
Slatkin, M. 1995 A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457-462.
Slatkin, M. 1996 A correction to the exact test based on the Ewens sampling distribution. Genet. Res. 68: 259-260.
Slatkin, M. and Excoffier, L. 1996 Testing for linkage disequilibrium in genotypic data using the EM algorithm. Heredity 76: 377-383.
Smouse, P. E., and J. C. Long. 1992. Matrix correlation analysis in Anthropology and Genetics. Yearb. Phys. Anthropol. 35: 187-213.
Smouse, P. E., J. C. Long and R. R. Sokal. 1986. Multiple regression and correlation extensions of the Mantel Test of matrix correspondence. Systematic Zoology 35: 627-632.
Sokal, R. R., and F. J. Rohlf. 1981. Biometry. 2nd edition. W.H. Freeman and Co., San Francisco, CA.
Stewart, F. M. 1977 Computer algorithm for obtaining a random set of allele frequencies for a locus in an equilibrium population. Genetics 86: 482-483.
Strobeck, C. 1987 Average number of nucleotide differences in a sample from a single subpopulation: A test for population subdivision. Genetics 117: 149-153.
Tajima, F. 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.
Tajima, F. 1989a. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.
Tajima, F. 1989b. The effect of change in population size on DNA polymorphism. Genetics 123: 597-601.
Tajima, F. 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology, edited by Takahata, N. and Clark, A.G., Tokyo, Sunderland, MA: Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37-59.
Tajima, F. and Nei, M. 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1: 269-285.
Tajima, F., 1996 The amount of DNA polymorphism maintained in a finite population when the neutral mutation rate varies among sites. Genetics 143: 1457-1465.
Tamura, K., 1992 Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9: 678-687.
Tamura, K., and M. Nei, 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.
Uzzell, T., and K. W. Corbin, 1971 Fitting discrete probability distribution to evolutionary events. Science 172: 1089-1096.
Waser PM, and Strobeck C, 1998. Genetic signatures of interpopulation dispersal. TREE 43-44.
Watterson, G., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256-276.
Watterson, G. 1978. The homozygosity test of neutrality. Genetics 88: 405-417.
Watterson, G. A., 1986 The homozygosity test after a change in population size. Genetics 112: 899-907.
Weir, B. S., 1996 Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sinauer Assoc., Inc., Sunderland, MA, USA.
Weir, B.S. and Cockerham, C.C. 1984 Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.
Weir, B.S., and Hill, W.G. 2002. Estimating F-statistics. Annu Rev Genet 36: 721-750.
Wright, S., 1951 The genetical structure of populations. Ann. Eugen. 15: 323-354.
Wright, S., 1965 The interpretation of population structure by F-statistics with special regard to systems of mating. Evol 19: 395-420.
Zouros, E., 1979 Mutation rates, population sizes and amounts of electrophoretic variation of enzyme loci in natural populations. Genetics 92: 623-646.
Appendix
| Keywords | Description | Possible values |
|---|---|---|
| | The number of different samples listed in the data file | A positive integer larger than zero |
| DataType | The type of data to be analyzed (only one type of data per project file is allowed) | STANDARD, DNA, RFLP, MICROSAT, FREQUENCY |
| GenotypicData | Specifies if genotypic or gametic data is available | 0 (haplotypic data), 1 (genotypic data) |
| LocusSeparator | The character used to separate adjacent loci | WHITESPACE, TAB, NONE, or any character other than "#", or the character specifying missing data. Default: WHITESPACE |
| GameticPhase | Specifies if the gametic phase is known (for genotypic data only) | 0 (gametic phase not known), 1 (known gametic phase). Default: 1 |
| RecessiveData | Specifies whether recessive alleles are present at all loci (for genotypic data) | 0 (co-dominant data), 1 (recessive data). Default: 0 |
| RecessiveAllele | Specifies the code for the recessive allele | Any string within quotation marks. This string can be explicitly used in the input file to indicate the occurrence of a recessive homozygote at one or several loci. Default: "null" |
| MissingData | | "?" or any character within quotes, other than those previously used. Default: "?" |
| | | ABS (absolute values), REL (relative values: absolute values will be found by multiplying the relative frequencies by the sample sizes). Default: ABS |

| Keywords [Data] | Description | Possible values |
|---|---|---|
| [[HaplotypeDefinition]] | (facultative section) | |
| HaplListName | The name of a haplotype definition list | A string within quotation marks |
| HaplList | The list of haplotypes listed within braces ({...}) | A series of haplotype definitions given on separate lines for each haplotype. Each haplotype is defined by a haplotype label and a combination of alleles at different loci. The keyword EXTERN followed by a string within quotation marks may be used to specify that a given haplotype list is in a different file |

| Keywords | Description | Possible values |
|---|---|---|
| | (facultative section) | |
| | The name of the distance matrix | A string within quotation marks |
| | The size of the matrix | A positive integer larger than zero (corresponding to the number of haplotypes listed in the haplotype list) |
| LabelPosition | | ROW (the haplotype labels will be entered consecutively on one or several lines, within the MatrixData segment, before the distance matrix elements), COLUMN (the haplotype labels will be entered as the first column of each row of the distance matrix itself) |
| MatrixData | | The matrix data will be entered as a format-free lower-diagonal matrix. The haplotype labels can be either entered consecutively on one or several lines (if LabelPosition=ROW), or entered at the first column of each row (if LabelPosition=COLUMN). The special keyword EXTERN may be used followed by a file name within quotation marks, stating that the data must be read in another file |

| Keywords | Description | Possible values |
|---|---|---|
| | The name of the sample. This keyword is used to mark the beginning of a sample definition | |
| SampleSize | Specifies the sample size | An integer larger than zero. For haplotypic data, it must specify the number of gene copies in the sample. For genotypic data, it must specify the number of individuals in the sample. |
| SampleData | | The keyword EXTERN may be used followed by a file name within quotation marks, stating that the data must be read in a separate file. The SampleData keyword ends a sample definition |
| Keywords | Description | Possible values |
|---|---|---|
| | (facultative section) | |
| | The name of a given genetic structure to test | A string of characters within quotation marks |
| | The number of groups of populations | An integer larger than zero |
| | The definition of a group of samples, identified by their SampleName listed within braces ({...}) | A series of strings within quotation marks all enclosed within braces, and, if desired, on separate lines |

| Keywords | Description | Possible values |
|---|---|---|
| | (facultative section) Allows computing the (partial) correlation between YMatrix and X1 (X2). | |
| MatrixSize | The size of the matrix entered into the project | An integer larger than zero |
| YMatrix | Specifies which matrix is used as YMatrix. | "fst", "log_fst", "slatkinlinearfst", "log_slatkinlinearfst", "nm", "custom" |
| MatrixNumber | Number of matrices to be compared with the YMatrix. | 1: we compute the correlation between YMatrix and X1. 2: we compute the partial correlation between YMatrix, X1 and X2 |
| YMatrixLabels | Labels to identify the entries of the YMatrix. In case of YMatrix=fst, these labels should correspond to population names in the sample. | A series of strings within quotation marks all enclosed within braces, and, if desired, on separate lines |
| DistMatMantel | A keyword used to define a matrix, which can be either the YMatrix, or another matrix that will be compared with the YMatrix | The matrix data will be entered as a format-free lower-diagonal matrix. |
| UsedYMatrixLabels | Labels defining the sub-matrix of the YMatrix on which the correlation is computed. | A series of strings within quotation marks all enclosed within braces, and, if desired, on separate lines |