jModelTest 2 Manual v0.1.11
Contents
1 Overview
1.1 Download
1.2 Citation
1.3 Disclaimer
1.4 Last Updates
2 Getting Started
2.1 Operating Systems
2.2 Working with the repository
2.3 User interfaces
2.4 High Performance Environments
5 Global Configuration
5.1 Logging properties
5.2 PhyML binary properties
5.3 Hybrid shared-distributed memory execution properties
7 Theoretical Background
7.1 Models of nucleotide substitution
7.2 Sequential Likelihood Ratio Tests (sLRT)
7.3 Hierarchical Likelihood Ratio Tests (hLRT)
7.4 Dynamical Likelihood Ratio Tests (dLRT)
7.5 Information Criteria
7.6 Model Uncertainty
7.7 Model Averaging
7.8 Model Averaged Phylogeny
7.9 Parameter Importance
1 Overview
jModelTest is a tool to carry out statistical selection of best-fit models of nucleotide substitution. It implements
five different model selection strategies: hierarchical and dynamical likelihood ratio tests (hLRT and dLRT),
Akaike and Bayesian information criteria (AIC and BIC), and a decision theory method (DT). It also provides
estimates of model selection uncertainty, parameter importances and model-averaged parameter estimates,
including model-averaged tree topologies. jModelTest 2 includes High Performance Computing (HPC) capabilities
and additional features such as new strategies for tree optimization, model-averaged phylogenetic trees
(both topology and branch lengths), heuristic filtering and automatic logging of user activity.
In 2020, jModelTest was superseded by ModelTest-NG, available at https://github.com/ddarriba/modeltest.
1.1 Download
The main project webpage is located at GitHub: https://github.com/ddarriba/jmodeltest2.
New distributions of jModelTest will be hosted in GitHub releases.
• https://github.com/ddarriba/jmodeltest2/releases
Please use the jModelTest discussion group for any questions:
• http://groups.google.com/group/jmodeltest
1.2 Citation
When using jModelTest you should cite both of the following:
• Darriba D, Taboada GL, Doallo R, Posada D. 2012. jModelTest 2: more models, new heuristics and
parallel computing. Nature Methods 9(8): 772.
• Guindon S, Gascuel O. 2003. A simple, fast and accurate method to estimate large phylogenies by
maximum-likelihood. Systematic Biology 52: 696-704.
1.3 Disclaimer
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. This program is distributed in the
hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU
General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA. The jModelTest distribution includes Phyml executables.
These programs are protected by their own license and conditions, and using jModelTest implies agreeing with those conditions as
well.
1.4 Last Updates
• 3 Mar 2016 Version 2.1.10 Revision 20160303
– Fixed bug with sequences where the 8-char name prefixes are equal
– Added warning when the logging is disabled on runtime
– Added win32 PhyML binary version to compatibility list
• 15 Jan 2016 - Version 2.1.9
– Added automatic search for PhyML binary in /usr/bin
– Removed non-ASCII characters
– Disable logging if writing is not possible
– Merge GUI images into jarfile
• 20 Oct 2015 - Version 2.1.8
– Removed ReadSeq dependency
– Fixed warnings
– Updated prottest jarfile to v3.4
• 20 Feb 2015 - Version 2.1.7
– Fixed bug in ML tree search operation. The console version was using NNI moves instead of “BEST” by default.
• 20 Nov 2014 - Version 2.1.7
– Fixed bug with special characters in paths
– Added initial check of PhyML binaries
– Added notification in case AICc produces negative values
• 06 Aug 2014 - Version 2.1.6
– Added confirmation window when cancelling running jobs in the GUI
– Added automatic checkpointing files generation
– Added “-ckp” argument for loading checkpointing files
• 05 Apr 2014 - Version 2.1.5
– Updated OS X binary
– Fixed bug with computation of JC model for “fixed” topology
– Fixed bug with DT criterion computation
– Added “-n” argument for naming executions (the name is included in the log filenames)
– Added “-getphylip” argument for converting alignments into PHYLIP format with ALTER
– Fixed bug in PhyML logging in GUI. Added a unique ID for every model in the log file
– Added PAUP* block into log files if required (“-w” argument)
– Added more verbose error messages
• 10 Jul 2013 - Version 2.1.4
– Added phyml auto-logging.
– Added phyml command lines for best-fit models.
– Added phyml log tab in the GUI.
– Removed sample size modes (and “-n” argument). Sample size is fixed to alignment size.
– Fixed bug with relative paths when calling from a different path.
– Fixed typos in the GUI.
• 05 Mar 2013 - Version 2.1.3
– Fixed bug with PAUP* command block.
– Added the possibility to change the Information Criterion used with the clustering algorithm for the 203 matrices.
– Changed “-o” argument for the hypothesis order into “-O”
– Added “-o” argument for forwarding the standard output to a file: -o FILENAME
• 01 Jan 2013 Version 2.1.2 - Revision 20130103
– Fixed bug in paths with whitespaces.
– Updated PhyML binaries.
• 31 Jul 2012 Version 2.1.1 - Revision 20120731
– Fixed bug with hLRT selection when attempting to use a user-defined topology.
• 11 Mar 2012 Version 2.1 - Revision 20120511
– Major updates:
* Exhaustive GTR submodels: All the 203 different partitions of the GTR rate matrix can be included in the candidate set
of models. When combined with rate variation (+I,+G, +I+G) and equal/unequal base frequencies the total number of
possible models is 203 x 8 = 1624.
* Hill climbing hierarchical clustering: Calculating the likelihood score for a large number of models can be extremely
time-consuming. This hill-climbing algorithm implements a hierarchical clustering to search for the best-fit models
within the full set of 1624 models, but optimizing at most 288 models while maintaining model selection accuracy.
* Heuristic filtering: Heuristic reduction of the candidate models set based on a similarity filtering threshold among the
GTR rates and the estimates of among-site rate variation.
* Absolute model fit: Information criterion distances can be calculated for the best-fit model against the unconstrained
multinomial model (based on site pattern frequencies). This is computed by default when the alignment does not
contain missing data/ambiguities, but can also be approximated otherwise.
* Topological summary: Tree topologies supported by the different candidate models are summarized in the html log,
including confidence intervals constructed from cumulative models weights, plus Robinson-Foulds and Euclidean dis-
tances to the best-fit tree for each.
– Minor updates:
* Corrected a bug in the fixed BIONJ-JC starting topology. F81+I+G was executed instead of JC.
* “Best” is now the default tree search operation instead of NNI. “Best” computes both NNI and SPR algorithms and
selects the best of them.
* User can select the number of threads from GUI.
• 1 Feb 2012 - Version 2.0.2
– Added conf/jmodeltest.conf file, where you can:
* Enable/disable the automatic logging: you might be running a huge dataset and not want to generate hundreds or
thousands of log files.
* Set the PhyML binaries location: if you already have PhyML installed on your machine, you can set up jModelTest to
use your own binaries.
– Enhanced the html log output.
2 Getting Started
2.1 Operating Systems
Since jModelTest is a Java application, it can be used on any OS that can run a Java Runtime Environment
(JRE). The most common operating systems, and many others, include a JRE (OpenJDK, Sun JRE, ...), or at least
make it possible to download one. However, jModelTest depends on third-party binaries (PhyML), which are
distributed for Windows, Linux and OS X. It is also possible to download the PhyML sources (http://code.google.com/p/phym
and compile them for a particular architecture.
1. Execute the script for the Graphical User Interface (runjmodeltest-gui.sh). The main jModelTest frame
should pop up on the screen:
3. Go to Analysis/Compute Likelihood Scores and select the candidate models and the options for model
optimization (optionally you can set a base topology from a file). Press Enter or the Compute Likeli-
hoods button.
4. Perform statistical selection among the optimized models. For example, we can calculate the Bayesian
Information Criterion using Analysis/Do BIC calculations... option, or any other. You can find a Criteria
comparison in terms of accuracy in the supplementary material of the jModelTest publication.
6. Build a consensus tree from a given selection criterion using Analysis/Model-averaged phylogeny:
7. Finally, you can save the results displayed in the main console using Edit/Save console. Alternatively,
you can get a formatted HTML document using Results/Build HTML log:
Take a look at Section 3 for further information.
This will test all 88 models (gamma models with 4 rate categories), and then perform the model selection
using Akaike (AIC) and Bayesian (BIC) criteria, calculating also a model averaged phylogeny (-a).
See Section 4 for information about supported arguments.
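The model counts quoted in this manual (24, 88 and 1624) can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming that each enabled option (+F, +I, +G) doubles the candidate set because every substitution scheme is evaluated with and without it. The function name `candidate_model_count` is hypothetical and is not part of jModelTest.

```python
def candidate_model_count(schemes, unequal_freqs=True, inv_sites=True, gamma=True):
    """Return the size of the candidate set for a number of substitution schemes.

    Assumption: every enabled option (+F, +I, +G) doubles the set, since each
    scheme is considered both with and without that option.
    """
    count = schemes
    for enabled in (unequal_freqs, inv_sites, gamma):
        if enabled:
            count *= 2
    return count


# Schemes supported by the -s argument that are mentioned in this manual
for schemes in (3, 11, 203):
    print(schemes, "schemes ->", candidate_model_count(schemes), "models")
```

With all three options enabled this gives 24 models for 3 schemes, 88 for 11 schemes and 1624 for 203 schemes, matching the figures used in the text.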
2. This will generate the following output:
(a) Header:
----------------------------- jModeltest 2.0 -----------------------------
(c) 2011-onwards Diego Darriba, David Posada,
Department of Biochemistry, Genetics and Immunology
University of Vigo, 36310 Vigo, Spain. e-mail: ddarriba@udc.es, dposada@uvigo.es
--------------------------------------------------------------------------------
Wed Oct 05 12:56:47 CEST 2011
Linux 2.6.38-11-generic-pae, arch: i386, bits: 32, numcores: 2
(b) Execution options:
---------------------------------------------------------------
*                                                             *
*        COMPUTATION OF LIKELIHOOD SCORES WITH PHYML          *
*                                                             *
---------------------------------------------------------------
::Settings::
 Phyml version = 3.0
 Phyml binary = PhyML_3.0_linux32
 Candidate models = 24
  number of substitution schemes = 3
  including models with equal/unequal base frequencies (+F)
  including models with/without a proportion of invariable sites (+I)
  including models with/without rate variation among sites (+G) (nCat = 4)
 Optimized free parameters (K) = substitution parameters + 9 branch lengths + topology
 Base tree for likelihood calculations = ML tree
 Tree topology search operation = NNI
computing likelihood scores for 24 models with Phyml 3.0
::Progress::
Model = JC
 partition = 000000
 -lnL = 1114.9772
 K = 10
Model = JC+I
 partition = 000000
 -lnL = 1103.1113
 K = 11
 p-inv = 0.9080
...
Model = GTR+I+G
 partition = 012345
 -lnL = 1051.8403
 K = 20
 freqA = 0.4235
 freqC = 0.1520
 freqG = 0.2022
 freqT = 0.2224
 R(a) [AC] = 0.8709
 R(b) [AG] = 0.4152
 R(c) [AT] = 0.6049
 R(d) [CG] = 1.2523
 R(e) [CT] = 0.9482
 R(f) [GT] = 1.0000
 p-inv = 0.5940
 gamma shape = 0.0120
Computation of likelihood scores completed. It took 00h:00:07:05.
(e) Selected Information Criteria (best model and all models sorted according to each criterion):
---------------------------------------------------------------
*                                                             *
*             AKAIKE INFORMATION CRITERION (AIC)              *
*                                                             *
---------------------------------------------------------------
Model selected:
 Model = F81+I
 partition = 000000
 -lnL = 1053.5428
 K = 14
 freqA = 0.4200
 freqC = 0.1558
 freqG = 0.2015
 freqT = 0.2227
 p-inv = 0.9030
(f) Consensus tree of the optimized phylogenies using the criterion weights:
---------------------------------------------------------------
*                                                             *
*                  MODEL AVERAGED PHYLOGENY                   *
*                                                             *
---------------------------------------------------------------
Selection criterion: . . . . AIC
Confidence interval: . . . . 1.00
Consensus type: . . . . . . . 50% majority rule
Bipartitions included in the consensus tree
 123456
 ****** ( 1.0 )
 ****-- ( 1.0 )
 **---- ( 0.94244 )
 --**-- ( 1.0 )
                        +-----------6 P4
                      +-8
                      | +----------------5 P5
+---------------------9
|                     |  +-4 P1
|                     +--7
|                        +----------3 P6
|
+------2 P2
|
+---------------------------1 P3
(P3:0.016613,P2:0.004598,((P6:0.006790,P1:0.000000)1.00:0.002046,(P5:0.010191,P4:0.007198)0.94:0.001510)1.00:0.012665);
$ export JMODELTEST_HOME=[path to jModelTest]
$ cd $JMODELTEST_HOME
$ tar zvxf mpj.tar.gz
$ export MPJ_HOME=$JMODELTEST_HOME/mpj
$ export PATH=$MPJ_HOME/bin:$PATH
$ cp $JMODELTEST_HOME/extra/machines $JMODELTEST_HOME
You can also add the export lines to ~/.bashrc to set these variables automatically at console startup.
2. The $JMODELTEST_HOME/machines file contains the set of computing nodes where the MPJ processes will
be executed. By default it points to the localhost machine, so you should change it if you want to run a
parallel execution over a cluster, writing one computing node per line (e.g.
see filecluster8.conf.template).
3. Start the MPJ Express daemons:
$ mpjboot machines
The application “mpjboot” should be in the execution path (it is located at $MPJ_HOME/bin). An SSH
service must be running on the machines listed in the machines file. Moreover, port 10000 should be free.
For more details refer to the MPJ Express documentation.
4. Run jModelTest. For this, the jModelTest distribution provides a bash script: ’runjmodeltest-cluster.sh’
The basic syntax is:
./runjmodeltest-cluster.sh $NUMBER_OF_PROCESSORS $APPLICATION_PARAMETERS
$ ./runjmodeltest-cluster.sh 2 -d example-data/aP6.fas -s 11 -i -g 4 -f -AIC -a
3 Graphical User Interface
The main distribution includes a script for launching the interface, runjmodeltest-gui.sh, located under the
jModelTest home folder. Another possibility is to run the following command line:
$ java -jar jModelTest.jar
Moreover, in Windows and MacOS X, it is often possible to double-click the jModelTest.jar file to launch
the graphical interface.
3.2 Menu description
Menu      Submenu                    Description                                   Enabled
File      Load alignment             Load an input alignment
          Load checkpoint file       Load a previous snapshot (a)                  (i)
          Quit                       Exit the program
Analysis  Compute likelihood scores  Optimize the set of candidate models          (i)
          Do AIC calculations        Calculate Akaike Information Criterion        (ii)
          Do BIC calculations        Calculate Bayesian Information Criterion      (ii)
          Do DT calculations         Calculate Decision Theory                     (ii)
          Do hLRT calculations       Calculate hierarchical likelihood ratio test  (ii) (b)
          Model-averaged phylogeny   Calculate the consensus tree                  (iii & iv)
Results   Show results table         Show a table with the selection results       (ii)
          Build HTML log             Create an HTML webpage with the results       (ii)
Tools     LRT calculator             Likelihood Ratio Test for nested models
(i) After loading an alignment (ii) After computing the likelihood scores (iii) If the base tree is not fixed (iv) After calculating
an Information Criterion
(a) See Section 6.3
(b) This test is only available for 3, 5, 7 and 11 substitution schemes and for fixed topologies (fixed BIONJ-JC tree or user-defined topology)
4 Command Line Arguments
• -a
Estimate model-averaged phylogeny for each active criterion. See Section 7.8 for more details.
• -AIC
Calculate the Akaike Information Criterion. See Section 7.5.1.
• -AICc
Calculate the corrected Akaike Information Criterion. See Section 7.5.1.
• -BIC
Calculate the Bayesian Information Criterion. See Section 7.5.2.
• -DT
Calculate the decision theory criterion. See Section 7.5.3.
• -c confidenceInterval
Sets the confidence interval for the model selection process (default is 100).
• -d inputFile
Sets the input data file. jModelTest makes use of the ALTER library for converting several alignment
formats to PHYLIP.
• -dLRT
Perform dynamical likelihood ratio tests. See Section 7.4 for more details.
• -f
Include models with unequal base frequencies.
• -g numberOfRateCategories
Include models with rate variation among sites and sets the number of categories. Usually 4 categories
are enough.
• -getPhylip
Converts the input file into phylip format and exits. For example, the following command will generate
a new PHYLIP file named “input.nex.phy”.
$ java -jar jModelTest.jar -d input.nex -getPhylip
• -G threshold
Heuristic search. Requires a threshold > 0 (e.g., -G 0.1)
• -h confidenceInterval
Sets the confidence level for the hLRTs (default is 0.01)
• -help
Displays a help message
• -hLRT
Perform hierarchical likelihood ratio tests. See Section 7.3 for more details.
• -H
Information criterion for clustering search (AIC, AICc, BIC). (e.g., -H AIC) (default is BIC)
• -i
Include models with a proportion of invariable sites.
• -machinesfile machinesFile
Gets the processors per host from a machines file (for MPI execution).
• -n logSuffix
Execution name appended to the log filenames. By default, current time is used: yyyyMMddhhmmss.
• -o outputFile
Redirects the output to a file.
• -O ftvwxgp
Sets the hypothesis order for the hLRTs (e.g., -hLRT -O gpftv) (default is ftvwxgp)
– f frequencies
– t transition/transversion ratio
– v 2ti4tv for subst=3 / 2ti for subst>3
– w 2tv
– x 4tv
– g gamma
– p proportion of invariable sites
• -p
Calculate the parameter importances. See Section 7.9.
• -r
Backward selection for the hLRT (default is forward).
• -s 3|5|7|11|203
Sets the number of substitution schemes.
• -S NNI|SPR|BEST
Defines the tree topology search operation option for Maximum-Likelihood search:
• --set-local-config configFile
Allows the user to set a local configuration file in replacement of conf/jmodeltest.conf. See Section 5 for
more details.
• --set-property propertyName=propertyValue
Allows the user to set a specific value for a property, replacing the existing value in conf/jmodeltest.conf.
See Section 5 for more details.
e.g., --set-property log-dir=myHome/myLogDirectory
• -t fixed|BIONJ|ML
Base tree for likelihood calculations (e.g., -t BIONJ):
• -tr numberOfThreads
Number of threads to execute (default is the number of logical processors in the machine).
• -u treeFile
Fixed tree for likelihood calculations defined by the user. If a user tree is defined with this command, -t
argument is ignored.
• -uLnL
Calculate delta AIC,AICc,BIC against unconstrained likelihood.
• -v
Do model averaging and parameter importances. See Section 7.7.
• -w
Prints out the PAUP block.
• -z
Strict consensus type for model-averaged phylogeny (default is majority rule). See Section 7.8.
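When jModelTest runs are launched from scripts, it can help to assemble the argument list programmatically. The sketch below only builds and prints a command line from arguments documented in this section (-d, -s, -f, -i, -g, -AIC, -BIC, -a); it does not execute anything. The helper `build_command` and its parameter names are hypothetical, and the jar location is an assumption that depends on your installation.

```python
import shlex


def build_command(alignment, schemes=11, rate_categories=4,
                  criteria=("AIC", "BIC"), model_averaged_tree=True,
                  jar="jModelTest.jar"):
    """Assemble a jModelTest argument list (hypothetical helper, not part of jModelTest)."""
    args = ["java", "-jar", jar,
            "-d", alignment,              # input alignment
            "-s", str(schemes),           # number of substitution schemes
            "-f",                         # unequal base frequencies
            "-i",                         # proportion of invariable sites
            "-g", str(rate_categories)]   # rate variation, number of categories
    args += ["-" + name for name in criteria]
    if model_averaged_tree:
        args.append("-a")                 # model-averaged phylogeny
    return args


cmd = build_command("example-data/aP6.fas")
print(shlex.join(cmd))
```

Keeping the arguments as a list rather than a single string avoids quoting problems with paths that contain whitespace.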
5 Global Configuration
jModelTest contains some global configuration parameters in the file conf/jmodeltest.conf. In case you are
sharing the jModelTest distribution between multiple users, it is possible to set a local configuration file of
your own using the --set-local-config argument on the command line. You can also change one or several
properties by using the --set-property argument. See Section 4 for a reference on these arguments.
For example:
#######################################
# jModelTest Configuration properties #
#######################################
##########################################################
# #
# Automatic Logging #
# #
# If html-logging is "enabled", every time the user runs #
# jModelTest, a new html log file will be created in the #
# log directory. #
# If phyml-logging is "enabled", PhyML streams are saved #
# Default log directory is $JMODELTEST_HOME/log, but can #
# be modified using the log-dir property. #
# #
##########################################################
checkpointing = enabled
html-logging = enabled
phyml-logging = enabled
log-dir = log
##########################################################
# #
# Phyml Binaries path #
# #
# By default, jModelTest will search for the PhyML #
# executables in $JMODELTEST_HOME/exe/phyml. User can #
# define a different path, whether absolute (starting #
# with ’/’ or ’C:\’) or relative to $JMODELTEST_HOME #
# directory using exe-dir property. #
# #
# If a usable version of PhyML is installed system-wide #
# (for example, from the Ubuntu/Debian repositories), #
# the user can set ’global-phyml-exe’ property to true #
# and jModelTest will use the global binary instead of #
# local ones. #
# #
##########################################################
global-phyml-exe = false
exe-dir = exe/phyml
##########################################################
# #
# Thread Scheduling Configuration #
# #
# Properties below are specific properties for the #
# thread scheduling behavior. Those are the default #
# number of threads for executing each sort of model. #
# #
# If the specified number of threads is higher than the #
# total number of cores in the machine, the whole #
# machine will be used for those models. #
# #
##########################################################
gamma-threads = 4
inv-threads = 2
uniform-threads = 1
• PhyML logs. The output of PhyML for every model optimization. These files are useful when an error
occurs during the model optimization.
• HTML logs. The results of the model selection in html format. This provides an easy visualization of the
results.
• checkpoint files. Snapshots are stored during the execution, and these can be used for restoring a previ-
ous run at the last stable point.
Using these properties, each log can be enabled or disabled independently. If log-dir is not set, all logs are
disabled regardless of their individual values. Thus, for example, a system administrator could comment out that
property in the configuration file, and users can set their own log directory on the command line:
checkpointing = enabled
html-logging = enabled
phyml-logging = enabled
# log-dir = log <-- This line is commented
Now setting the custom log directory will generate all the log files there, as long as those are enabled in the
configuration file. Otherwise, no log will be generated.
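The interaction between log-dir and the individual logging switches can be illustrated with a small sketch. It assumes the configuration file uses plain `key = value` lines with `#` comments, as in the listing above, and it only mirrors the rule described in the text; it is not jModelTest's own parser. The function names `parse_properties` and `active_logs` are hypothetical.

```python
def parse_properties(text):
    """Parse 'key = value' lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments (assumes no '#' in values)
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def active_logs(props):
    """Return the logs that would be written under the rule described above."""
    if "log-dir" not in props:                 # no log directory: every log is off
        return []
    names = ("checkpointing", "html-logging", "phyml-logging")
    return [name for name in names if props.get(name) == "enabled"]


SAMPLE = """
checkpointing = enabled
html-logging = enabled
phyml-logging = disabled
# log-dir = log
"""

props = parse_properties(SAMPLE)
print(active_logs(props))                            # log-dir commented out
print(active_logs({**props, "log-dir": "myLogs"}))   # user supplies a log directory
```

The second call stands in for a user passing a log directory on the command line: only the logs still marked as enabled in the configuration are produced.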
TIP: If the global PhyML executable is not named “phyml”, a workaround is to create a symbolic link
with that name in the binaries directory. For example:
$ cd $JMODELTEST_HOME/exe/phyml
$ ln -s `which $MY_PHYML_GLOBAL_EXECUTABLE` phyml
5.3 Hybrid shared-distributed memory execution properties
The following properties are used only with the hybrid memory parallel execution. The dynamic shared
memory scheduler can use a different number of threads for model optimization depending on the model
parameters. For example, models with rate heterogeneity will take a longer time to optimize the parameters.
This way, assigning a different number of threads used for the parallel optimization of each model will improve
the efficiency by minimizing the parallel overhead.
Note that to enable the hybrid memory parallelization you need to build your own PhyML binaries by
applying a patch directly to the PhyML source code and compiling it for your system. You can ask us about it.
The default values work fine for a hybrid execution using 8-core and 16-core nodes.
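As an illustration of the scheduling idea, the sketch below picks a thread count per model from the three properties shown in the configuration file (gamma-threads, inv-threads, uniform-threads) and caps it at the number of cores, as the configuration comments describe. It is not the scheduler's actual code. In particular, treating +I+G models as gamma models is an assumption, and `threads_for_model` is a hypothetical name.

```python
def threads_for_model(model_name, total_cores,
                      gamma_threads=4, inv_threads=2, uniform_threads=1):
    """Pick a thread count for one model (illustrative sketch only).

    Assumption: a model with +G uses gamma_threads even if it also has +I.
    """
    if "+G" in model_name:
        wanted = gamma_threads
    elif "+I" in model_name:
        wanted = inv_threads
    else:
        wanted = uniform_threads
    # Requests above the core count fall back to the whole machine
    return min(wanted, total_cores)


for model in ("JC", "JC+I", "HKY+G", "GTR+I+G"):
    print(model, threads_for_model(model, total_cores=8))
```

Giving more threads to the slower rate-heterogeneity models and fewer to the fast ones is what keeps the parallel overhead low for the cheap optimizations.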
6 Common Use Cases
6.1 Converting Alignment Files
jModelTest accepts several input alignment file formats. However, it makes use of the ALTER library for
converting them into PHYLIP format, accepted by PhyML. If you want to validate your alignment, you can
convert it into PHYLIP format using the “-getPhylip” argument. It will generate a new file appending “.phy”
to the input alignment filename, and exit afterwards.
$ java -jar jModelTest.jar -d example-data/aP6.fas -getPhylip
In case there is something wrong in the input file, it will exit with the description of the error.
Note that, by default, jModelTest uses Maximum-Likelihood topologies as the base trees for the model
optimization, and checks both NNI and SPR algorithms for the topology search. This gives the most accurate
results, but it is also the most time-consuming operation. Depending on the size of the input alignment, one can
select a single algorithm to save computation time. As a general rule, the NNI algorithm works better for a small
number of taxa, whereas SPR is more suitable for a large number of taxa. The tree
search operation can be set with the “-S” argument (e.g., -t ML -S NNI).
If no execution name was provided, it is automatically generated from the current date and time
with the following format: yyyyMMddhhmmss. For example, if the current time is 17:05:00 on August 3, 2014, the
execution name is 20140803170500 and the generated checkpointing file is:
log/[sequenceFileName].20140803170500.ckp
When using the GUI instead of the command console interface, the checkpointing file can be loaded using
the menu item “File/Load checkpoint file”, which becomes enabled right after loading the alignment.
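The naming rule above can be reproduced as follows when you need to locate a checkpoint file from a script. This is a minimal sketch: the example in the text shows a 24-hour value (17), so the sketch uses the 24-hour `%H` code, and the helper `checkpoint_path` as well as the sample file name `aP6.fas` are assumptions for illustration.

```python
from datetime import datetime


def checkpoint_path(sequence_file_name, when, log_dir="log"):
    """Build the checkpoint file name described above (hypothetical helper)."""
    execution_name = when.strftime("%Y%m%d%H%M%S")   # e.g. 20140803170500
    return f"{log_dir}/{sequence_file_name}.{execution_name}.ckp"


# 17:05:00 on August 3, 2014, as in the example above
print(checkpoint_path("aP6.fas", datetime(2014, 8, 3, 17, 5, 0)))
```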
From the GUI, one can choose the number of substitution schemes in the execution
settings window.
7 Theoretical Background
All phylogenetic methods make assumptions, whether explicit or implicit, about the process of DNA substitution
[Felsenstein, 1988]. Consequently, all the methods of phylogenetic inference depend on their underlying
substitution models. To have confidence in inferences it is necessary to have confidence in the models
[Goldman, 1993b]. Because of this, it makes sense to justify the use of a particular model. Statistical model selection
is one way of doing this. For a review of model selection in phylogenetics see Sullivan and Joyce [2005] and
Johnson and Omland [2003]. The strategies implemented in jModelTest include sequential likelihood ratio tests
(LRTs), Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) and performance-based
decision theory (DT).
Table 1: Named substitution models in jModelTest 2 (a few of the 1624 possible). Any of these models can include
invariable sites (+I), rate variation among sites (+G), or both (+I+G).
Model      Reference                 Free param.  Base freq.  Substitution rates   Substitution code
JC         [Jukes and Cantor, 1969]  0            equal       AC=AG=AT=CG=CT=GT    000000
F81        [Felsenstein, 1981]       3            unequal     AC=AG=AT=CG=CT=GT    000000
K80        [Kimura, 1980]            1            equal       AC=AT=CG=GT;AG=CT    010010
HKY        [Hasegawa et al., 1985]   4            unequal     AC=AT=CG=GT;AG=CT    010010
TrNef      [Tamura and Nei, 1993]    2            equal       AC=AT=CG=GT;AG;CT    010020
TrN        [Tamura and Nei, 1993]    5            unequal     AC=AT=CG=GT;AG;CT    010020
TPM1 =K81  [Kimura, 1981]            2            equal       AC=GT;AG=CT;AT=CG    012210
TPM1uf     [Kimura, 1981]            5            unequal     AC=GT;AG=CT;AT=CG    012210
TPM2                                 2            equal       AC=AT;CG=GT;AG=CT    010212
TPM2uf                               5            unequal     AC=AT;CG=GT;AG=CT    010212
TPM3                                 2            equal       AC=CG;AT=GT;AG=CT    012012
TPM3uf                               5            unequal     AC=CG;AT=GT;AG=CT    012012
TIM1ef     [Posada, 2003]            3            equal       AC=GT;AT=CG;AG;CT    012230
TIM1       [Posada, 2003]            6            unequal     AC=GT;AT=CG;AG;CT    012230
TIM2ef                               3            equal       AC=AT;CG=GT;AG;CT    010232
TIM2                                 6            unequal     AC=AT;CG=GT;AG;CT    010232
TIM3ef                               3            equal       AC=CG;AT=GT;AG;CT    012032
TIM3                                 6            unequal     AC=CG;AT=GT;AG;CT    012032
TVMef      [Posada, 2003]            4            equal       AC;CG;AT;GT;AG=CT    012314
TVM        [Posada, 2003]            7            unequal     AC;CG;AT;GT;AG=CT    012314
SYM        [Zharkikh, 1994]          5            equal       AC;CG;AT;GT;AG;CT    012345
GTR =REV   [Tavaré, 1986]            8            unequal     AC;CG;AT;GT;AG;CT    012345
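The six-digit substitution code in Table 1 can be decoded mechanically: each digit labels the rate class of one nucleotide pair, in the order AC, AG, AT, CG, CT, GT (the order used for R(a) to R(f) in the program output). The following sketch, with hypothetical function names, recovers the rate groupings and the number of free rate parameters from a code. It also counts the ways of partitioning six rates into classes, which gives the 203 GTR submodels mentioned in this manual.

```python
PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")


def rate_classes(code):
    """Group the six substitution pairs by the digit assigned to each of them."""
    groups = {}
    for pair, digit in zip(PAIRS, code):
        groups.setdefault(digit, []).append(pair)
    return ["=".join(members) for members in groups.values()]


def free_rate_parameters(code):
    """Number of free relative rates: one class acts as the reference."""
    return len(set(code)) - 1


def count_partitions(n):
    """Number of set partitions of n items (Bell number), via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


print(rate_classes("010010"), free_rate_parameters("010010"))   # K80 / HKY
print(rate_classes("012345"), free_rate_parameters("012345"))   # SYM / GTR
print(count_partitions(6))                                       # GTR submodels
```

Adding 3 free base frequencies to the rate count gives the "Free param." column for the unequal-frequency models, for example 5 + 3 = 8 for GTR.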
7.3 Hierarchical Likelihood Ratio Tests (hLRT)
Likelihood ratio tests can be carried out sequentially by adding parameters (forward selection) to a simple
model (JC), or by removing parameters (backward selection) from a complex model (GTR+I+G) in a specific
order or hierarchy (hLRT; see Figure below). The performance of hierarchical LRTs for phylogenetic model
selection has been discussed by Posada and Buckley [2004].
Figure. Example of a particular forward hierarchy of likelihood ratio tests for 24 models. At any level the
null hypothesis (model on top) is either accepted (A) or rejected (R). In this example the model selected is
GTR+I.
Figure. Dynamical likelihood ratio tests for 24 models. At any level a hypothesis is either accepted (A)
or rejected (R). In this example the model selected is GTR+I. Hypotheses tested are: F = base frequencies; S =
substitution type; I = proportion of invariable sites; G = gamma rates.
The Akaike information criterion (AIC) [Akaike, 1974] is an asymptotically unbiased estimator of the Kullback-Leibler
information quantity [Kullback and Leibler, 1951]. We can think of the AIC as the amount of information lost
when we use a specific model to approximate the real process of molecular evolution. Therefore, the model
with the smallest AIC is preferred. The AIC is computed as:
$$AIC = -2\ell + 2K$$
where $\ell$ is the maximum log-likelihood value of the data under this model and K is the number of free
parameters in the model, including branch lengths if they were estimated de novo. When sample size (n) is
small compared to the number of parameters (say, n/K < 40) the use of a second order AIC, AICc [Hurvich and
Tsai, 1989; Sugiura, 1978], is recommended:
$$AICc = AIC + \frac{2K(K+1)}{n-K-1}$$
The AIC compares several candidate models simultaneously, it can be used to compare both nested and
non-nested models, and model-selection uncertainty can be easily quantified using the AIC differences and
Akaike weights (see Model uncertainty below). Burnham and Anderson [2003] provide an excellent introduc-
tion to the AIC and model selection in general.
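The two formulas above translate directly into code. The sketch below takes the -lnL and K values that jModelTest reports for each model (here the F81+I values from the sample output in Section 2) and computes AIC and AICc. The sample size n = 500 is an assumed value for illustration only, since the alignment length of the example is not given in this manual; the function names are hypothetical.

```python
def aic(neg_lnl, k):
    """AIC from the reported -lnL and the number of free parameters K."""
    return 2.0 * neg_lnl + 2.0 * k


def aicc(neg_lnl, k, n):
    """Second-order AIC; only defined when n - K - 1 is positive."""
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > K + 1")
    return aic(neg_lnl, k) + (2.0 * k * (k + 1)) / (n - k - 1)


neg_lnl, k = 1053.5428, 14    # F81+I in the sample output
n = 500                       # assumed sample size (alignment length)

print(round(aic(neg_lnl, k), 4))
print(round(aicc(neg_lnl, k, n), 4))
print("n/K =", round(n / k, 1), "-> AICc advisable" if n / k < 40 else "-> AIC is adequate")
```

The explicit check on n - K - 1 guards against the situation in which the correction term changes sign and produces misleading values.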
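An alternative to the use of the AIC is the Bayesian Information Criterion (BIC) [Schwarz, 1978]:
$$BIC = -2\ell + K \log(n)$$
where, as for the AIC, $\ell$ is the maximum log-likelihood, K is the number of free parameters and n is the sample size.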
Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to
selecting the model with the maximum posterior probability. Alternatively, Bayes factors for models of
molecular evolution can be calculated using reversible jump MCMC [Huelsenbeck et al., 2004]. We can easily use the
BIC instead of the AIC to calculate BIC differences or BIC weights.
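As a minimal sketch, assuming the standard Schwarz form BIC = −2l + k ln(n) with l, k and n defined as for the AIC, selection by smallest BIC could look like the following. The candidate models and their scores are made-up illustrative values, not jModelTest output.

```python
import math


def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = -2l + k * ln(n) (standard Schwarz form, assumed here)."""
    return -2.0 * log_likelihood + k * math.log(n)


# Hypothetical candidates: model name -> (log-likelihood, free parameters).
candidates = {
    "JC": (-2750.0, 17),
    "HKY": (-2700.0, 21),
    "GTR+I": (-2690.0, 26),
}
n_sites = 500  # hypothetical sample size

scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)  # smallest BIC is preferred
print(scores)
print(best)
```

Because the penalty per parameter is ln(n) rather than 2, the BIC penalizes extra parameters more heavily than the AIC whenever n is larger than about 7.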
Minin et al. [2003] developed a novel approach that selects models on the basis of their phylogenetic
performance, measured as the expected error on branch-length estimates weighted by their BIC. Under this decision
theoretic framework (DT) the best model is the one that minimizes the risk function:
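$$C_i \approx \sum_{j=1}^{R} \|\hat{B}_i - \hat{B}_j\| \, \frac{e^{-BIC_j/2}}{\sum_{r=1}^{R} e^{-BIC_r/2}}$$
where R is the number of candidate models,
$$\|\hat{B}_i - \hat{B}_j\|^2 = \sum_{l=1}^{2t-3} (\hat{B}_{il} - \hat{B}_{jl})^2$$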
and where t is the number of taxa. Indeed, simulations suggested that models selected with this criterion
result in slightly more accurate branch length estimates than those obtained under models selected by the
hLRTs [Abdo et al., 2005; Minin et al., 2003].
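The risk function can be illustrated with a short sketch. It assumes every model's branch lengths are reported for the same 2t − 3 branches in the same order, and it weights each distance by the BIC-based posterior approximation. The function name and the inputs are hypothetical; this is not the jModelTest implementation.

```python
import math


def dt_risks(branch_lengths: dict, bic_scores: dict) -> dict:
    """Return the DT risk C_i for every model i.

    branch_lengths: model name -> list of branch-length estimates
    bic_scores:     model name -> BIC value
    """
    names = list(branch_lengths)
    # Subtracting the minimum BIC avoids underflow in exp() and leaves
    # the normalized weights unchanged.
    min_bic = min(bic_scores[m] for m in names)
    raw = {m: math.exp(-(bic_scores[m] - min_bic) / 2.0) for m in names}
    total = sum(raw.values())
    posterior = {m: raw[m] / total for m in names}

    risks = {}
    for i in names:
        # Euclidean distance between branch-length vectors, weighted by
        # the approximate posterior probability of model j.
        risks[i] = sum(
            math.dist(branch_lengths[i], branch_lengths[j]) * posterior[j]
            for j in names
        )
    return risks


# Hypothetical example with t = 3 taxa, hence 2t - 3 = 3 branches.
lengths = {"A": [0.1, 0.2, 0.3], "B": [0.1, 0.2, 0.4], "C": [0.5, 0.2, 0.3]}
bics = {"A": 100.0, "B": 100.0, "C": 120.0}
risks = dt_risks(lengths, bics)
print(min(risks, key=risks.get))  # model with the smallest risk
```

A model scores well under this criterion when its branch lengths are close to those of the models that carry most of the BIC weight.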
7.6 Model Uncertainty
The AIC, Bayesian and DT methods can rank the models, allowing us to assess how confident we are in the
model selected. For these measures we could present their differences (∆). For example, for the ith model, the
AIC (BIC, DT) difference is:
∆i = AICi − min(AIC)
where min(AIC) is the smallest AIC value among all candidate models. The AIC differences are easy to
interpret and allow a quick comparison and ranking of candidate models. As a rough rule of thumb, models
having ∆i within 1-2 of the best model have substantial support and should receive consideration. Models
having ∆i within 3-7 of the best model have considerably less support, while models with ∆i > 10 have
essentially no support. Very conveniently, we can use these differences to obtain the relative AIC (BIC) weight (wi)
of each model:
$$w_i = \frac{e^{-\Delta_i/2}}{\sum_{r=1}^{R} e^{-\Delta_r/2}}$$
which can be interpreted, from a Bayesian perspective, as the probability that a model is the best
approximation to the truth given the data. The weights for all models add to 1, so we can establish an approximate
95% confidence set of models by summing the weights from largest to smallest until the sum reaches 0.95
[Burnham and Anderson, 1998, 2003]. This set can also be built stochastically (see “Model selection and
averaging” above). Note that this equation will not work for the DT (see the DT explanation in “Model selection
and averaging”).
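The differences, weights and the approximate 95% confidence set described above can be computed as in this sketch. The AIC scores are hypothetical, and the helper names are illustrative rather than part of jModelTest. The same code applies to BIC scores.

```python
import math


def differences_and_weights(scores: dict) -> tuple:
    """Return (deltas, weights) for a mapping of model name -> AIC (or BIC)."""
    best = min(scores.values())
    deltas = {m: s - best for m, s in scores.items()}
    raw = {m: math.exp(-d / 2.0) for m, d in deltas.items()}
    total = sum(raw.values())
    weights = {m: r / total for m, r in raw.items()}
    return deltas, weights


def confidence_set(weights: dict, level: float = 0.95) -> list:
    """Add models from largest to smallest weight until the sum reaches level."""
    chosen, cumulative = [], 0.0
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for model, w in ranked:
        chosen.append(model)
        cumulative += w
        if cumulative >= level:
            break
    return chosen


# Hypothetical AIC scores for four candidate models.
aic_scores = {"GTR+I": 2489.0, "GTR+I+G": 2490.0, "HKY+I": 2493.0, "JC": 2520.0}
deltas, weights = differences_and_weights(aic_scores)
print(deltas)
print(weights)
print(confidence_set(weights))
```

In this example the two best models together hold about 92% of the weight, so a third model is needed to reach the 0.95 threshold, while JC (∆ = 31) contributes essentially nothing.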
and
$$I_{\varphi_{A-C}}(M_i) = \begin{cases} 1 & \text{if } \varphi_{A-C} \text{ is in model } M_i \\ 0 & \text{otherwise} \end{cases}$$
Note that we need to be careful when interpreting the relative importance of parameters. When the number
of candidate models is less than the number of possible combinations of parameters, the presence-absence of
some pairs of parameters can be correlated, and so can their relative importances.
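A minimal sketch of how the indicator function is typically used: assuming the usual Burnham and Anderson definition, the relative importance of a parameter is the sum of the weights of the models that contain it. The weights, the parameter labels and the function name below are hypothetical.

```python
def relative_importance(weights: dict, model_params: dict, parameter: str) -> float:
    """Sum the weights of all models whose parameter set contains `parameter`.

    The membership test plays the role of the indicator function:
    it contributes the weight when the parameter is in the model, 0 otherwise.
    """
    return sum(w for model, w in weights.items() if parameter in model_params[model])


# Hypothetical weights (summing to 1) and parameter sets per model.
weights = {"GTR+I": 0.5, "GTR+G": 0.3, "HKY+I": 0.2}
model_params = {
    "GTR+I": {"A-C", "I"},
    "GTR+G": {"A-C", "G"},
    "HKY+I": {"I"},
}

print(relative_importance(weights, model_params, "A-C"))
print(relative_importance(weights, model_params, "I"))
print(relative_importance(weights, model_params, "G"))
```

In this toy set no model contains both I and G, so their importances are forced to be negatively related, which is the kind of correlation the note above warns about.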
References
Abdo, Z., Minin, V., Joyce, P., and Sullivan, J. (2005). Accounting for uncertainty in the tree topology has little effect on the decision-theoretic approach to
model selection in phylogeny estimation. Molecular Biology and Evolution, 22, 691–703.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Burnham, K. and Anderson, D. (1998). Model selection and inference: a practical information-theoretic approach. Springer-Verlag, New York, NY.
Burnham, K. and Anderson, D. (2003). Model selection and multimodel inference: a practical information-theoretic approach. Springer-Verlag, New York, NY.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17, 368–376.
Felsenstein, J. (1988). Phylogenies from molecular sequences: inference and reliability. Annual Review of Genetics, 22, 521–565.
Goldman, N. (1993a). Simple diagnostic statistical test of models of DNA substitution. Journal of Molecular Evolution, 37, 650–661.
Goldman, N. (1993b). Statistical tests of models of DNA substitution. Journal of Molecular Evolution, 36, 182–198.
Goldman, N. and Whelan, S. (2000). Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Molecular
Biology and Evolution, 17, 975–978.
Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22,
160–174.
Hoeting, J., Madigan, D., and Raftery, A. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–417.
Huelsenbeck, J., Larget, B., and Alfaro, M. (2004). Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Molecular Biology
and Evolution, 21, 1123–1133.
Hurvich, C. and Tsai, C. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
Johnson, J. and Omland, K. (2003). Model selection in ecology and evolution. Trends in Ecology and Evolution, 19, 101–108.
Jukes, T. and Cantor, C. (1969). Evolution of protein molecules. Academic Press, New York, NY, pages 21–132.
Kendall, M. and Stuart, A. (1979). The advanced theory of statistics. Charles Griffin, London.
Kimura, M. (1980). A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. Journal of
Molecular Evolution, 16, 111–120.
Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences, U.S.A, 78,
454–458.
Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam’s window. Journal of the American
Statistical Association, 59, 1335–1346.
Minin, V., Abdo, Z., Joyce, P., and Sullivan, J. (2003). Performance-based selection of likelihood models for phylogeny estimation. Systematic Biology, 52, 674–683.
Ohta, T. (1992). Theoretical study of near neutrality. II. Effect of subdivided population structure with local extinction and recolonization. Genetics, pages
917–923.
Posada, D. (2003). Using Modeltest and PAUP to select a model of nucleotide substitution. pages 6.5.1–6.5.14.
Posada, D. and Buckley, T. (2004). Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches
over likelihood ratio tests. Systematic Biology, 53, 793–808.
Posada, D. and Crandall, K. (2001). Selecting the best-fit model of nucleotide substitution. Systematic Biology, 50, 580–601.
Raftery, A. (1996). Hypothesis testing and model selection. Markov chain Monte Carlo in practice. Chapman and Hall, London, pages 163–187.
Kullback, S. and Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Sugiura, N. (1978). Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics–Theory and Methods,
A7, 13–26.
Sullivan, J. and Joyce, P. (2005). Model selection in phylogenetics. Annual Review of Ecology, Evolution and Systematics, 36, 445–466.
Tamura, K. and Nei, M. (1993). Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees.
Molecular Biology and Evolution, 10, 512–526.
Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. Some mathematical questions in biology - DNA sequence analysis.
Amer. Math. Soc., Providence, RI, pages 57–86.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92–107.
Whelan, S. and Goldman, N. (1999). Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Molecular Biology
and Evolution, 16, 1292–1299.
Yang, Z., Goldman, N., and Friday, A. (1995). Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Systematic Biology, 44,
384–399.
Zharkikh, A. (1994). Estimation of evolutionary distances between nucleotide sequences. Journal of Molecular Evolution, 39, 315–329.