aafUserManual
Feb 2016
Table of Contents
Requirements
Installation
Usage and options
1) aaf_phylokmer.py
2) aaf_distance.py
3) aaf_tip.py
4) nonparametric_bootstrap.py
5) parametric_bootstrap.R
Tutorial with dummy dataset
Description of output files
Parameter Selection
a. Optimal k
b. Filter or not
c. Tip trimming (optional)
d. Bootstrap
Reference
Introduction
AAF (alignment and assembly-free) is a free software package that reconstructs phylogenies from
next-generation sequencing data without assembly or alignment. It takes the raw sequencing reads of
all samples together, generates a distance matrix based on the proportion of k-mers shared between
each pair of samples, and reconstructs a phylogeny from that distance matrix.
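To make the idea of a distance based on shared k-mers concrete, here is a minimal Python sketch. It is not the AAF implementation: the function names are hypothetical, reverse complements are ignored, normalising by the smaller k-mer set is an assumption, and the transform from shared proportion to distance assumes independent mutations so that the shared proportion decays as exp(-k * d).

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_proportion(kmers_a, kmers_b):
    """Proportion of shared k-mers, relative to the smaller k-mer set.

    Normalising by the smaller set is an assumption of this sketch.
    """
    smaller = min(len(kmers_a), len(kmers_b))
    if smaller == 0:
        raise ValueError("both samples need at least one k-mer")
    return len(kmers_a & kmers_b) / smaller


def distance(p_shared, k):
    """Convert a shared proportion into a distance (assumed transform).

    Assumes each base mutates independently, so p_shared ~ exp(-k * d).
    """
    if p_shared <= 0:
        return float("inf")  # no shared k-mers: distance is unbounded
    return -math.log(p_shared) / k
```

For real data the k-mer sets are far too large to hold as Python sets, which is why AAF delegates counting to the compiled kmer_count tools.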
AAF is mainly designed for large eukaryotic genomes. Therefore we divided the whole reconstruction
process into two major steps: 1) k-mer counting and 2) distance calculation and phylogeny
reconstruction. Two separate Python scripts take care of these steps: 1) aaf_phylokmer.py and
2) aaf_distance.py. There are three more optional scripts in the AAF package: one for trimming the
excess tips of the phylogeny that result from sequencing errors and incomplete coverage
(aaf_tip.py), and two for bootstrapping the constructed phylogeny (nonparametric_bootstrap.py and
parametric_bootstrap.R). The rest of the manual introduces their usage and options.
We have included two tutorials in this manual. One uses a dummy dataset with 10 species and short
genomes, and the other uses a real dataset with 21 tropical tree genomes as described in the AAF paper.
The first showcases how to organize data and run the first scripts. The second demonstrates possible
issues when dealing with real, large datasets, with a special focus on parameter selection, including
k and filtering. A separate section details the reasoning behind optimal parameter selection.
Requirements
AAF can be used on a UNIX system (Linux, OS X, ...) with Python 2.6 or a higher 2.x version (NOT
Python 3.0+) and the g++/gcc compilers. Biopython (http://biopython.org/wiki/Main_Page) is required for
the non-parametric bootstrap, and R (http://cran.r-project.org/) and the R package 'ape' are required for
the parametric bootstrap.
Installation
0. Download the archive of the most recent version from http://sourceforge.net/projects/aaf-phylogeny
and decompress it.
1. Compile kmer_count(x) and kmer_merge as follows. "path_to_AAF" stands for your path to the
AAF folder generated by decompressing AAF.tar.gz.
a. path_to_AAF/AAF$ cd phylokmer
b. path_to_AAF/AAF/phylokmer$ make
c. Add kmer_count(x) and kmer_merge to your PATH or working directory
2. Compile fitch_kmerX, consense and treedist
a. path_to_AAF/AAF$ cd phylip_src
b. path_to_AAF/AAF/phylip_src$ make all
c. Add fitch_kmerX and consense to your PATH or working directory
1) aaf_phylokmer.py
Usage: aaf_phylokmer.py [options]
Options:
--version show program's version number and exit
-t NTHREADS: number of threads to use. Depends on how many cores are available on your machine.
Set at 1 by default.
-n FILTER: how many times a k-mer needs to be in the sample to be counted as present. This serves as
the filter for singletons, which could be the result of sequencing errors. See the Parameter Selection
section for more details. Set at 1 by default.
-f SEQFORMAT: format of the sequence files, FA or FQ. The default is set as FA.
-o OUTFILE: output filename. If you would like your output file to be compressed, provide a name
that ends with .gz. Otherwise it will not be compressed. The default output file is phylokmer.dat.gz,
which is compressed.
-d DATADIR: directory containing the data. Users should strictly follow the data structure required by
AAF. Sequence files for each sample need to be in one directory named after that sample. Therefore,
there will be N directories for N samples and the name of the directories will be the names displayed in
the final phylogenetic tree. All the sample folders should be placed into the same directory, which will
be your data directory requested by aaf_phylokmer.py. Accepted extensions for sequence files
include: .fa(sta)(.gz), .fq(.gz), and .fastq(.gz). See the “data” directory in the package as an example.
-G MEMSIZE: the total memory allowance. Each kmer_count thread has G/t memory allowance. Set at
4G by default.
-W WITHKMER: include the k-mers in the shared k-mer table. When the final goal is to construct a
phylogeny, we do not need to know the specific pattern of each k-mer, so by default only the
frequencies of k-mers are kept in the shared k-mer table. However, if a downstream analysis needs
k-mers with a certain pattern, the k-mers need to be kept. Use -W to keep them.
-s SIM: This will print out the commands that are going to run without executing them.
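The directory layout required by -d DATADIR can be checked before a long run. The following is a minimal sketch, not part of the AAF package: the function names are hypothetical, and the suffix list is our reading of the accepted extensions given above. It lists each sample directory with its sequence files and reports sample directories that contain none.

```python
import os

# Extensions listed for -d DATADIR in this manual.
ACCEPTED_SUFFIXES = (".fa", ".fasta", ".fq", ".fastq",
                     ".fa.gz", ".fasta.gz", ".fq.gz", ".fastq.gz")


def list_samples(datadir):
    """Map each sample directory name to its sorted sequence file names."""
    samples = {}
    for name in sorted(os.listdir(datadir)):
        sample_dir = os.path.join(datadir, name)
        if not os.path.isdir(sample_dir):
            continue  # only directories count as samples
        files = [f for f in sorted(os.listdir(sample_dir))
                 if f.endswith(ACCEPTED_SUFFIXES)]
        samples[name] = files
    return samples


def empty_samples(samples):
    """Return the names of sample directories without any sequence file."""
    return [name for name, files in samples.items() if not files]
```

Calling list_samples("data") on the example data directory returns one entry per sample; the keys are the names that will appear as tips in the final tree.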
2) aaf_distance.py
Usage: aaf_distance.py [options] -i <input filename>
Options:
--version show program's version number and exit
-t NTHREADS: number of threads to use. Depends on how many cores are available on your machine.
Set at 1 by default.
-o OTPF: prefix of the output files, including the distance matrix (.dist) and the phylogenetic tree (.tre).
Default is set as “aaf”.
-f COUNTF: wc file generated from kmer_count. This file contains the k-mer diversity of each sample.
The default is phylokmer.dat.wc
3) aaf_tip.py
Usage: aaf_tip.py [options] -i <input tree file> -k <kmer size> --tip <information for tip correction>
Options:
--version show program's version number and exit
--tip TIP_FILE: Trimming the excess tips caused by incomplete coverage and sequencing errors requires
additional information on the average coverage, read length and sequencing error of each sample. Put
this information into a tab-delimited text file in the format of tip_info_test.txt. See the suggestions on
estimating coverage and sequencing error in the Parameter Selection section.
-n: add this flag to the command if the filter was used during tree construction.
-f COUNTF: wc file generated from kmer_count. This file contains the k-mer diversity of each sample.
The default is phylokmer.dat.wc
4) nonparametric_bootstrap.py
Usage: nonparametric_bootstrap.py [options]
Options:
-h, --help show this help message and exit
-t NTHREADS: number of threads to use. Depends on how many cores are available on your machine.
Set at 1 by default.
-n FILTER: how many times a k-mer needs to be in the sample to be counted as present. Set at 1 by
default.
-f SEQFORMAT: Format of the sequence files, FA or FQ. The default is set as FA.
-o OUTFILE: file name of the merged k-mer table. If you would like your k-mer table to be
compressed, provide a name that ends with .gz. Otherwise it will not be compressed. The default output
file is phylokmer.dat.gz, which is compressed.
-G MEMSIZE: the total memory allowance. Each kmer_count thread has G/t memory allowance. Set at
4G by default.
--S1: number of times to resample the reads for each sequence file. This is the first stage of our
two-stage bootstrap. The result of this stage shows the variance due to sequencing error and incomplete
coverage. Set at 0 by default, which means skipping the first stage of the bootstrap and only resampling
the k-mer table.
--S2: number of times to resample the total k-mer table generated from one instance of resampling of
the reads. If --S1 is set to 0, the resampling is done on the real k-mer table generated from the original
data. Set at 0 by default, which means skipping this step.
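The interplay of --S1 and --S2 can be illustrated with a short Python sketch. This is not the code of nonparametric_bootstrap.py: the function names are hypothetical, sampling with replacement is the standard bootstrap and is assumed here, and the replicate plan only restates the option descriptions above.

```python
import random


def resample(items, rng):
    """Draw len(items) elements from items with replacement."""
    return [rng.choice(items) for _ in range(len(items))]


def two_stage_plan(s1, s2):
    """List (read_replicate, table_replicate) pairs implied by --S1 and --S2.

    read_replicate is None when S1 = 0, i.e. the original reads are used.
    table_replicate is None when S2 = 0, i.e. the k-mer table is not resampled.
    """
    read_reps = list(range(s1)) if s1 > 0 else [None]
    if s2 == 0:
        # No second stage: one k-mer table per resampled read set.
        return [(r, None) for r in read_reps if r is not None]
    return [(r, t) for r in read_reps for t in range(s2)]


rng = random.Random(1)
reads = ["read_a", "read_b", "read_c", "read_d"]
print(resample(reads, rng))       # one stage-1 replicate of the reads
print(len(two_stage_plan(2, 3)))  # 2 read replicates x 3 table replicates
```

With both options left at 0 the plan is empty, which matches the defaults described above: no bootstrap replicate is produced.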
5) parametric_bootstrap.R
When bootstrapping over a large dataset takes too long, switch to the parametric bootstrap. This R script
estimates the variances in the two steps. It requires:
info file: contains the read length, sequencing error and coverage; also used in aaf_tip.py, default =
tip_info_test.txt
nshare file: contains the number of shared k-mers generated by aaf_distance.py (ends with
_nshare.csv), default = test_nshare.csv
nreadboot: number of replicates, default = 10
k: k-mer length used in previous steps, default = 21
i.filter: filter threshold used, default = 1
Tutorial with dummy dataset
2) Move to the phylokmer directory and compile kmer_count, kmer_countx, and kmer_merge
path_to_AAF/AAF$ cd phylokmer
path_to_AAF/AAF/phylokmer$ make
path_to_AAF/AAF/phylokmer$ cp kmer_count kmer_countx kmer_merge ../
3) Compile fitch_kmerX
path_to_AAF/AAF/phylokmer$ cd ../phylip_src
path_to_AAF/AAF/phylip_src$ make all
path_to_AAF/AAF/phylip_src$ cp fitch_kmerX consense ../
path_to_AAF/AAF/phylip_src$ cd ..
4) k-mer counting
path_to_AAF/AAF/$ python aaf_phylokmer.py -k 21 -d data -G 2
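When the k-mer counting step is launched from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of the AAF package: phylokmer_command and memory_per_thread are hypothetical helper names, and only the options -k, -d, -G and -t that appear in this manual are used.

```python
import shlex


def phylokmer_command(k, datadir, mem_gb, threads=1):
    """Build the argument list for the k-mer counting step shown above."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return ["python", "aaf_phylokmer.py",
            "-k", str(k), "-d", datadir,
            "-G", str(mem_gb), "-t", str(threads)]


def memory_per_thread(mem_gb, threads):
    """Memory allowance of each kmer_count thread (G/t in this manual)."""
    return mem_gb / threads


cmd = phylokmer_command(k=21, datadir="data", mem_gb=2)
# Print the command for inspection instead of running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be passed to subprocess.run from the AAF directory; printing it first plays the same role as the -s option, which shows the commands without executing them.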
Parameter Selection
Here we provide some guidelines in parameter selection using the dataset with 21 tropical trees as an
example.
a. Optimal k
As we described in the manuscript, the selection of k is a trade-off between avoiding multiple
mutations on one k-mer (which favors shorter k) and decreasing the chances of k-mer homoplasy
(which favors longer k). For the primate dataset in the manuscript, we plotted the theoretical predictions
of the proportion of shared k-mers, ph, calculated from the observed frequency distribution of k-mers,
together with the ph calculated without homoplasy (Fig. 2D), to help view the effect of different choices
of k. This procedure led to the selection of a k that corresponded to an accurate phylogeny. Therefore
this figure serves as a good indicator of the optimal k, and the choice can be further verified by
constructing phylogenies with k-mer lengths larger than the optimum to check phylogenetic consistency.
To plot the ph vs. k figure for your dataset, here is a checklist for the genome information that is
needed:
i. Sample names
ii. Coverage
iii. Genome size
iv. GC content
v. d (genetic distance)
vi. Qk
We are aware that not all of this information may be available, so we provide coarse estimation
methods for some of the categories.
ii. Coverage
There are multiple ways of estimating the sequencing coverage of your next-gen sequencing data. (1) If
the genome size is known, coverage = total bp / genome size. (2) If the genome size is unknown, we
can estimate the coverage by plotting the k-mer frequency distribution: “if a large fraction of k-mers
occur c times, we can estimate the sequencing coverage to be approximately c and derive an estimate of
the genome size from c and the total length of the reads.” (Marcais and Kingsford 2011). Here c is the
k-mer coverage. To get the base pair coverage, you need to correct c using base_coverage = c *
read_length / (read_length - k-mer_size + 1)
(see https://groups.google.com/forum/#!topic/bgi-soap/xKS39Nz4SCE). (3) When the coverage is low or
the sequencing error rate is high, there will be no clear peak in the k-mer frequency distribution at c.
This is actually the case for all the tropical tree species in our dataset except Ficus vasculosa (FV). A
coarse estimate of the k-mer coverage is then the total number of k-mers (including multiple copies of
the same k-mer) divided by the k-mer diversity (the number of k-mers that show up at least once). Some
assemblers (such as velvet and SOAPdenovo) report an estimate of the k-mer coverage as well.
Coverage information is also needed for tip correction.
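The three coverage estimates described above are simple ratios. As a minimal sketch (the function names are hypothetical, not part of the AAF package), they can be written in Python as:

```python
def coverage_from_genome_size(total_bp, genome_size):
    """Method (1): base coverage = total sequenced bp / genome size."""
    return total_bp / genome_size


def base_coverage_from_kmer_coverage(c, read_length, k):
    """Method (2): correct the k-mer coverage c to base pair coverage."""
    if read_length < k:
        raise ValueError("read_length must be at least k")
    return c * read_length / (read_length - k + 1)


def coarse_kmer_coverage(total_kmers, kmer_diversity):
    """Method (3): total k-mer count / number of distinct k-mers."""
    return total_kmers / kmer_diversity
```

For example, a k-mer coverage of c = 8 with 100 bp reads and k = 21 gives a base pair coverage of 8 * 100 / 80 = 10.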
iv. GC content
There are many tools to calculate the GC content of your samples. In the AAF package we have
provided our own, gc.py in the utils folder. Biopython needs to be preinstalled.
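For readers who want to see what such a calculation involves, here is a minimal sketch in plain Python. It does not reproduce gc.py; the function name is hypothetical, and ignoring ambiguous bases such as N is an assumption of this sketch.

```python
def gc_content(sequences):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    gc = at = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("no A, C, G or T bases found")
    return gc / total
```

The function accepts any iterable of sequence strings, so reads can be streamed from a file instead of being loaded all at once.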
v. d (Genetic distance)
The genetic distance of the group (average number of mutations per base pair) is used to set the scale of
the vertical axis in the ph vs. k figure. Because the figure is used to find k on the horizontal axis, the
conclusions from this figure about selecting k are mostly independent of the selection of d, so this
selection does not need to be very accurate. A reasonable strategy is to guess d, or use the default 0.1,
to select k. The subsequent phylogeny construction will give a good estimate of d from the distance
matrix, which then could be used to plot the figure.
vi. Qk
There is more than one way of calculating the frequency distribution of k-mers. One of the easiest ways
is to turn on the --stats option while counting the k-mers using jellyfish (Marcais and Kingsford 2011).
However, the maximum k that jellyfish can handle is 31. For k > 31, use kmer_countx to count the
k-mers, then calculate the frequency distribution of k-mers from the pkdat files (the output files of
kmer_count (k ≤ 25) and kmer_countx (k > 25)) using pkdat2hist.py in the utils folder that is provided
in the tutorial folder.
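To show what the frequency distribution of k-mers contains, here is a minimal in-memory sketch. It is neither jellyfish nor pkdat2hist.py: the function names are hypothetical, only the forward strand is counted, and the approach is only practical for toy inputs.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer over all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_distribution(counts):
    """Map each multiplicity m to the number of distinct k-mers seen m times."""
    return Counter(counts.values())
```

For the reads "AAAA" and "AAAC" with k = 3, the k-mer AAA occurs three times and AAC once, so the distribution has one k-mer at multiplicity 3 and one at multiplicity 1.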
After gathering all the information, we generated the ph vs. k figure for the 21 tropical trees dataset
(Fig. S6) using the R code phVSk.R in the utils folder. The trend for all the red lines (estimated ph
based on the Qk for each species) stabilized for k ≥ 25, and the difference between the red lines and the
black dashed line continued to decrease with larger k. Therefore, we constructed phylogenies for k from
25 to 31, and because the phylogenies were identical for k ≥ 27 (Fig. 7), we selected 27 as the optimal k
for the tropical trees dataset. The same phylogenetic topology was also obtained when k-mers were
filtered to remove singletons. For k greater than 31, the topology within the Ficus group showed some
small changes. We suspect that this is due to the loss of sensitivity to evolutionary changes when
the selected k-mer length is too long, especially for relatively small genomes (the Ficus group has
genome sizes less than half those of the other species).
To plot the ph vs. k figure for your dataset, simply replace the genome information for the
tropical trees with the information for your own dataset in the beginning of the R script phVSk.R.
b. Filter or not
In deciding whether or not to filter k-mers (i.e., only including k-mers if they occur at least twice in a
taxon), it is necessary to weigh the loss of information through false k-mers caused by sequencing
errors if there is no filtering against the loss of information through removing true singletons if
there is filtering (Fig. 5 in the manuscript). If there is a large range of coverages among
taxa within a dataset, it is best to decide whether or not to filter based upon the taxon with the lowest
coverage, because there is a large negative consequence of filtering low-coverage taxa (Fig. 5). For the
tropical trees, we chose not to filter because more than half of the species have coverage less than 5.
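The effect of the filter threshold on a single sample can be sketched as follows. The function name is hypothetical and the code is an illustration of the -n FILTER rule (a k-mer counts as present if it occurs at least n times), not the AAF implementation.

```python
def present_kmers(counts, n=1):
    """Return the k-mers that occur at least n times in one sample."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {kmer for kmer, count in counts.items() if count >= n}
```

With n = 1 every observed k-mer is kept, including singletons that may be sequencing errors; with n = 2 all singletons are dropped, including true ones, which removes a large share of the k-mers in a low-coverage sample.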
d. Bootstrap
i. Nonparametric vs. parametric bootstraps
Nonparametric bootstrapping can be computationally intensive when the dataset is large (>100G in
total). If it takes too long, switch to the parametric bootstrap. Also, with large genomes the
bootstrap values tend to stay at 100%.
[Figure S6: plot of ph (vertical axis, 0.0 to 0.4) against k (horizontal axis, 15 to 35) for d = 0.1]
Figure S6. Theoretical predictions of the proportion of shared k-mers, ph, calculated from the observed
frequency distribution of k-mers, Qk, for the tropical trees dataset ranging in genome size from 250M to
2Gbp assuming the true distance between taxa is d = 0.1 (divergence time 94Mya).
Reference:
Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences
of k-mers. Bioinformatics 27: 764–770.