Phylofriend User Guide: Dirk Struve Phylofriend at Projectory - de
Dirk Struve
phylofriend at projectory.de
https://github.com/yogischogi/phylofriend/
March 20, 2018
Contents
1 Introduction
2 Installation
3 Command Line Options
4 Examples
4.1 Create a Phylogenetic Tree
4.2 Pimp Your Tree with Nicer Labels
4.3 Use a Specific Set of Mutation Rates
4.4 Calibrate Your Data
4.5 Count Mutations
4.6 Marker Statistics
4.7 Use Data from YFull
4.8 Extract Data from a Spreadsheet in CSV format
5 Technicalities
5.1 Source Code Documentation
5.2 Mutation Model
5.3 Mutation Rates
5.4 CSV Input Format
5.5 Text Format
5.6 YFull Format
5.7 PHYLIP Format
6 Theory
6.1 Haplogroups
6.2 Haplotypes
6.3 Phylogenetic Trees
6.4 Genetic Distance
6.5 Modal Haplotype
6.6 Age Calculation
6.6.1 Mutation Counting
6.7 From Distances to Trees
References
1 Introduction
Phylofriend’s main purpose is to calculate genetic distances from Y-DNA
STR values. The results can be used as input for the PHYLIP[6] program
to create phylogenetic trees.
When I started creating phylogenetic trees I often found myself in a
difficult position. As a Linux user I was missing some of the tools available
under Windows. So I started to write this program to fill in the gaps and
make myself comfortable again.
This does not mean that you cannot use Phylofriend under Windows or on
the Mac. But currently there is no binary distribution available and you
will probably have a hard time installing Phylofriend and the associated
programs. So I only recommend this if you are an experienced user.
Phylofriend has some nice features. It can be used
• to extract STR values from Family Tree DNA projects and convert them
into simpler text files that are better suited for further processing.
Dirk
2 Installation
This guide is mainly targeted towards persons who use Linux Mint or other
Linux versions of the Debian family. Some familiarity with the use of Linux
commands is assumed.
Currently there are no binary distributions available for Windows or the
Mac. Users of these operating systems can use Phylofriend as well, but
installation will take some manual work. The best way is to follow
the instructions provided on the Go home page and the PHYLIP home page.
The following list applies to Linux users only:
2. Read the Go Getting Started guide. Make sure to set your GOPATH
variable and include it in your PATH so that Go programs can be
found.
3 Command Line Options
Command line options may be given in arbitrary order.
-help Prints available program options.
-personsin Filename or directory of files containing the persons’ Y-STR
values. If this is a single file it must contain results for multiple persons.
The input file format is CSV (comma separated values) or text format.
If a directory is provided for input it must contain multiple files in
YFull format, each file containing the results for a single person. The
person’s ID is extracted from the filename.
-personsin supports multiple file names separated by commas.
-labelcol Number of the column that is used for labels when reading CSV
files.
-mrin Filename of the mutation rates to use.
-model Mutation model to use. This may be hybrid or infinite. hybrid
uses stepwise counting for most markers except for the palindromic
ones. infinite uses the infinite alleles mutation model for all markers.
-anonymize If this is true persons’ names are replaced by numbers.
-modal Creates modal haplotype and performs TMRCA calculation.
-phylipout Filename for the distance matrix that can be fed into the PHYLIP[6]
program.
-mrout Filename for the output of the currently used mutation rates.
-txtout Filename for text output of persons and Y-STR values.
-htmlout Filename for HTML output of persons and Y-STR values.
-nmarkers Uses only the given number of markers for calculations.
-gentime Generation time.
-cal Calibration factor.
-reduce Reduces the number of persons by the given factor (for large num-
bers of samples).
-statistics Prints marker statistics.
4 Examples
4.1 Create a Phylogenetic Tree
1. Copy persons’ data from a Family Tree DNA project website into a
spreadsheet. If the Y-STR values do not appear properly try inserting
them into the spreadsheet as unformatted text.
You can access this column by using the labelcol option. Suppose your second
column contains names. You can create a distance matrix with names instead
of IDs by typing
phylofriend -personsin persons.csv -labelcol 2 \
    -phylipout infile -mrin 67-average.txt
Due to compatibility issues with other programs the labels must be 10
characters long and may only contain 8-bit characters. Phylofriend will apply
a transformation to make sure that the requirements are fulfilled but the
result is sometimes a bit strange.
You can also use the labelcol option to create trees that contain the origins
of people or the haplogroups. Although I strongly recommend building trees
only from people who belong to the same haplogroup, this is sometimes useful
if you want to know whether different haplogroups are close in their Y-STR
values.
If you want to publish your tree you will often need to protect the privacy
of the members. This is what the anonymize option is for. By typing
phylofriend -personsin persons.csv -phylipout infile -anonymize \
    -mrin 67-average.txt
you will get a distance matrix where the names are replaced by numbers.
they are just multiplied together but using two separate factors seems more
convenient for typical use cases.
A generation time of 32 years has proven to show good results [1]. You
can use it by typing
phylofriend -personsin persons.csv -phylipout infile \
    -gentime 32 -mrin 67-average.txt
In reality an optimal result is often hard to achieve. Within genealogical
time frames (about 400 years) and with only small numbers of tested
persons, you are often left with a large statistical error.
Even worse, the method of Y-STR counting has numerous pitfalls, some of
which are discussed in [7].
If you have a reliable paper trail or a well-defined historic event, you
can calibrate your data using the cal option. With cal you provide
an additional calibration factor by which the calculated genetic
distances are multiplied, for example
phylofriend -personsin persons.csv -phylipout infile \
    -gentime 32 -mrin 67-average.txt -cal 1.2
multiplies all genetic distances by a factor of 1.2.
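A minimal sketch of how the two factors act on a distance, assuming the raw distance is measured in generations and that gentime and cal are plain multipliers as described above. The function name and the sample distance of 10 generations are made up for illustration; this is not Phylofriend's source code.

```python
def scale_distance(generations, gentime=1.0, cal=1.0):
    """Scale a genetic distance by generation time and calibration factor.

    Assumption: both options are simple multipliers, so their order
    does not matter.
    """
    return generations * gentime * cal


# Hypothetical distance of 10 generations with -gentime 32 and -cal 1.2
years = scale_distance(10, gentime=32, cal=1.2)
print(round(years))
```

Because both factors are multiplied, passing -gentime 32 -cal 1.2 has the same effect on the result as a single factor of 38.4.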
The difference from the previous example is that the nmarkers option
makes sure that all persons have tested for all 37 markers, while the
37-count mutation rates also accept persons who have tested for a lower
number of markers. The result is an estimate. This option was introduced
because next generation sequencing has revealed Y-STR results for 587
markers. However, due to the technical restrictions of this method,
different persons usually get results for different, but largely
overlapping, marker sets.
Each line contains the marker name and the minimal and maximal values.
After that each marker value is listed together with its frequency, separated
by a colon.
Let’s take DYS391 as an example. It has two values, 10 and 11. The
value 10 occurs 1 time and the value 11 occurs 14 times.
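The DYS391 example can be reproduced with a few lines of Python. This is only a sketch: the function name is made up, and the exact spacing of the printed line is an assumption, not Phylofriend's actual output.

```python
from collections import Counter


def marker_statistics(values):
    """Return (minimum, maximum, sorted (value, frequency) pairs)."""
    counts = Counter(values)
    return min(counts), max(counts), sorted(counts.items())


# DYS391 as in the text: the value 10 occurs once, the value 11 occurs 14 times
dys391 = [10] + [11] * 14
lowest, highest, frequencies = marker_statistics(dys391)

# Assumed layout: name, minimum, maximum, then value:frequency pairs
line = f"DYS391 {lowest} {highest} " + " ".join(f"{v}:{n}" for v, n in frequencies)
print(line)  # DYS391 10 11 10:1 11:14
```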
phylofriend -personsin inputdir -phylipout infile -modal \
    -mrin 587-count.txt
This command will use up to 587 markers for comparison. The effect of
the file 587-count.txt is that the mutation differences are just counted. An
average mutation rate for 587 markers is not available.
5 Technicalities
5.1 Source Code Documentation
To access the source code documentation point your web browser to:
• http://godoc.org/github.com/yogischogi/phylofriend
If you want to modify the source code it is best to use godoc locally on
your computer.
This will give you a nice overview of the internal program documentation.
You can also click on function names to browse the source code.
Average Mutation Rates
• 12-average.txt
• 37-average.txt
• 67-average.txt
• 111-average.txt
In these files average mutation rates are used for the corresponding set
of markers. The mutation rates for 12, 37, 67 and 111 markers were taken
from [9].
• 37-12.txt
• 67-12.txt
• 111-12.txt
These files are basically the same as the average mutation rate files, but
for the first 12 markers the mutation rates are set per marker. This puts
appropriate weight on very stable markers. The mutation rates were taken from
Wikipedia [12]. Although there may be some doubt whether data from Wikipedia
can be trusted, the first 12 markers are well known and have been in use for
a long time. So I just adopted them. The values should be good enough for
most purposes.
For each file the 12 marker mutation rates were multiplied by a calibration
factor so that their average value matches that from the rest of the file.
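The calibration step can be sketched as follows, assuming "matches" means that the mean of the first 12 rates equals the mean of the remaining rates after scaling. The function name and the numbers in the example are hypothetical and are not taken from the rate files.

```python
def rescale_first_markers(rates, n_first=12):
    """Scale the first n_first rates so that their mean equals the mean
    of the remaining rates. The remaining rates are left untouched."""
    first, rest = rates[:n_first], rates[n_first:]
    mean_first = sum(first) / len(first)
    mean_rest = sum(rest) / len(rest)
    factor = mean_rest / mean_first  # the calibration factor
    return [r * factor for r in first] + rest


# Tiny made-up example: two per-marker rates followed by three average rates
rates = [0.001, 0.003, 0.004, 0.004, 0.004]
print(rescale_first_markers(rates, n_first=2))
```

The relative weights among the first markers are preserved; only their overall level changes.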
These files are useful for deeper history or if you observe changes on long
time stable markers. If several persons share the same value on a very stable
marker they probably belong together. So try these mutation rates to see if
the results make more sense than those obtained by using average markers.
• 37-count.txt
• 67-count.txt
• 111-count.txt
• 587-count.txt
These mutation rates can be used for marker counting between different
persons, even if they have not been tested on the exact same markers.
Phylofriend will give you an estimate of the expected marker difference
on the specified scale.
This approach is recommended for large marker sets from next generation
sequencing.
id1,"Dirk Struve",Germany,R1b-CTS4528,13,24,14,11,11-14,12,...
id2,"Pyl. O. Friend",Germany,R1b-CTS4528,13,24,14,11,11-14,...
When importing a file in comma separated values format the first column
must contain IDs. An arbitrary number of columns containing custom infor-
mation may follow. The last columns must contain at least 12 Y-STR values
in Family Tree DNA order. Rows containing comments are allowed.
Phylofriend will always try to parse the file as best as it can.
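A rough sketch of these rules in Python. Everything here is an assumption for illustration: the function name, the way trailing Y-STR columns are detected, and the idea that rows without 12 values are treated as comments. The sample row is assembled from the CSV and text format examples in this guide. Phylofriend's real parser may behave differently.

```python
import csv
import io
import re

# A Y-STR cell is a number or several numbers joined by hyphens (e.g. 11-14)
STR_CELL = re.compile(r"^\d+(-\d+)*$")


def parse_person(row):
    """Return (id, values) for a data row, or None for other rows."""
    if not row:
        return None
    cells = []
    # Walk backwards and collect the trailing columns that look like Y-STR values
    for cell in reversed(row[1:]):
        cell = cell.strip()
        if not STR_CELL.match(cell):
            break
        cells.append(cell)
    cells.reverse()
    # Multi-copy markers such as 11-14 count as separate values
    values = [int(v) for cell in cells for v in cell.split("-")]
    if len(values) < 12:
        return None  # assumed: comment row or incomplete data
    return row[0].strip(), values


sample = 'id1,"Dirk Struve",Germany,R1b-CTS4528,13,24,14,11,11-14,12,12,12,12,14,28\n'
for row in csv.reader(io.StringIO(sample)):
    print(parse_person(row))
```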
Dirk_Struv 13 24 14 11 11 14 12 12 12 12 14 28
Pyl._O._Fr 13 24 14 11 11 14 12 12 12 12 14 28
The text format is a simplified format intended for easy parsing and to
work well with other programs. For compatibility reasons the first column is
exactly 10 characters long and contains only 8-bit (non-Unicode) characters.
Spaces are transformed into underscores. The following columns contain Y-STR
values separated by tabs.
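The label rules can be sketched like this. The function name is made up, and two details are assumptions: characters outside the 8-bit range are replaced by a question mark, and short names are padded with spaces. Phylofriend's own transformation may differ.

```python
def make_label(name, width=10):
    """Turn a name into a fixed-width label for the first column."""
    # Assumption: anything that does not fit into Latin-1 becomes '?'
    eight_bit = name.encode("latin-1", errors="replace").decode("latin-1")
    # Spaces become underscores, then cut or pad to the fixed width
    return eight_bit.replace(" ", "_")[:width].ljust(width)


print(make_label("Dirk Struve"))     # Dirk_Struv
print(make_label("Pyl. O. Friend"))  # Pyl._O._Fr
```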
5.6 YFull Format
YFull files have a name like STR for YF01234 20160222.csv. Each line of
the file contains the result for a single marker and sometimes additional
information. Example:
DYS390;24;
DYS391;11;
DYS392;14;?
DYS393;13;
Although the file is in CSV format, semicolons are used instead of commas
as separators. This may cause trouble if you try to add additional results
by hand. Phylofriend does not care if a marker is provided multiple times.
The last occurrence of a name is considered the valid one. Thus you can add
results from other testing companies just by adding them to the end of the
file.
Phylofriend tries to extract a person’s ID from the filename. If a file is
named STR for YF01234 20160222.csv, the ID will be YF01234.
If you want to provide your own files, you do not need to stick to the YFull
naming convention. Just use the desired ID as a filename like ID1234.csv,
but the file must end in .csv.
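The reading rules for YFull files can be sketched in a few lines. Both function names are hypothetical, and the pattern used to find the ID is an assumption based on the single file name shown above (it accepts spaces or underscores as separators). Values are kept as strings because some markers may hold more than one number.

```python
import re
from pathlib import Path


def parse_yfull(text):
    """Map marker names to values. Later lines overwrite earlier ones."""
    markers = {}
    for line in text.splitlines():
        fields = line.strip().split(";")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue  # skip empty or incomplete lines
        markers[fields[0]] = fields[1]
    return markers


def person_id(filename):
    """Extract the ID from a YFull style name, else use the bare file name."""
    stem = Path(filename).stem
    match = re.match(r"STR[ _]for[ _](.+?)[ _]\d+$", stem)
    return match.group(1) if match else stem


text = "DYS390;24;\nDYS391;11;\nDYS392;14;?\nDYS393;13;\nDYS391;10;\n"
print(parse_yfull(text)["DYS391"])                # 10
print(person_id("STR for YF01234 20160222.csv"))  # YF01234
print(person_id("ID1234.csv"))                    # ID1234
```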
2
Dirk_Struv 0 0
Pyl._O._Fr 0 0
The first line contains the number of entries. An entry line contains
an ID that is 10 characters long and contains only 8-bit (non-Unicode)
characters. Spaces are transformed into underscores. The columns containing
genetic distances are separated by tabs. For readability reasons Phylofriend
writes only integers. If you need more precision you can scale the distance
by using the cal option.
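A small sketch of writing such a matrix, assuming a tab also separates the ID from the first distance and that distances are rounded to integers after scaling. The function name and the cal parameter of the function are illustrative only.

```python
def phylip_matrix(labels, distances, cal=1.0):
    """Format a square distance matrix in the layout described above."""
    lines = [str(len(labels))]  # first line: number of entries
    for label, row in zip(labels, distances):
        cells = "\t".join(str(round(d * cal)) for d in row)
        lines.append(f"{label[:10].ljust(10)}\t{cells}")
    return "\n".join(lines) + "\n"


labels = ["Dirk_Struv", "Pyl._O._Fr"]
print(phylip_matrix(labels, [[0, 0], [0, 0]]), end="")

# A distance of 0.4 is lost when rounded, but survives with a factor of 10
print(phylip_matrix(labels, [[0, 0.4], [0.4, 0]], cal=10), end="")
```

The second call shows why scaling helps: without the factor both entries would be written as 0.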
6 Theory
6.1 Haplogroups
Parts of the human Y chromosome are only inherited from father to son
[1, 5]. Usually these parts are unchanged but sometimes a mutation at a
single position occurs. Such mutations are called SNPs (Single Nucleotide
Polymorphisms). They are very stable and can be used to group people into
ancestral lines.
6.2 Haplotypes
Most people want to know more about family relationships. Relationships
are often verified by STR (Short Tandem Repeats) mutations. STRs are
repetitions of certain genetic patterns. They group people into haplotypes.
Figure 3: Phylogenetic trees group persons together according to their
genetic distances. Genetic distances are often associated with time scales
but this is only true for long time spans.
before the sons and associate the horizontal axis with a time scale assuming
that genetic distance is proportional to time.
Reality however shows a different picture. The currently available
standard tests (37, 67 and 111 markers) only offer a low resolution. So most
often we cannot measure any genetic distance between father and sons. This is
illustrated by tree b. Because the tree illustrates genetic distances and not
time distances, father and sons are all side by side.
If son 2 develops a mutation by chance we get a confusing picture. Son
2 is suddenly far away from his father and his brother. This is shown by
tree c. The reason for the great distance between father and son 2 is the low
resolution of the test. If we take the mutation rates from [9] we can calculate
how long it takes on average for a mutation to occur. For a 37 marker test
the result is 280 years, for a 67 marker test 210 years and for a 111 marker
test 130 years.
This means if the father and his sons took a 37 marker test and son 2 has
developed a mutation by chance he will appear to be 280 years away from
his father but the only reason for that is the low resolution of the test.
In most cases the standard tests are good enough. We do not need a
paternity test for genealogical purposes but it is important to remember that
the standard tests only place persons into time frames of hundreds of years.
6.4 Genetic Distance
There are many ways to calculate genetic distances. Here we use the method
of mutation counting as described by Anatole Klyosov in [8]. We simply count
the number of mutations by which one person differs from another and take the
result as the genetic distance. Take a look at these two 6-marker haplotypes:
At DYS390 and DYS391 Clas differs from Carl by one mutation. Thus
their genetic distance on the 6-marker scale is 2.
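Mutation counting is easy to sketch in code. The two functions below show stepwise counting and infinite alleles counting, the two ingredients of the hybrid model named in the command line options. The haplotype values are hypothetical and were chosen only so that the two persons differ by one step on the second and fourth marker; they are not the values from the guide's table.

```python
def stepwise_distance(a, b):
    """Stepwise counting: every repeat of difference counts as one mutation."""
    return sum(abs(x - y) for x, y in zip(a, b))


def infinite_alleles_distance(a, b):
    """Infinite alleles counting: any difference on a marker counts once."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical 6-marker haplotypes
carl = [13, 24, 14, 11, 12, 12]
clas = [13, 25, 14, 10, 12, 12]

print(stepwise_distance(carl, clas))          # 2
print(infinite_alleles_distance(carl, clas))  # 2
```

The two methods only disagree when a marker differs by more than one repeat: a marker with values 24 and 26 counts as 2 under stepwise counting but as 1 under infinite alleles counting.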
To represent this genetic distance in years we need to know how often
mutations occur. Thus we need a mutation rate. The mutation rate is
defined as follows:
\[ k = \frac{m}{g \, y} \qquad (1) \]
k: Mutation rate per marker and generation
m: Number of mutations
g: Number of generations
y: Number of Y-STR values (markers)
Mutation rates are derived from sample populations. But mutations occur
by chance. Thus for a precise measurement we need large numbers of
mutations. Different family lineages often exhibit different mutation rates.
So the standard mutation rates should be considered as a first estimate. You
can use Phylofriend’s cal option to adjust your data to historic events.
Now we can use equation 1 to calculate a genetic distance in generations.
Equation 1 is equivalent to:
\[ g = \frac{m}{k \, y} \qquad (2) \]
This gives us the number of generations between Carl and Clas. The only
thing we still need to know is the generation time. For historical purposes a
generation time of 25 years is commonly used. We get the time by multiplying
the number of generations with the generation time.
\[ t = g \, d \qquad (3) \]
t: Genetic distance in years
g: Number of generations
d: Generation time in years
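Equations 2 and 3 can be combined so that the time follows directly from the number of mutations. This is just a substitution of equation 2 into equation 3, using the same symbols:

```latex
t = g \, d = \frac{m}{k \, y} \cdot d = \frac{m \, d}{k \, y}
```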
Let us try this out. The number of mutations between Clas and Carl is
2. The mutation rate for a 6-marker haplotype is 0.002 [8]. So
\[ t = \frac{2 \cdot 25}{0.002 \cdot 6} \approx 4167 \text{ years} \qquad (5) \]
Wow, that is a very long time. But what does this number actually mean?
First, it is an average value with a very large margin of error, but
it is useful as a first estimate and for grouping people according to their
genetic distance. Second, it is the time by which Clas and Carl would be
separated if Clas were an ancestor of Carl (or the other way round). It is
not the time to their most recent common ancestor (TMRCA).
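The arithmetic behind the 4167 years can be checked with a short sketch that applies equations 2 and 3 one after the other. The function and parameter names are made up for illustration.

```python
def years_between(mutations, rate, markers, gentime):
    """Genetic distance in years from a mutation count."""
    generations = mutations / (rate * markers)  # equation 2
    return generations * gentime                # equation 3


# Clas and Carl: 2 mutations, rate 0.002, 6 markers, 25 years per generation
t = years_between(mutations=2, rate=0.002, markers=6, gentime=25)
print(round(t))  # 4167
```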
If Clas and Carl were living today, how long would the time to their
most recent common ancestor be? The first guess is that the common
ancestor would be in between his descendants in terms of genetic distance.
So Clas and Carl would each be 4167 years / 2 ≈ 2084 years away from him.
This is a good first guess but unfortunately reality is much more
complicated. Mutations occur by chance, and due to the laws of statistics
some people develop only a few mutations while others develop many. So
our first guess may give a totally wrong impression. Generally it is not a
good idea to calculate the time to a common ancestor based only on the
results of two people. The best way is to identify a group of people who
share a common lineage and calculate the time to the most recent common
ancestor for the whole group. Anatole Klyosov describes how this is done in [8].
For those who still want to use a TMRCA value based on the results of just
two people Bruce Walsh has developed a method to give a time estimate[14].
This is better than the naive calculation we have used before but it is still
an estimate and it is only valid for demographically stable populations.
The whole story is of course more complicated than the simple calculation
presented here. Different mutation models exist and some genetic markers
should be treated in a special way. Phylofriend uses a hybrid mutation
model that is a mixture of the stepwise mutation model and the infinite
alleles model. Both models are explained by Bruce Walsh in [13].
take the average. This should be a good estimate for the group’s genetic
distance to their ancestor.
Again we use the definition of the mutation rate (equation 1) but this time
we use the average value:
\[ k = \frac{\bar{m}}{g \, y} \qquad (6) \]
k: Mutation rate per marker and generation
m̄: Average number of mutations relative to modal haplotype
g: Number of generations
y: Number of Y-STR values per haplotype
The following example shows how to use the formula, but it is intended as
an easy-to-understand educational exercise. For real-world applications more
haplotypes (rule of thumb: use at least 5) and a larger average number of
mutations (4 or more is quite good) are needed.
The table on page 20 contains three haplotypes on the 6-marker scale
(N = 3, y = 6). Peter and Carl each differ from the modal haplotype by one
mutation and Clas by two. We use a generation time of 25 years (d = 25)
and the average mutation rate on the 6-marker scale is 0.002 [8]. So we get
\[ t = 25 \cdot \frac{1 + 1 + 2}{3 \cdot 0.002 \cdot 6} \approx 2778 \text{ years} \qquad (10) \]
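The same calculation as a short sketch, assuming the time is the generation time multiplied by the average number of mutations relative to the modal haplotype, divided by the mutation rate and the number of markers. The function name is hypothetical.

```python
def tmrca_years(mutations_to_modal, rate, markers, gentime):
    """Estimate the time to the common ancestor from mutation counts
    relative to the modal haplotype."""
    average = sum(mutations_to_modal) / len(mutations_to_modal)
    return gentime * average / (rate * markers)


# Peter, Carl and Clas differ from the modal haplotype by 1, 1 and 2 mutations
t = tmrca_years([1, 1, 2], rate=0.002, markers=6, gentime=25)
print(round(t))  # 2778
```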
Of course the simple method of mutation counting is not without
problems. The biggest one is getting a properly distributed population
sample. For example, if you want to calculate the TMRCA for a family which is
just a few hundred years old and by accident add a person who is at a distance
of several thousand years, your TMRCA value will be much too large.
On the other hand if you want to calculate the age of a whole haplogroup
and add a large number of members from one family who are all very close
together your TMRCA value will be much too small. So this method must
be handled with care.
A more general problem is that mutation rates are derived from popu-
lation samples and there is no guarantee that they fit to your data set. So
whenever possible try to find some historical event to calibrate your data.
Figure 4: Genetic distances can be represented by phylogenetic trees. Dif-
ferent algorithms often yield different results.
Figure 5: Haplotypes and genetic distances of two families. Although both
lineages have a different modal (ancestral) haplotype the haplotypes of in-
dividuals from both families are sometimes close together. In such cases
phylogenetic tree algorithms yield wrong results.
times members from different families come close to each other. In such cases
all algorithms will give you wrong results.
This is the reason why it is so important to test for haplogroups before
creating a phylogenetic tree. Haplogroups are caused by very stable muta-
tions. Haplotypes often overlap. So it should be ensured that all persons in
a tree belong to the same haplogroup.
References
[1] Dmitry Adamov, Vladimir Guryanov, Sergey Karzhavin, Vladimir
Tagankin, Vadim Urasin. Defining a New Rate Constant for
Y-Chromosome SNPs based on Full Sequencing Data. The Russian
Journal of Genetic Genealogy (Русская версия), Vol 6, No 2
(2014)/Vol 7, No 1 (2015).
[4] R. A. Canada, How does the infinite allele comparison method work
for palindromic markers?. Family Tree DNA, 2014, Date accessed:
2014-08-07.
[5] Family Tree DNA, Big Y White Paper. Family Tree DNA, August
2014.
[10] ISOGG, Listing Criteria for SNP Inclusion into the ISOGG Y-DNA
Haplogroup Tree - 2015. Date visited: 2015-10-06.
[11] Jonathon Shlens, A Tutorial on Principal Component Analysis.
Center for Neural Science, New York University, New York City,
Systems Neurobiology Laboratory, Salk Institute for Biological Studies,
La Jolla, 2009.
[14] Bruce Walsh, Estimating the Time to the Most Recent Common
Ancestor for the Y chromosome or Mitochondrial DNA for a Pair of
Individuals. Genetics Society of America, 2001.