
Full Tutorial: pyRAD v.3.0
date: 12/4/2014
author: Deren Eaton | [email protected]
Help forum: google group link (https://groups.google.com/forum/#!forum/pyrad-users)

This is the full tutorial for pyRAD v.3.0. It is an attempt to explain the inner workings of each parameter
setting used in assembling RADseq data sets in pyRAD. I highly recommend that users begin by
doing a run-through of the RAD tutorial (v.3.0)
(http://nbviewer.ipython.org/gist/dereneaton/1f661bfb205b644086cc/tutorial_RAD_3.0.ipynb), which
only takes a few minutes and will help you gain a general sense of how the program works and
what the data and results look like. Then follow up with a more detailed look at this tutorial to see the
full range of options available, and if you still have questions, please send them to the google group at
the link above.

Contents:
1. Why use pyRAD?
2. Installation
3. Input files
4. Running pyRAD
5. RADseq/RRL datatypes
6. Example Tutorials
7. Output data formats
8. Hierarchical clustering
9. FAQs

1. Why use pyRAD?


Reduced-representation genomic sequence data (e.g., RADseq, GBS, ddRAD) are commonly used to
study population-level research questions, and consequently most software packages for assembling
or analyzing such data are designed for sequences with little variation across samples. Phylogenetic
analyses typically include species with deeper divergence times (more variable loci across samples),
and thus a different approach to clustering and identifying orthologs will perform better.
pyRAD, the software pipeline described here, is intended to maximize phylogenetic information across
disparate samples from RAD, ddRAD, or GBS data, hereafter referred to generally as RADseq. Written
in Python, the program is flexible and easy to use. The code is human-readable, open-source, and
may be modified to meet users' specific needs. With respect to constructing phylogenetic data sets,
the largest difference between pyRAD and alternative programs such as Stacks is in the way that loci
are clustered. Stacks clusters reads by grouping those with fewer than N differences between them, but
does not allow for indels, such that the presence of even a single indel causes a frameshift resulting
in many base differences between otherwise highly similar sequences. pyRAD instead uses a measure
of sequence similarity based on a global alignment of sequences through the program VSEARCH (or
USEARCH). Alignment clustering allows for indels, and should perform better in data sets including
more distantly related samples.
pyRAD was written with the aim that it be easy to use, fast, and flexible. It is easy in that installation
does not require any difficult libraries or extensions, and it requires few arguments on the
command line. It is fast, using both multi-processing and multi-threading, and it also offers a number of
heuristics to speed clustering of extremely large data sets. And it is flexible in that it has built-in
methods to account for different library types (RAD, ddRAD, GBS, paired-ddRAD, paired-GBS) as well
as quality filtering, including overlap among paired or reverse-complemented reads, and the presence
of adapter sequences. Thus, data require little or no filtering before analysis in pyRAD.
Two fast clustering methods, hierarchical clustering for large data sets and split clustering for paired-end
ddRAD data, are described further below. Using these methods, data sets with even many
hundreds of individuals can be assembled in a few days by distributing across multiple processing
cores. pyRAD can also be used to perform D-statistic tests on RADseq loci to test for historical
introgression. Assembled data sets can be output in a number of formats for easy input into external
applications such as BEAST, RAxML, TreeMix, Structure, Admixture, and migrate-n.
An open access pre-print describing pyRAD v.1 and comparing it with Stacks is available here: link
(http://biorxiv.org/content/early/2013/12/03/001081)

2. Installation
pyRAD
You can download a stable release that will work on Mac or Linux here
(https://github.com/dereneaton/pyrad) or clone the git repository onto your machine with the following
command:

In [ ]:
git clone https://github.com/dereneaton/pyrad.git
With git, future updates can be obtained by changing into the pyrad directory (cd pyrad) and running git pull.
pyRAD requires the common Python packages Numpy and Scipy, and two external programs, muscle
and vsearch.

Numpy & Scipy:


These will already be available on almost any HPC machine. To install them on a personal machine
running Ubuntu Linux you can use the command sudo apt-get install python-numpy python-scipy.
On a Mac there are several ways to install them, which vary depending on the version
of your OS; I recommend using conda install, which you should be able to find with google.

Muscle
Muscle is used for sequence alignments. It will already be available on most HPC machines. It can
be installed on Ubuntu Linux with the command apt-get install muscle, or you can find binaries
at www.drive5.com/muscle/downloads.

Vsearch
Vsearch is used to cluster sequences by similarity. Both source and binaries are available here:
https://github.com/torognes/vsearch (https://github.com/torognes/vsearch). Older versions of pyRAD
required the program Usearch (usearch_v.7.0.1090) for clustering, and this is still supported.
However, I now recommend vsearch as a replacement because it is free and open-source, and I've
found that it can perform more than 30X faster while yielding identical results. The speed improvement
does come at the cost of a greater memory load.

Take note of the location where you install muscle and vsearch. Below I will describe how to ensure
pyRAD can find your installations.
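
As a quick sanity check, you can ask where each dependency resolves. This is a minimal sketch in Python (shutil.which requires Python 3; adjust the names if you installed version-suffixed binaries such as muscle3.8.31_i86linux32):

In [ ]:
import shutil

# report where each external program resolves on the current $PATH;
# "NOT FOUND" means you will need to give the full path on lines 4-5
# of the params file (described below)
for prog in ("muscle", "vsearch"):
    path = shutil.which(prog)
    print("%-8s %s" % (prog, path or "NOT FOUND"))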

3. Input files
3.1 - the barcodes file
The barcodes file is a simple table linking barcodes to samples. Barcodes can be of varying lengths.
Each line should have one name and one barcode, separated by a tab or space.

In [ ]:
sample1    ACAGG
sample2    ATTCA
sample3    CGGCATA
sample4    AAGAACA

If your data are already de-multiplexed (sorted) into separate files for each individual then a barcodes
file is not needed.
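
Because a malformed barcodes file is a common cause of step 1 problems, it can be worth checking it before running pyRAD. A minimal sketch (the file name barcodes.txt is just an example; adjust to your own):

In [ ]:
# check a barcodes file for duplicate names, duplicate barcodes,
# and barcodes containing characters other than A, C, G, T
names, codes = set(), set()
with open("barcodes.txt") as infile:
    for line in infile:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        assert len(fields) >= 2, "bad line: " + line.strip()
        name, code = fields[:2]
        assert name not in names, "duplicate sample name: " + name
        assert code not in codes, "duplicate barcode: " + code
        assert set(code) <= set("ACGT"), "odd characters in: " + code
        names.add(name)
        codes.add(code)
print("OK: %d samples" % len(names))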

3.2 - the params file


The parameter input file, usually named params.txt, can be created with the -n option in pyRAD, as shown
below. This file lists all of the options or parameter settings necessary for an analysis.
Every line has a parameter (e.g., a number or character string) followed by any number of spaces or
tabs and then two hash marks (##), after which the parameter is described, or comments can be
added. In parentheses I list the step of a pyRAD analysis affected by the parameter. If a line is left
blank the default value is used. I highly recommend beginning every analysis by creating an empty
template params.txt file with the following command:

In [ ]:
pyRAD -n
The following section describes the params file line by line. The first 14 lines are required; all lines
that follow are optional and used for more advanced analyses or data filtering. For each input line I
list several possible formats:

Required lines 1-14


Line 1: The working directory for your analysis. This is where all of the output files from the
analyses will be placed. Examples:
In [ ]:
./                 ## 1. uses current as working directory
/home/user/RAD/    ## 1. uses set location as working directory

Line 2: The location of the raw fastq formatted Illumina sequence data. Use the wildcard
character * to select multiple files in the same directory. Paired-end data files should be in
pairs that differ only by the presence of "R1" and "R2" in their names. Data files can be
gzipped.

In [ ]:
./raw/*.fastq      ## 2. path to raw data files
raw/*.fastq.gz     ## 2. path to raw data files (gzipped)

Line 3: The location of the barcodes file (described above). If your data are already
de-multiplexed then lines 2 and 3 can be left blank, but line 18 (see below) must be entered.
Examples:
In [ ]:
./barcodes.txt     ## 3. path to barcodes file
mydata.barcodes    ## 3. path to barcodes file

Line 4: The command to call vsearch (or usearch). If vsearch is in your $PATH, then simply list
the executable file name here. Otherwise, enter the full path to vsearch that is required for you
to call it from the command line. Examples:
In [ ]:
vsearch                                 ## 4. in system PATH
/user/bin/usearch7.0.1090_i86linux32    ## 4. full path
usearch7.0.1090_i86linux32              ## 4. in system PATH

Line 5: The command to call muscle. Similar to above.

In [ ]:
muscle                     ## 5. path to call muscle
muscle3.8.31_i86linux32    ## 5. path to muscle long name

Line 6: The sequence of the restriction recognition site overhang. If your data are ddRAD, list
the common cutter (attached to first reads) followed by the rare cutter (attached to second
reads), separated by a comma. If you're not sure of the sequence of your cutters, look at your
raw data files for the common sequence found at the left side of your reads. Cutters can
contain ambiguous bases (example: CWGC). Examples:
In [ ]:
TGCAG        ## 6. PstI cutsite overhang C|TGCAG
CWGC,AATT    ## 6. ddRAD example for ApeKI, EcoRI

Line 7: The number of parallel jobs to run. For most non-clustering steps this will also be the
number of processing cores that are used. Parallel processing in pyRAD usually works by
executing individual samples separately on different processors, so you rarely gain speed by
setting a number greater than the number of samples in your data set. However, for clustering
steps 3 and 6 the number of threads per parallel job can be set to use many more cores (see
line 37). Example:

In [ ]:
8    ## 7. N processors
Line 8: The minimum depth of coverage to make a statistical base call at each site in a
cluster. Depending on the inferred error rate, five is typically the minimum depth at which a
base can be called. For popgen analyses you may want to use a higher minimum depth, like
10. If your data are very low coverage you can instead make majority-rule consensus base calls.
For this, set a low number here (e.g., 2), and turn on majority rule calling (line 31; see below).

In [ ]:
5    ## 8. mindepth for base calls


Line 9: The maximum number of low quality, undetermined (N) sites allowed in step 2 filtered
sequences. This number should be selected with respect to the clustering threshold. A low
clustering threshold (.85) can allow more undetermined sites, whereas a high clustering
threshold (.95) should exclude reads with too many undetermined sites or they will affect
clustering. Similarly, very long sequences (200 bp paired ddRAD) can be allowed more Ns since
there are more bases and because second reads may be lower quality.

In [ ]:
4    ## 9. maxN: max number of Ns in reads


Line 10: The similarity threshold used by vsearch for sequence clustering, entered as a
decimal. This value is used for both within-sample clustering and across-sample clustering. I
(more or less) require you to use the same threshold for both, and recommend doing so.
Given the way the clustering algorithm works it only makes sense (to me) to do so; otherwise
reads that do not cluster together within a sample may later cluster together when you do
across-sample clustering, which is undesirable. During clustering the cutsite bases remain
attached to the reads, thus all loci will have an invariable set of ~4-6 bases on one side. For
100 bp reads containing a 5-bp cutter and with a 5 bp barcode trimmed off, this results
in 95 bp of data, 5 of which are invariable. Thus, if you want to allow 5 base differences
between reads (90/95 = .947) you should use a setting of .94. Repeat regions are masked
using the 'dust' method by default (see masking; line 36). Example:

In [ ]:
.90    ## 10. clustering threshold
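
The arithmetic above generalizes to other read lengths; a small sketch of the calculation:

In [ ]:
def clust_threshold(read_len, n_diffs):
    """similarity that permits n_diffs base differences over a
    read of read_len bases (invariant cutsite bases included)"""
    return float(read_len - n_diffs) / read_len

# the worked example above: 95 bp reads (100 bp minus a 5 bp barcode),
# allowing 5 differences -> 90/95 = 0.947, so enter ~.94 on line 10
print(round(clust_threshold(95, 5), 3))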


Line 11: The type of reduced representation library. There are currently six options, entered
as all lowercase lettering: rad, ddrad, gbs, pairddrad, pairgbs, merge. The different treatments
of these data types are described below in the section RADseq/RRL datatypes.

In [ ]:
rad          ## 11. datatype
pairddrad    ## 11. datatype

Line 12: Minimum taxon coverage. The minimum number of (ingroup) samples with data for a
given locus to be retained in the final data set. If you enter a number equal to the full number
of samples in your data set then it will return only loci that have data across all samples. If you
enter a lower value, like 4, it will return a more sparse matrix, including any loci for which at
least four samples contain data. Example:
In [ ]:
4    ## 12. minCov
Line 13: Maximum number (or proportion) of shared polymorphic sites in a locus. Enter a
number, or a decimal with the prefix p (e.g., p.10 for 10%). This option is used to detect
potential paralogs, as a shared heterozygous site across many samples likely represents
clustering of paralogs with a fixed difference rather than a true heterozygous site. Example:

In [ ]:
4    ## 13. maxSharedH
Line 14: Prefix name for final output files. I usually enter a name indicative of the other
parameters from the lines above. Example:

In [ ]:
arabidopsis_c90d5m4p4    ## 14. output name prefix

Optional lines 15-37


These options are not necessary for all analyses and can be left empty in the params file, in which
case pyRAD uses their default values. They mostly pertain to options for additional quality filtering, for
paired-end analyses, or for increasing the speed of analyses.

Line 15: Subset selector for steps 2-7. If, for example, you have data for a number of
individuals from two different clades, and for one clade all of the sample names begin with the
letter A, then you can select only these samples to analyze by entering the shared prefix
"A*" on this line.
In [ ]:
A*    ## 15. selects prefix subset of files


Line 16: Add-on (outgroup) taxa selector. The minCov parameter on line 12 tells pyRAD to
keep only loci with data from at least N ingroup samples. If you want to maximize coverage
within a certain clade, you can set the other taxa as "outgroups": only loci with at least
minCov ingroup samples will be kept, and any additional matching data from outgroup taxa
will be included in those loci, but will not count toward the minimum number of samples. List
outgroup taxa names here separated by commas.

In [ ]:
outg1,outg2    ## 16. outgroup selector for step 7

Line 17: Exclude taxa. When you are constructing data sets at step 7 you can exclude certain
taxa in order to maximize coverage among the other included samples. For example, you may
want to exclude taxa that are found to have very few data, or to exclude outgroups for
population analyses. List sample names here, comma separated. Example:
In [ ]:
sample3,sample4    ## 17. exclude from step 7

Line 18: If your data are already de-multiplexed into separate files for each barcode then you
can skip step 1 of the analysis and go straight to step 2. In that case, you must enter the location
of your sorted fastq files on this line. The reads should have the barcode trimmed off. If they also
have the cutsite trimmed off, then enter the @ symbol at the beginning of the line to indicate
this. Files can be compressed (.gz) or not. Examples:
In [ ]:
~/RAD/fastq/*.fastq     ## 18. sorted files
@~/RAD/fastq/*.fastq    ## 18. sorted and stripped files

Line 19: The maximum number of mismatches allowed in a barcode during de-multiplexing.
The default is 1; it's usually not necessary to adjust this.
In [ ]:
1    ## 19. max barcode mismatches


Line 20: Phred quality score offset (usually 33, but sometimes 64). Base calls with a phred
score below 20 are converted to Ns. If you want this filter to be more strict you can raise
the offset. For example, to raise the minimum score to 30, set the offset to 43. If
left blank the default is 33.

In [ ]:
33    ## 20. min Quality score step 2
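
To see how the offset interacts with the minimum score, the underlying arithmetic is sketched below (illustrative only; pyRAD's own filtering code may differ in detail):

In [ ]:
# a base's phred score is the ASCII value of its quality character
# minus the offset; a base is kept when that score is >= 20, so raising
# the offset from 33 to 43 effectively raises the cutoff from 20 to 30
for qual_char in ("5", "?", "I"):
    for offset in (33, 43):
        score = ord(qual_char) - offset
        print("char %r offset %d -> score %d (%s)"
              % (qual_char, offset, score, "keep" if score >= 20 else "N"))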


Line 21: Strictness of filtering in step 2. (0) is the default and means no filtering for barcodes,
adapters, or cut sites, only correction of base calls by quality score. (1) looks for
cutsite+adapters and trims them out. (2) tries to detect these while allowing some errors in
them, thus enforcing the most strict filtering. For most data sets that do not include many
overlapping and short fragments, option 0 or 1 is recommended. If you prepared your own
libraries instead of using a commercial service I highly recommend running at least filter=1 to
check for adapter sequences.
In [ ]:
0    ## 21. strictness of Q filter


Line 22: a priori error rate and heterozygosity. Step 4 of the analysis jointly estimates the error
rate and heterozygosity, and step 5 uses these values to make base calls. If for some reason
you wish to skip the error rate estimate (e.g., your samples are polyploid), then you can enter
an a priori estimate of the error rate here. You could estimate this from data for a diploid
relative, for example. Caution: you should generally let the program estimate these values using
step 4; the error rate there is measured on data that have already passed through some
filters, so error rates estimated from elsewhere may not be correct. Across several data sets I
typically find error rates between 0.0001 and 0.002, depending on read length and sequencing
chemistry. Example:

In [ ]:
0.0005,0.001    ## 22. E and H, respectively

Line 23: Maximum number of Ns in a consensus sequence. Clustering across samples can
be affected by the number of Ns, so this number should be chosen with respect to the
clustering threshold (line 10) and the lengths of reads. Default is 4.
In [ ]:
5    ## 23. maxN in consensus seqs


Line 24: Maximum number of Hs (heterozygous sites) in a consensus sequence. Among several
methods to exclude potential paralogs or highly repetitive genomic regions, you can set a
maximum number of heterozygous sites in a consensus sequence. Default is 4.

In [ ]:
5    ## 24. maxH in a consensus seq


Line 25: Ploidy (options = 1, 2, or any number >2). Default=2. If diploid, this allows only 2
haplotypes in a consensus sequence after correcting for sequencing errors, and excludes
consensus seqs with more than 2 alleles. If set to 1 (haploid), then E and H are estimated with H
fixed to 0.0, base calls are made with this estimated error rate, and consensus seqs with more
than 1 allele present are excluded. If ploidy is set >2, then E, H, and base calls are made
using diploid assumptions, but seqs with more than 2 alleles are not excluded (see the polyploidy
section in FAQs). Example:

In [ ]:
1    ## 25. max haplos (default=2=diploid)


Line 26: Maximum number of SNPs in a final locus. This can remove potential effects of poor
alignments in repetitive regions in a final data set by excluding loci with more than N SNPs in
them. For paired data you can set the max SNPs in the first read and second read separately, or
use one value for the pair. Examples:

In [ ]:
10      ## 26. max SNPs in a final locus
8,12    ## 26. max SNPs in 1st and 2nd reads of a pair

Line 27: Maximum number of insertions/deletions. If only one value is given, it applies to
within-sample clusters only. If two values, the first is for within-sample clusters, the second for
across-sample clusters. For paired data, enter four values: within_read1, within_read2,
across_read1, across_read2. Defaults are 3,6,99,99 respectively. Examples:
In [ ]:
3            ## 27. max indels in a within-sample cluster
3,10         ## 27. max within-sample, max across-sample cluster
3,6,10,10    ## 27. max within_read1,within_read2,across_read1,across_read2

Line 28: Random number seed. Randomization is used to sort the input order of sequences
before clustering in steps 3 and 6. This can have a (usually minimal) effect on results. For the
unlinked SNPs outputs (.usnps, .treemix, .str, .geno) one SNP is randomly sampled from each
variable RAD locus in step 7. The default seed is 112233, so analyses are repeatable by
default. You can change the seed to get a different random sample.
In [ ]:
998877    ## 28. random number seed

Line 29: Allow overhanging ends of reads in the final data set. If reads are different lengths or
overlap to different degrees, 1,1 will trim to the shortest sequence on either side of a locus; 0,1
would trim only the right side; 0,0 allows both ends to overhang. For paired data, 1,1,1,1
would trim overhangs on left1,right1,left2,right2.
In [ ]:
1,1    ## 29. overhang/trim ends of locus


Line 30: Output formats. * selects all possible formats. Options = (p,n,a,s,v,u,t,m,k,g). List
letters comma-separated. See the section on output formats.

In [ ]:
a,n    ## 30. output formats


Line 31: Call a simple consensus (majority base) on clusters with depth less than mindepth
(this allows you to set line 8 as low as 2). Statistical base calls are still made on high coverage sites.

In [ ]:
1    ## 31. call majority base on low depth sites


Line 32: Keep trimmed sequences longer than X (default=0=don't keep). This option should
only be used for very messy single-end GBS or ddRAD data, where a size selection method
did not work properly, such that many fragments were shorter than the read length and thus
the adapters were frequently sequenced. Instead of discarding sequences with adapters, this
option trims off the adapters and keeps the remaining sequence if it is longer than X.

In [ ]:
0     ## 32. don't keep trimmed sequences
50    ## 32. keep trimmed reads longer than 50
Line 33: Maximum depth of a within-sample cluster. The default (empty) is
max(meandepth + 2*SD, 500), meaning the higher value of either the mean plus two times the
standard deviation of cluster depth, or 500. If instead you want to set an absolute value, enter
it here. Or enter a ridiculously high value for no maxdepth filtering.

In [ ]:
50000         ## 33. a set max depth
9999999999    ## 33. no max depth

Line 34: Minimum number of copies of de-replicated reads used for clustering. Default=1.
Reads are dereplicated, or collapsed, into a single read that records its number of
occurrences before clustering. The default is to not exclude any data, but for exploratory
analyses minderep=2 may be useful because it can drastically increase the speed of
within-sample clustering. In general I do not recommend discarding data for your final analysis,
but for data sets with deep coverage it has little effect on results and greatly improves
clustering times.
In [ ]:
2    ## 34. exclude singletons
5    ## 34. exclude reads occurring fewer than 5 times
Line 35: Use hierarchical clustering (default=No=0). See the hierarchical clustering section. This
can be used to speed up across-sample clustering, and also to improve clustering of close
relatives when assembling a data set that also includes very divergent relatives.

In [ ]:
1    ## 35. use hierarchical clustering


Line 36: Repeat masking. Default=1=use the default masking method in vsearch or usearch.
If set to 0 then no masking is used. Vsearch uses the 'dust' masking method whereas usearch
has a custom method.

In [ ]:
0    ## 36. turn off repeat masking


Line 37: Maximum number of threads per parallel clustering job used by vsearch. Default=6.
This has not been optimized yet. Vsearch is much faster than usearch even when processors x
threads is greater than the available number of cores, though this uses more memory. Until I
or someone else finds an optimal setting, feel free to change this and see whether you get a
speed increase.

In [ ]:
6    ## 37. N vsearch threads

Remaining lines:
Group/clade assignments. These are used for "population" formatted output files (e.g., treemix,
migrate-n), and for hierarchical clustering. Each group is listed on a separate line; it is OK to leave a
line blank. Each line contains three elements, each separated by spaces or tabs: (1) a group name
(e.g., 'A' or '1'); these names are arbitrary, just choose something different for each group. (2) The
minimum size of a cluster within this group; any cluster smaller than this will not be clustered
across groups, or not output in the assembled data set. (3) The names of samples/individuals in this
group. Sample names can be individually listed comma separated, or selected using a wildcard
selector (e.g., *). Example:

In [ ]:
A 4 1A0,1B0,1C0,1D0
B 4 2*
C 4 3*

4. Running pyRAD
pyRAD should either be called from the directory in which it is distributed (i.e., alongside its other
dependency .py files), or you can create a symlink so you can call the program from anywhere. For
example, the command below creates a link so that you can call pyRAD by simply typing 'pyRAD',
provided you put the link in your $PATH.

In [ ]:
ln -s ~/pyrad_v.3.0/pyRAD pyRAD
If the symlink business seems confusing, don't worry. You can just call pyRAD from its directory. For
example, to call pyRAD if it had been downloaded to a user's home/ directory, type the command
below, which will print the help menu.

In [ ]:
~/pyrad_v.3.0/pyRAD -h
Or if you created a symlink:

In [ ]:
pyRAD -h

There are six options to the command line interface of pyRAD:


-h (help screen)
-n (generate new params file)
-p (designate params file)
-s (choose steps)
-d (D-test input file)
-D (generate new dtest input file)

The parameter input file (params.txt) should be created using the -n option and then filled in based on
the type of analysis you plan to do. If the params file is properly filled out then the entire analysis,
converting raw RADseq data into assembled de novo loci, can be done by simply entering:

In [ ]:
pyRAD -p params.txt
This will perform all steps (1-7) of the analysis, as described below. If, however, you wish to perform
only one, or a few steps at a time, then you can select these using the -s (steps) option. For example,
to select only the first step:

In [ ]:
pyRAD -p params.txt -s 1
And to run steps 2-7 you could enter:

In [ ]:
pyRAD -p params.txt -s 234567
After finishing one assembly, you can rerun step 7 with the same or a different params.txt file
indicating different step 7 filtering options.

In [ ]:
pyRAD -p params.txt -s 7
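
For example, you might build several final data sets that differ only in their step 7 settings. A sketch (it assumes you have saved copies of the params file, here named params_m4.txt and params_m12.txt, that differ on lines 12 and 14, and that pyRAD is callable as pyRAD, e.g., via the symlink above):

In [ ]:
import subprocess

# rerun only step 7 under each params file; the results of steps 1-6 are reused
for params in ("params_m4.txt", "params_m12.txt"):
    subprocess.check_call(["pyRAD", "-p", params, "-s", "7"])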
When learning new software, or applying it to a new data set, it can take time to find the appropriate
parameter settings and options. This can be especially frustrating with computationally intensive
analyses, where one might have to wait many hours (or days) before finding they made a simple
mistake. The -s option is an attempt to reduce this annoyance by breaking the analysis into seven
discrete jobs, each of which is performed sequentially. In this way, you know whether step 1 worked
before moving to step 2. And if step 2 fails, you retain the results of step 1, and can start again from
where the error arose. I suggest running the example data sets (tutorials) to make yourself familiar with
the steps of an analysis and what the output files and statistics look like before beginning your own
analysis.

The seven steps described


Step 1: de-multiplexing. This step uses information from the barcodes file to separate sequences from
your raw fastq files into a separate file for each sample. These are placed in a new directory within your
working directory called fastq/. File names are not important for single-end data. However, for
paired-end reads it is necessary that the raw data file names follow a specific format: the first read files
must contain "_R1_" in the name, and the second read files must be identical to the first read files but
with "_R2_" in place of "_R1_". Here is an example pair of input files:

In [ ]:
yourfilename_R1_001.fastq
yourfilename_R2_001.fastq
A reminder: if your data are already demultiplexed this step can be skipped. You will not need to enter
a location for your raw data in the params file, but instead you must enter the location of your
demultiplexed files on line 18 of the params file.

Step 2: filtering. This step uses the Phred quality scores recorded in the FASTQ data files to filter low
quality base calls. Sites with a score below a set value are changed into Ns, and reads with more
than the allowed number of Ns are discarded. Files are written to the edits/ directory with the suffix
.edit. This step also implements a number of optional filters.

Step 3: within-sample clustering. This step first dereplicates the filtered sequences from step 2,
recording the number of times each unique read is observed. These are then clustered using
VSEARCH, recording whether they match within the set sequence similarity threshold. Sequences that
clustered together are then aligned and written to a new file in the clust.xx/ directory with the ending
.clustS.gz.
Step 4: error-rate and heterozygosity estimates. This step uses the ML equation of Lynch (20XX) to
jointly estimate error rate and heterozygosity from the base counts in each site across all clusters.
Results are written to the Pi_estimate.txt file in the stats/ directory.
Step 5: create consensus sequences. Using the mean error rate and heterozygosity estimated in step
4, this step creates consensus sequences for each cluster. Those that have less than the minimum
coverage, more than the maximum number of undetermined sites, more than the maximum number
of heterozygous sites, or more than the allowed number of alleles are discarded. In diploid data, if two
alleles are present the phase of heterozygous sites is retained in the consensus sequences.
Step 6: consensus sequences are clustered across samples using the same settings as in step 3. If
heterozygous, one allele is randomly sampled and used in this step, although both alleles are retained
in the final data set.
Step 7: alignment, filtering for paralogs, and output of a human-readable fasta-like file (.loci). A large
number of alternative formats are also available (see output formats). Final assembly statistics are
written to the stats/ directory in the .stats file. This step is relatively fast, and can be repeated with
different values for options 12, 13, 14, 16, 17, etc. to create different data sets that are optimized in
coverage for different samples.

5. RADseq/RRL datatypes
The following reduced representation library data types are supported by pyRAD; beside each I list
my general experience with analyzing empirical data sets. My quality judgements are based on the
frequency with which users have generated data sets that include many short-fragment reads with
adapter sequences, or which require extra types of analyses (e.g., reverse complement matching for
overlapping or paired-end GBS). This may change as library preparations of GBS and ddRAD libraries
improve. I describe each data type and how pyRAD treats them differently below.
RAD             (very good)
ddRAD           (usually good)
GBS             (sometimes difficult/messy)
nextRAD         (very good)
paired ddRAD    (sometimes difficult/messy)
paired GBS      (very difficult/messy)
other types     (... I haven't tried yet)

[Figure: library preparation and resulting read structure for each datatype.
RAD: the genome is digested at a cut site, fragments are sonicated and size selected, then
sequenced; R1 reads are structured barcode:common cut site:first read.
ddRAD: the genome is digested with a common and a rare cutter, fragments are amplified and size
selected, then sequenced; R1 reads are structured barcode:common cut site:first read, and R2 reads
(reverse complemented) are structured second read:rare cut site.
GBS: the genome is digested with a common cutter, fragments are amplified and size selected, then
sequenced; R1 reads are structured barcode:common cut site:first read, and R2 reads (reverse
complemented) are structured second read:cut site.]

In general, single-end Illumina data that employ a good size selection window will produce very good
and usable data by any of the library prep methods above. Problems tend to arise when the size
selection is too small (or non-existent), and especially if this coincides with paired-end sequencing.
Poor size selection has little effect on traditional RAD or on single-end ddRAD, as you can see above,
except to include adapter sequences in the reads, which can be filtered out. It has a more significant
effect on paired ddRAD, and a very significant effect on single or paired GBS data. This is because
short fragments will be sequenced from both ends in these types of data, leading to sequence overlap.
In paired ddRAD the first and second reads come in pairs, and pyRAD can test for overlap of these
reads using the merge overlap option. This will remove merged reads and write them to a separate file.
These can be analyzed separately from the non-overlapping reads if they make up a large proportion
of your data set. Recommendations for this type of analysis are made in the paired ddRAD tutorial. I
only recommend trying to rescue merged reads if they comprise a large proportion of your data.
For GBS data the problem is more difficult. In single-end data the Illumina adapter can ligate to either
end of a fragment, meaning short single-end fragments can overlap, leading to duplication in the
data set. pyRAD can correct for this by performing reverse-complement clustering. This is much
slower than normal clustering. It detects overlap of the reads and groups them into partially or fully
overlapping clusters that are then used to make consensus base calls to create merged contigs.
Recommendations for this type of analysis are made in the GBS tutorial.
For paired GBS data the merge overlap option can merge overlapping reads; however, the first and
second reads are completely interchangeable, so reverse complement clustering is also required in
this case.

6. Example Tutorials
See Example Tutorials at this page (http://www.dereneaton.com/software/pyrad)

7. Output data formats


sequence files
By default pyRAD will return the final data set and excluded loci in the easily readable .loci format.
(.loci file): human-readable individual loci file
(.excluded_loci): loci that did not pass step 7 filtering
These data can be transformed into a large number of additional formats, each listed by a single letter
in the params file:
a: (.alleles): .loci file with two alleles separated (diploid analyses only).
p: (.phy): non-interleaved phylip file used by raxml.
n: (.nex): interleaved nexus file used by BEAST.
snp files

s: (.snps): all SNPs in the assembled data set, see example.


v: (.vcf): variant call format of all SNPs.
sampled snp files
u: (.usnps): unlinked SNPs, a random sample of one SNP per locus.
k: (.str): Structure format version of unlinked SNPs file.
g: (.geno): Admixture format version of unlinked SNPs file.
population seq or snp files
And two more complicated output formats that allow pooling individuals into populations/clades using
the assignments in the params file. See instructions in the Example RAD tutorial.
t: (.treemix): TreeMix formatted unlinked SNPs file w/ pooled samples and filtered by pop
coverage.
m: (.migrate): migrate-n formatted loci w/ pooled samples and filtered by pop coverage.
Examples are best viewed in the RAD tutorial above, but are also described below.

*.loci
This is a custom format that is easy to read, showing each individual locus with variable sites
indicated. Custom scripts can easily parse this file for loci containing certain amounts of taxon
coverage or variable sites. It is also the most easily readable file for assuring that your analyses are
working properly. A (-) indicates a variable site; a (*) indicates the site is phylogenetically informative.
For paired-end data the two linked loci are shown separated by an 'xxxx' separator.
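
As an illustration of such a custom script, a minimal sketch that tallies taxon coverage per locus; it assumes each locus ends with a "//" line, as in the excluded-loci example below, and the path is just an example (adjust it to wherever your .loci file was written):

In [ ]:
# count how many samples have data at each locus in a .loci file
from collections import Counter

coverage = Counter()
ntaxa = 0
with open("outfiles/arabidopsis_c90d5m4p4.loci") as infile:
    for line in infile:
        if line.startswith("//"):        # "//" closes one locus
            coverage[ntaxa] += 1
            ntaxa = 0
        elif line.startswith(">"):       # one sample's sequence
            ntaxa += 1
print(coverage.most_common())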

*.phy
This is a phylip formatted data file which contains all of the loci from the .loci file concatenated into a
supermatrix, with missing data for any sample filled in with N's. This format is used in RAxML among
other phylogenetic programs.

*.nex
This is a nexus formatted data file which contains all of the loci from the .loci file concatenated into a
supermatrix, but printed in an interleaved format, with missing data for any sample filled in with N's,
and with data information appended to the beginning. This format is used in BEAST among other
phylogenetic programs.

*.alleles
This is the same as the .loci format but instead of ambiguity codes for the consensus sequences the
two alleles are printed out for each sample at each locus. Output of polyploid (>2) alleles is not
supported.

*.snps
This format lists only the variable sites in each locus, with a space between SNPs from different loci. An
underscore represents an invariable locus. Loci are in the same order as in the .loci file. Paired loci are
treated as a single locus, meaning SNPs from the two reads are not separated in this file (they're
linked).

*.unlinked_snps
Each column contains one SNP sampled from one locus. If there are multiple SNPs in a locus, the SNP
site containing the least missing data across taxa is sampled; if sites have equal amounts of missing
data, one is randomly sampled.
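
A sketch of the sampling rule just described (the site names are hypothetical; the seed is pyRAD's default from params line 28):

In [ ]:
import random
random.seed(112233)    # pyRAD's default seed (params line 28)

def sample_unlinked(snp_sites):
    """snp_sites: list of (column, n_missing) pairs for one locus;
    return a column with the least missing data, ties broken randomly"""
    least = min(n for _, n in snp_sites)
    return random.choice([col for col, n in snp_sites if n == least])

print(sample_unlinked([("snp_at_12", 2), ("snp_at_30", 0), ("snp_at_41", 0)]))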

*.excluded_loci
The excluded loci file is in the same format as the .loci file but shows the loci that were filtered out in
step 7. The filter is listed below each locus. %P indicates the locus was filtered as a paralog because
a heterozygous site was shared across more than the allowed number of samples. %D indicates
the locus had more than one sequence from the same individual. %S indicates there were more SNPs
detected in the locus than the maximum allowed number. %I means there were more indels in the
locus than the maximum allowed number.

In [ ]:
>A      CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>B      CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>C      CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCCCCCGCGATCT
>D      CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCACCGCMCCCGCGATCT
>E      CCRACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCACCCCCGCGATCT
>F      CCAACACAGAGGGCCAATAGACACGTAGCTCGAGATCTCAGCAACCCCGCGATCT
//                                              *  *-      |%P
>A      CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B      CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>C      CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>D      CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
>D      CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//                                                          |%D
>A      CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGACCGAGAACTAGGT
>B      CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>C      CGTACAATAGGATGCACCGTTCGTAAAGTTCTTGACGGTGGGACGGACGGACGGA
>D      CGTACAATAGGATGCACCGTTCGTCAAGTTGTTGACGGTGGACCGAGAACTAGGT
//                               **      ******* *         |%S
...

8. Hierarchical clustering
Clustering across samples (step 6 of the analysis) can be quite fast for data sets with few sampled
individuals, but for large studies (many hundreds of individuals) this step can sometimes take a very
long time (>1 week), depending on the total number of unique sequences. Hierarchical clustering splits
this job into multiple smaller jobs that each run on a separate processing core, allowing clustering of
large data sets many times faster.

(I no longer recommend excluding data through hierarchical clustering, since clustering times are much
shorter with vsearch; it is perhaps useful only for very, very large data sets.) The most significant speed
increase comes from excluding, at the first step of clustering, loci that have significant missing data
within a subgroup. By excluding singletons, or loci with little shared data, early in the clustering process,
time is not spent comparing them to all of the other more distantly related samples. This is particularly
relevant if you plan to use only loci in your final data set that recover data shared across many
individuals.
However, even without excluding data, hierarchical clustering can still be useful, particularly for studies
of highly divergent organisms. Because the input order of sequences can affect their clustering,
hierarchical clustering can better cluster very highly divergent sequences by making close relatives
cluster to each other first, and then cluster to more distant groups.

[Figure: the two steps of hierarchical clustering. In step 1, samples are clustered within their assigned
groups into stacks; in step 2, the resulting stacks are clustered across groups, with stacks below the
minimum group cluster size (labeled "stack-min2") excluded.]

The figure above illustrates the general concept. Individuals are grouped a priori into a small number of
clades or populations, in this case a, b, and c, through their assignment listed in the params file. The
assignment of individuals does not have to be perfectly correct in terms of knowing their true
phylogeny; rather, it is an approximation or heuristic, where those grouped together are expected to
share more data than those which are not grouped together.
In this case each clade has four sampled taxa. To group our samples into these three clades we assign
them to groups in the bottom lines of the params file.
Each group is listed on a separate line; it is OK to leave a line blank. Each line contains three
elements, each separated by spaces or tabs: (1) a group name (e.g., 'A' or '1'); these names are
arbitrary, just choose something different for each group. (2) The minimum size of a cluster within this
group; any cluster smaller than this will not be clustered across groups (1 = do not exclude any data).
(3) The names of samples/individuals in this group. Sample names can be individually listed comma
separated, or selected using a wildcard selector (e.g., *).
The sample names that this is matching to are the prefix names of the ".consens.gz" files located in
the clust directory. For example, imagine we have the samples below.
In [ ]:
sample1.consens.gz
sample2.consens.gz
sample3.consens.gz
sample4.consens.gz
sample5.consens.gz
sample6.consens.gz
sample7.consens.gz
sample8.consens.gz
sample9.consens.gz
Imagine also that we know samples 1-4 are individuals of species A, samples 5-8 are individuals of
species B, and sample9 is from a more distant outgroup species. We could cluster them using the
following settings:

In [ ]:
A 4 sample1,sample2,sample3,sample4
B 2 sample5,sample6,sample7,sample8
C 1 sample9
The minimum cluster size option tells pyRAD to only cluster loci across clades that have at least N
matches within a clade. In this way loci that are singletons, or shared by very few taxa, are not
clustered across clades. In the case of the outgroup sample in group C, we entered 1 for the minimum
size, and thus all reads from this sample are clustered with the other two clades. In contrast, we chose
4 for clade A, and 2 for clade B. This means only loci that have data for all four samples in clade A will
be clustered with clades B and C, and only loci with data from at least 2 individuals in clade B will be
clustered with clades A and C.
Following this logic, you can assign individuals to clades in a way that speeds clustering but also
maximizes the amount of data retained in your data set. If a clade has very few individuals, or those
individuals have less data overall than other samples, then you may want to set a low minimum
cluster size to discard as little data as possible.
There is an example tutorial implementing hierarchical clustering, which provides more detail, in the
Examples section above.
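
To check exactly which sample names pyRAD will try to match against your group assignment lines, you can list the consensus file prefixes. A sketch (it assumes a clustering threshold of .90, so the consensus files live in clust.90/; adjust to your own clust.xx directory):

In [ ]:
import glob, os

# the names used in group assignment lines are these file name prefixes
for path in sorted(glob.glob("clust.90/*.consens.gz")):
    print(os.path.basename(path).replace(".consens.gz", ""))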

9. FAQs
coming soon...

Using population assignments in step 7


For the migrate-n and treemix outputs, the minimum size option in the population assignments filters
out loci that have data for fewer than this number of individuals for that population. Only loci with data
for at least the minimum number of individuals across all populations are included in the data set.
