Navigating the tip of the genomic iceberg: Next-generation sequencing for plant systematics
Author(s): Shannon C. K. Straub, Matthew Parks, Kevin Weitemier, Mark Fishbein, Richard C. Cronn and Aaron Liston
Source: American Journal of Botany, Vol. 99, No. 2, Methods and Applications of Next-Generation Sequencing in Botany (February 2012), pp. 349-364
Published by: Wiley
Stable URL: https://www.jstor.org/stable/41415366
Accessed: 26-03-2019 21:10 UTC

This content downloaded from 181.231.29.59 on Tue, 26 Mar 2019 21:10:07 UTC
All use subject to https://about.jstor.org/terms
American Journal of Botany 99(2): 349-364. 2012.
Navigating the tip of the genomic iceberg:
Next-generation sequencing for plant systematics1
Shannon C. K. Straub2,5, Matthew Parks2, Kevin Weitemier2, Mark Fishbein3,
Richard C. Cronn4, and Aaron Liston2

2Department of Botany and Plant Pathology, Oregon State University, 2082 Cordley Hall, Corvallis, Oregon 97331 USA;
3Department of Botany, Oklahoma State University, 104 Life Sciences East, Stillwater, Oklahoma 74078 USA; and
4Pacific Northwest Research Station, USDA Forest Service, 3200 SW Jefferson Way, Corvallis, Oregon 97331 USA

• Premise of the study: Just as Sanger sequencing did more than 20 years ago, next-generation sequencing (NGS) is poised to
revolutionize plant systematics. By combining multiplexing approaches with NGS throughput, systematists may no longer
need to choose between more taxa or more characters. Here we describe a genome skimming (shallow sequencing) approach
for plant systematics.
• Methods: Through simulations, we evaluated optimal sequencing depth and performance of single-end and paired-end short
read sequences for assembly of nuclear ribosomal DNA (rDNA) and plastomes and addressed the effect of divergence on
reference-guided plastome assembly. We also used simulations to identify potential phylogenetic markers from low-copy
nuclear loci at different sequencing depths. We demonstrated the utility of genome skimming through phylogenetic analysis of
the Sonoran Desert clade (SDC) of Asclepias (Apocynaceae).
• Key results: Paired-end reads performed better than single-end reads. Minimum sequencing depths for high quality rDNA and
plastome assemblies were 40x and 30x, respectively. Divergence from the reference significantly affected plastome assembly,
but relatively similar references are available for most seed plants. Deeper rDNA sequencing is necessary to characterize
intragenomic polymorphism. The low-copy fraction of the nuclear genome was readily surveyed, even at low sequencing
depths. Nearly 160000 bp of sequence from three organelles provided evidence of phylogenetic incongruence in the SDC.
• Conclusions: Adoption of NGS will facilitate progress in plant systematics, as whole plastome and rDNA cistrons, partial
mitochondrial genomes, and low-copy nuclear markers can now be efficiently obtained for molecular phylogenetics studies.

Key words: Asclepias; chloroplast content; genome skimming; Illumina; next-generation sequencing; low-copy nuclear
gene; plant systematics; plastome; reference-guided assembly; ribosomal DNA cistron.

1 Manuscript received 19 July 2011; revision accepted 7 October 2011.
The authors thank W. Phippen (Western Illinois University) and T. Livshultz (Academy of Natural Sciences) for supplying leaf tissue; S. Lynch for providing DNA samples; M. Dasenko, M. Peterson, and C. Sullivan (Oregon State University Center for Genome Research and Biocomputing) for Illumina sequencing and data analysis support; T. Jennings and J. Swanson (USDA Forest Service) and L. Mealy and N. Nasholm (Oregon State University) for laboratory assistance; Z. Foster and K. Hansen (Oregon State University) for data analysis assistance; S. Meyers and T. Mockler (Oregon State University), M. Moore (Oberlin College), and D. Soltis (University of Florida) for access to unpublished data; and B. Knaus (USDA Forest Service) and L. Wilhelm (Oregon State University) for access to perl scripts. The authors acknowledge funding from a U.S. National Science Foundation Systematic Biology Program grant (DEB 0919583) to R.C.C., M.F., and A.L.
5 Author for correspondence (e-mail: [email protected])
doi:10.3732/ajb.1100335

American Journal of Botany 99(2): 349-364, 2012; http://www.amjbot.org/ © 2012 Botanical Society of America

For almost two decades, plant systematists have used DNA sequences to characterize and classify plant diversity. These studies have had a profound influence on our understanding of plant phylogenetic relationships and patterns of diversification. This progress is all the more remarkable when one considers that most of these studies used only hundreds to thousands of nucleotides, representing a tiny fraction of the millions to billions of base pairs in a land plant genome. Phylogenetic analyses using kilobase-scale sampling of the chloroplast (e.g., Parks, Cronn, and Liston, 2009; Givnish et al., 2010; Moore et al., 2010) and nuclear genome (e.g., Shulaev et al., 2011) have demonstrated the potential for larger nucleotide data sets to increase the resolution of, and support for, phylogenetic hypotheses. The increasing accessibility and affordability of next-generation sequencing (NGS) provides an impetus for plant systematists to transition from Sanger sequencing of a small number of loci to genome sequencing that provides access to kilobases of data from three organelles (chloroplast, mitochondrion, and nucleus) in a single run (Steele and Pires, 2011). The combination of the immense yield of the currently most cost effective platform (Illumina HiSeq 2000) with multiplexing approaches (Cronn et al., 2008; Meyer and Kircher, 2010) means that systematists may no longer need to choose between more loci or more taxa.

In contrast to the Sanger sequencing traditionally used for plant molecular systematics studies, NGS is quantitative. Thus, in addition to obtaining the primary nucleotide sequence, NGS approaches can provide a count of the number of times each base is sequenced. This concept of sequencing depth is central to the utilization of NGS data. If the target is a complete nuclear genome, sequencing depth can be estimated as the total number of base pairs obtained divided by the genome size. However, this calculation is only accurate for parts of the genome that are present in a single copy. Repetitive sequences comprise a substantial fraction of plant genomes, and the sequencing depth of a repeat will rise in proportion to its copy number in the genome. For example, transposable element content in sequenced plant genomes ranges from 14% in papaya (Ming et al., 2008)


to >75% in maize (Baucom et al., 2009). Likewise, […] depth of the plastid and mitochondrial genomes w[…] proportion in sequences obtained from total gen[…] to their relative abundance in total genomic DNA […] nuclear genome sequencing (Fig. 1) results in r[…] sequencing of plastome, mitochondrial genome, […] nuclear sequences. Although assembly of the co[…] genome requires deep sequencing (Schatz et al., 2[…]), shallow sequencing can be used to characterize […] nuclear loci for molecular systematics (Fig. 1; Straub […]).

Although various NGS technologies exist or are […] development, we have chosen to focus on the Illumina […] because it provides the lowest sequencing cost p[…] et al., 2010) and has straightforward sample pre[…]cols amenable to multiplexing and other custom[…] data from this platform accounts for the vast m[…] NGS data collected and deposited in short read […]positories to date (Leinonen et al., 2011). In […] describe our utilization of Illumina technology […] high copy fraction of the genome to obtain […]quences of nearly complete plastid genomes and […]somal DNA (rDNA), as well as kilobase po[…] mitochondrial genome. Genome skimming c[…] partial sequences of low-copy nuclear loci, suff[…]signing PCR primers or probes for hybridization […]nome reduction approaches (this volume: Cronn […]).

Through simulations, we determine the optimal sequencing […] and evaluate the performance of single-end and […]quence data sets for (1) the assembly of ribosomal […] plastome sequences and (2) the characterization […] low-copy nuclear loci. We also address the effect […] divergence from a reference for reference-guided […]sembly. We then demonstrate the utility of this […] plant molecular systematics in a phylogenetic […] "Sonoran Desert clade" (SDC) of Asclepias L. c[…]. In addition to phylogeny, the SDC example highlights […]tional insights gained from the quantitative […] through characterization of intragenomic rDNA […]phism. Finally, to help generalize our results i[…]dations for genomics-en[…] studies, we also surveyed […] 100 Illumina libraries we […]nomic DNA extractions.

MATERIALS AND METHODS

Genome skimming in silico - Data subset creation - A starting read pool consisted of 44489078 paired-end 80-bp Illumina reads (ca. 450-bp insert […] 1.3% rDNA, 23.8% cpDNA) from Asclepias syriaca L. obtained from a single lane of Illumina sequencing on the Illumina GAIIx for the Milkweed Genome Project (http://milkweedgenome.org; A. Liston, unpublished data). This corresponds to approximately 4.5x coverage of the A. syriaca nuclear genome, assuming a previously estimated genome size of 817 Mbp (Straub et al., 2011). To create the data subsets for the in silico genome skimming simulation, we used perl scripts (B. Knaus, USDA Forest Service, and L. Wilhelm, Oregon State University, unpublished) implementing the Fisher-Yates Shuffle algorithm (Durstenfeld, 1964) to make random draws from the starting read pool. Nine different sequencing depths were simulated (0.01x, 0.02x, 0.03x, 0.04x, 0.05x, 0.1x, 0.5x, 1x, 4x), either retaining the paired-end information or treating the reads as single-end. More subsets were made in the lower part of the sequencing depth range because we expected the minimum sequencing depth for success based on various metrics (see below) to fall within that range. Five subsamples for each sequencing depth-read type combination were made, for a total of 90 read pools for downstream analyses.

Testing for effects of sequencing depth and read type on assembly success - Burrows-Wheeler transformation implemented in the program BWA v. 0.5.7 (H. Li and Durbin, 2009) was used to map reads in the data subsets to the published A. syriaca plastome (GenBank JF433943) and rDNA cistron (GenBank JF31204[…]) sequences. The program SAMtools v. 0.1.7 (H. Li et al., 2009) was used to convert the SAM output from BWA to a SAMtools pileup and to produce a consensus sequence. The number of unambiguously called bases in the consensus sequence was determined, as was the number of bases in the consensus sequence with sequencing depth of five or greater. Differences in assembly success across sequencing depths based on these two measures were assessed for the single-end and paired-end sets separately using one-way ANOVA plus Tukey-Kramer honestly significant difference (HSD) tests to correct for multiple comparisons. To assess the effect of using single-end or paired-end data, we used a Wilcoxon signed rank test. Statistical tests were implemented in the program JMP 9.0.2 (SAS Institute, Cary, North Carolina, USA).

Assessing the utility of low coverage genomes for phylogenetic marker development - Ninety data sets were created using the read pool subsampling process described earlier with scripts that output fasta rather than fastq format and eliminate N-containing reads (B. Knaus, USDA Forest Service, unpublished data). The conserved orthologous sequence set for asterids (COSII; Wu et al., 2006), consisting of 1086 single copy genes, was used to determine the utility of these read subsamples for marker development for phylogenetic studies following the strategy outlined in Straub et al. (2011). Within each read type (single-end, paired-end), a one-way ANOVA followed by a Tukey-Kramer HSD to correct for multiple tests was used to determine the effect of sequencing depth on two metrics of suitability for primer design: the number of COSII loci with any hits and the number of loci with two or more hits. To assess the effect of single-end vs. paired-end data, a Wilcoxon signed rank test was used. Tests were implemented in JMP 9.0.2.
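The read-pool subsampling described under "Data subset creation" can be sketched as follows. This is a minimal illustration under stated assumptions, not the unpublished perl scripts used in the study: the function names `reads_for_depth` and `subsample_reads` are hypothetical, the number of reads per target depth is taken simply as depth times genome size divided by read length (ignoring the rDNA and organellar fractions of the library), and the draw is a partial Fisher-Yates shuffle in the form given by Durstenfeld.

```python
import random


def reads_for_depth(depth, genome_size_bp, read_length_bp):
    """Number of reads whose summed length equals depth x genome size.

    Assumption for illustration: every read counts toward nuclear depth.
    """
    return round(depth * genome_size_bp / read_length_bp)


def subsample_reads(reads, k, seed=None):
    """Draw k reads without replacement using a partial Fisher-Yates shuffle.

    Only the first k positions are shuffled, which is enough to obtain a
    uniform random sample. The input pool is copied, not modified.
    """
    if k > len(reads):
        raise ValueError("cannot draw more reads than the pool contains")
    rng = random.Random(seed)
    pool = list(reads)
    n = len(pool)
    for i in range(k):
        j = rng.randrange(i, n)  # pick from the not-yet-fixed tail, i <= j < n
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]


# Example: reads needed for 1x depth of an 817-Mbp genome with 80-bp reads.
n_reads = reads_for_depth(1, 817_000_000, 80)
```

To retain paired-end information, the pool passed to `subsample_reads` would hold (read 1, read 2) tuples so that mates are drawn together; to treat the data as single-end, it would hold individual reads.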

Fig. 1. The genomic iceberg: the relationship between […]gets, the sequencing depth required to obtain them, and […]ate method of sequence assembly.

Effects of divergence from the reference on plastome assembly - Along with the A. syriaca plastome (GenBank JF433943), three additional plastome references were obtained for testing: Nerium oleander L. (Apocynaceae; M. Moore, Oberlin College, and D. Soltis, University of Florida, unpublished data), Coffea arabica L. (Rubiaceae; GenBank EF044213.1), and Nicotiana sylvestris Speg. (Solanaceae; GenBank AB237912.1). Sequences were aligned in the program MAFFT v. 6.857 (Katoh and Toh, 2008) using default settings. The pairwise distances between the aligned plastome references were calculated in the program MEGA v. 5.05 (Tamura et al., 2011). Each sequence, including only one copy of the inverted repeat, was used for reference-guided assembly in the Alignreads v. 2.25 pipeline (Straub et al., 2011) for each of the five 0.1x sequencing depth single-end data subsets. The similarity setting for the YASRA (Ratan, 2009) component of the pipeline was set to "same" (>95% identity to


the reference) for the Asclepias reference and "medium" (>85% identity to the reference) for the other three references. For each of the three divergent references, assembled contig sequences were then aligned to the A. syriaca plastome sequence using the alternate-ref option in Alignreads. Consensus plastome sequences were masked at positions with sequencing depth less than five reads, and single nucleotide polymorphisms (SNPs) vs. the reference were masked if sequencing depth was not at least 25 with a call proportion of 0.8 (Whittall et al., 2010). Three regions prone to misassembly, two involving mononucleotide repeats and one involving the accD pseudogene (see Straub et al., 2011), were removed from each assembly prior to downstream analyses. Additional sites in the assembly consensus sequence were then masked by hand in the alignment editor BioEdit v. 7.0.5.3 (Hall, 1999) for the following three cases: (1) gaps in coverage of the reference due to gaps between contigs, (2) introduced ambiguity codes due to incorrectly assembled indels or SNPs between overlapping contigs, or (3) long stretches of sequence divergent from the reference present at contig ends. The amount of unmasked consensus sequence, number of contigs, longest contig, and N50 for each assembly were determined using BioEdit and a perl script, contig-stats.pl (available at http://milkweedgenome.org). Differences in these metrics were determined using a one-way ANOVA plus Tukey-Kramer HSD test to correct for multiple comparisons implemented in JMP 9.0.2.

Genome skimming for phylogenetics in the Sonoran Desert clade of milkweeds (Asclepias) - Sampling, DNA extraction, and Illumina sequencing - Eleven individuals from the six species of the Sonoran Desert clade of Asclepias (clade I plus Asclepias cutleri and A. leptopus; Fishbein et al., 2011), including one hybrid individual, and three individuals of two outgroup species (A. coulteri and A. macrotis) were sampled (Table 1). Genomic DNA was extracted and prepared for sequencing on the Illumina GAIIx platform following Straub et al. (2011) with the following modifications: (1) fragmented DNA was ligated to adapters carrying unique 4-6-bp "barcodes" (Craig et al., 2008; Cronn et al., 2008) to allow multiplexing, (2) reactions were done at one half the recommended volume, and (3) in several cases agarose gel-based size selection was not used. In the latter case, we relied on sonication to minimize large molecular weight fragments (>1000 bp) and purification with Agencourt AMPure (Beckman Coulter Genomics, Danvers, Massachusetts, USA) to minimize low molecular weight fragments (<200 bp), including adapter/primer dimer contamination (Lennon et al., 2010). AMPure was used both after ligation of adapters and after PCR enrichment at ratios of 0.7-1.1 : 1 AMPure to reaction volume. The specific ratio varied depending on the fragmentation profile of the library as determined by gel electrophoresis after sonication and after enrichment (i.e., higher ratios were used for relatively high molecular weight libraries, and vice versa). Libraries were quantified using a Qubit fluorometer (Invitrogen by Life Technologies, Carlsbad, California, USA) and multiplexed in equimolar ratios. Eighty base-pair single-end reads were sequenced on an Illumina GAIIx sequencer (Illumina, San Diego, California, USA) at the Center for Genome Research and Biocomputing at Oregon State University. Resulting data were analyzed using Illumina's Real Time Analysis v. 1.6 or 1.8 and CASAVA v. 1.6 or 1.7. Adapter reads were removed and barcoded reads sorted using perl scripts (Knaus, 2010).

The sequencing depths for rDNA and the plastid and mitochondrial genomes were determined using the BLAT assembler (Kent, 2002) with default parameters to determine the number of unique read hits to the A. syriaca plastome (GenBank JF433943) and rDNA (GenBank JF312046, JF312047) sequences and the five longest contigs from the A. syriaca mitochondrial genome assembly (Straub et al., 2011). The approximate sequencing depth for the nuclear genome was determined using the reads not identified as chloroplast or mitochondrial and the genome size of A. syriaca because the genome sizes of the Sonoran Desert clade species are unknown. All were assumed to be diploid because no polyploids have been identified in Asclepias (2n = 2x = 22).

Sequence assembly and masking - Assemblies for each individual were accomplished using the program Alignreads v. 2.25 (Straub et al., 2011). Whole plastomes were assembled using the A. syriaca reference (GenBank JF433943) with only one copy of the inverted repeat present and the same masking parameters as described already for the divergent reference tests. The Alignreads output was checked in BioEdit. Areas with multiple contig alignments indicating insertions or deletions vs. the reference were expanded or contracted to incorporate these differences. Read pileups for areas with many SNPs, introduced IUPAC ambiguity codes, or other features that indicated possible misassembly were checked in the program Tablet v. 1.11.05.03 (Milne et al., 2010). If any such problem was detected, the area was masked. Read pileups were also screened in Tablet for areas of very low sequencing depth (possible misassembly involving mitochondrial or nuclear pseudogenes) or very high sequencing depth (possible misassembled repeats) relative to the surrounding region, and problem areas were masked.

An iterative approach was used to obtain the rDNA cistron sequences for each individual of the SDC. First, reads from each individual were aligned against the A. syriaca reference (GenBank JF312046) with the external transcribed spacer and nontranscribed spacer trimmed. A low masking threshold was used to allow the consensus to differ from the reference (Alignreads option -w 1 0.5). The A. subulata 423 assembly resulted in a single contig with few ambiguous base calls, and so was chosen to act as a new SDC reference. Reads from each SDC individual were then aligned to this A. subulata reference in Alignreads with more stringent masking parameters (-w 5 -x 25 0.5).

Partial mitochondrial genomes were assembled using the five longest contigs from the A. syriaca mitochondrial genome assembly (Straub et al., 2011) as references with separate assemblies for each contig. Because of the expected presence of chloroplast DNA in plant mitochondrial genomes (Knoop et al., 2011), these contigs were screened for similarity to chloroplast sequences using BLAT, but none was detected. Because of the low sequencing depth, the mitochondrial assemblies were masked only for positions with less than five covering reads. Consensus sequences were trimmed to match the lengths of the references and masked where there were gaps between the contigs when aligned to the references and places where the contig sequences showed little similarity to the reference, likely indicating rearrangements.

Sequence alignment and phylogenetic inference - Consensus plastome sequences were then aligned using the program MAFFT v. 6.857 with default settings. Alignment for the mitochondrial sequences was trivial because of high sequence similarity and previous alignment to the references with the program NUCmer v. 3.07 (Delcher et al., 2002; Kurtz et al., 2004) as part of the Alignreads pipeline. The rDNA cistron sequences were aligned using local pair alignment with 1000 iterations in MAFFT v. 6.240. One ambiguous area where two contigs poorly overlapped in A. cutleri 382 required manual curation from the read pileup. Each alignment also contained the A. syriaca reference sequence used in the assembly process to serve as an additional outgroup. All alignments were corrected by eye. Areas of likely misassembly (polynucleotide runs, AT-rich repeats, the accD pseudogene), areas with inversions in some sequences, including differences due to the shifting boundary of the inverted repeat, and other difficult to align regions were removed from the plastome matrix prior to phylogenetic analysis. The matrices for the five mitochondrial references were concatenated into a single matrix using the program WinClada v. 1.7 (Nixon, 1999-2004) and sequences with more than 30% missing data were removed. Plastome and rDNA cistron sequences were submitted to GenBank (Table 1). Mitochondrial contig sequences are available at http://milkweedgenome.org. The numbers of variable and informative characters were calculated in MEGA 5.05. Each matrix was then analyzed separately using maximum parsimony (MP) and maximum likelihood (ML) optimality criteria.

Parsimony analyses were conducted in the program TNT v. 1.1 (Goloboff et al., 2008) using implicit enumeration (branch and bound) to find exact solutions and with uninformative characters deactivated. The resulting trees were viewed and characterized using WinClada v. 1.7. The program PAUP* v. 4.0b10 (Swofford, 2003) was used to perform 10000 branch and bound bootstrap replicates with all characters activated. Parsimony bootstrap support was calculated for the most likely tree topology from the ML analyses.

Maximum likelihood analyses were accomplished using the program RAxML v. 7.2.6 (Stamatakis, 2006) estimating sequence evolution using the GTRGAMMA model. Of the models of molecular evolution available in RAxML, we chose to use the full model, rather than those involving approximations of the GTR plus gamma model for increased computational efficiency, because our analysis could be run in a reasonable amount of time due to the small number of terminals. The chloroplast and mitochondrial genome matrices were unpartitioned, but the rDNA matrix was partitioned into ribosomal and spacer region sequence. Support was assessed with 10000 rapid bootstrap replicates (Stamatakis et al., 2008). Trees were viewed in FigTree v. 1.3.1 (Rambaut, 2006-2009).

Intragenomic rDNA polymorphism analysis - Read pools for each individual were filtered to remove reads with an average Phred quality score below 20 using the FASTX-Toolkit (Gordon, 2011). Within the remaining reads, any bases with a score below 20 were converted to 'N'. The quality filtered reads were aligned against their own rDNA reference using BWA 0.5.7. From this output, the proportion of reads at each position differing from the majority of read calls was calculated. Only positions with >2% of reads differing from the majority were considered polymorphic to minimize the impacts of sequencing errors, nuclear or mitochondrial rDNA pseudogenes, and potential f[…]phyte or fungal contaminant sequences (see Straub et al., 2011) […] >10% differing base calls were considered highly polymorphic.
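The polymorphism thresholds described for the intragenomic rDNA analysis (>2% and >10% of reads differing from the majority call) can be illustrated with a short sketch. It is not the authors' pipeline: the function name `classify_position` and the label strings are hypothetical, and excluding 'N' calls (the quality-masked bases) from the denominator is an assumption made here for illustration.

```python
from collections import Counter


def classify_position(base_calls, polymorphic=0.02, highly=0.10):
    """Classify one reference position from the base calls of its covering reads.

    Returns (label, minor_fraction), where minor_fraction is the proportion
    of calls that differ from the majority call at this position.
    """
    # Assumption: quality-masked 'N' calls carry no information and are skipped.
    calls = [b for b in base_calls if b in "ACGT"]
    if not calls:
        return "no data", 0.0
    majority_count = Counter(calls).most_common(1)[0][1]
    minor_fraction = (len(calls) - majority_count) / len(calls)
    if minor_fraction > highly:
        return "highly polymorphic", minor_fraction
    if minor_fraction > polymorphic:
        return "polymorphic", minor_fraction
    return "below threshold", minor_fraction


# Example: 97 reads call A and 3 call C at one position (3% minor calls).
label, fraction = classify_position("A" * 97 + "C" * 3)
```

In practice the base calls for each position would come from a read pileup of the BWA alignment; the strict "greater than" comparisons follow the ">2%" and ">10%" wording of the text.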
Table 1. [Table contents could not be recovered from the source.]

Chloroplast and rDNA content of Apocynaceae Illumina libraries - The proportions of plastid and rDNA in Illumina sequencing libraries from Apocynaceae tribe Asclepiadeae samples (A. Liston, unpublished data) and samples from other Apocynaceae tribes (S. Straub and T. Livshultz, unpublished data) were calculated using BLAT as described above (Appe[…] Supplemental Data with the online version of this article). Libraries were sequenced on the Illumina GAIIx or HiSeq 2000, and produced at l[…] 80-bp or 100-bp sequence reads. To account for divergence from the […] cpDNA and rDNA references, the estimated number of matching […] increased by 5% for Apocynaceae, excluding Asclepias and other […]deae, based on 0.046 pairwise divergence between the A. syriaca […] oleander plastomes. A t-test assuming unequal variances implemented […] 9.0.2 was used to test for differences between the values observed for […]deae and those observed for the rest of Apocynaceae.

RESULTS

Genome skimming in silico - Effects of sequencing depth and read type on assembly success - Nuclear genome sequencing depths of 0.01x to 4x resulted in rDNA and plastome sequencing depths also spanning two orders of magnitude (Fig. 2). Sequencing depth had a significant effect on both the number of unambiguously called bases (rDNA single-end, F8,36 = 34.7005, P < 0.0001; rDNA paired-end, F8,36 = 7650.309, P < 0.0001; cpDNA single-end, F8,36 = 57.0139, P < 0.0001; cpDNA paired-end, F8,36 = 205.1690, P < 0.0001) and amount of unmasked sequence based on the presence of five or more reads covering the position in the consensus sequence (rDNA single-end, F8,36 = 149.1947, P < 0.0001; rDNA paired-end, F8,36 = 15.4803, P < 0.0001; cpDNA single-end, F8,36 = 1196.508, P < 0.0001; cpDNA paired-end, F8,36 = 2749.501, P < 0.0001) for both the rDNA and plastome references (Fig. 2). For the rDNA assemblies, ca. 38x sequencing depth was sufficient for obtaining a similar number of unambiguously called bases as the higher sequencing depth assemblies, while only ca. 25x sequencing depth was required to have a similar number of bases with at least 5x sequencing depth as the higher sequencing depth assemblies. For the chloroplast assemblies, ca. 30x sequencing depth provided assemblies with a similar number of unambiguously called bases at 5x minimum sequencing depth as the higher depth data subsamples. Note that duplicate reads were not removed from the data set. While this would have eliminated PCR duplicates, biological duplicates would also have been eliminated. Removal of duplicate reads would slightly increase the number of sites with at least 5x sequencing depth and slightly decrease the number of unambiguously called bases (S. Straub, unpublished data).

When all sequencing depths were considered, paired-end read data sets significantly increased the percentage of the reference sequence assembled for both the rDNA and plastome references as measured by the percentage of unmasked bases (rDNA, S = 370.50, P < 0.0001; cpDNA, S = 359.50, P < 0.0001). Use of paired-end information also significantly increased the number of unambiguous base calls for the chloroplast reference (S = 409.00, P < 0.0001), but the single-end data sets significantly increased this value for the rDNA reference (S = -374.00, P < 0.0001). For the 0.5x and 1x paired-end subsamples, greater than 100% of the plastome reference was assembled because the paired-end information allowed assembly of small insertions of total length greater than small deletions

Fig. 2. Assembly success for Asclepias syriaca rDNA and plastome reference sequences at different sequenc[…] (A) rDNA reference, single-end read data subsamples. (B) rDNA reference, paired-end read data subsamples. (C) Plastome reference, single-end read data subsamples. (D) Plastome reference, paired-end read data subsamples. Simulated nuclear sequencing depth is given on the horizontal axis. […] and standard deviation for the five data subsamples at each depth point for each high copy target is given in par[…]ing depth values. Dark gray bars: percentage of the assembled reference consisting of unambiguous base calls. Lig[…] reference with five or more reads supporting each base call (i.e., percentage of unmasked sequence). Error b[…] among subsets sharing a letter are not statistically significant after correction for multiple comparisons.

detected in this individual compared to the reference plastome (Fig. 2D).

Assessing the utility of low coverage genomes for phylogenetic marker development - Significantly more COSII loci overall (single-end, F8,36 = 7669.66, P < 0.0001; paired-end, F8,36 = 5737.84, P < 0.0001) and COSII loci with two or greater hits per locus (single-end, F8,36 = 11034.08, P < 0.0001; paired-end, F8,36 = 15576.86, P < 0.0001) were hit with each increase in sequencing depth (Fig. 3), except in the case of loci with two or more hits per locus for the step between the 0.01x and 0.02x sequencing depths with single-end reads (P = 0.2471). The single-end data sets produced more COSII hits (S = 198.50, P = 0.0112), but there was no difference in the number of loci with two or more hits for single-end vs. paired-end data (S = -7.500, P = 0.9291).

Effects of divergence from the reference on plastome assembly - Pairwise divergences between Asclepias and Nerium, Coffea, and Nicotiana were 0.046, 0.098, and 0.108, respectively. Divergence from the reference had a significant effect on assembly success in terms of the percentage of the plastome assembled and number of contigs in the assembly, but the lengths of the longest contig and N50 were unaffected by divergence beyond the conspecific comparison (Fig. 4). Regions that were often absent in consensus sequences derived from divergent references corresponded to the most quickly evolving regions of the chloroplast genome including intergenic spacers (e.g., trnH(GUG)-psbA, rpl32-trnL(UAG), rps16-trnQ(UUG), ndhC-trnV(UAC), ndhF-rpl32), introns (e.g., rpl16, ndhA), and ycf1 (a putative pseudogene in Asclepias). Apart from unassembled regions, few differences were observed among the assembly consensus sequences, nearly all of which were length differences of 2-4 bp associated with mononucleotide repeats.

Genome skimming for phylogenetics in the Sonoran Desert clade of milkweeds (Asclepias) - Illumina sequencing, sequence assembly, and alignment - Sequencing was successful for all 14 genomic libraries with input DNA amounts as low as

354 American Journal of Botany [Vol. 99

Fig. 4. Asclepias syriaca plastome assembly success using reference sequences of varying divergence. Assembly qualities were compared using the following metrics: amount of sequence assembled relative to the reference, number of contigs, longest contig, and N50. The N50 value indicates that 50% of the total assembly is in contigs of this length or greater. Error bars show standard deviations. Differences among subsets sharing a letter are not statistically significant.
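
The N50 definition given in the Fig. 4 caption can be illustrated with a short calculation. The following is a minimal Python sketch under stated assumptions, not the script used in the study; the function name `n50` and the example contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the contig length L such that contigs of length >= L
    together contain at least 50% of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    lengths = sorted(contig_lengths, reverse=True)  # longest contigs first
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp), for illustration only.
example_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_contigs))  # total = 32, half = 16, reached by the two 8-bp contigs -> 8
```

Sorting from longest to shortest and accumulating until half of the total is reached mirrors the wording of the caption directly; a fragmented assembly (many short contigs) therefore yields a smaller N50 than a contiguous one of the same total length.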

Fig. 3. Number of nuclear conserved orthologous set II ([…] detected among Asclepias syriaca Illumina microreads at […]quencing depths. (A) Single-end read data subsamples. ([…] read data subsamples. Dark gray bars: total number of loci h[…] bars: number of loci with two or more microread hits. Error […] standard deviations.

Table 2. Sequence obtained from and variability observed in the three genomes of the Sonoran Desert clade of Asclepias.

                  No. of                  All samples        Ingroup only
Genome            sequences  Characters   Variable   PICs    Variable   PICs
Nuclear (rDNA)    15         5878         150        106     72         50
Chloroplast       15         130440       2069       1157    1045       730
Mitochondrial     11         23663        232        50      71         15
Totals                       159981       2541       1313    1188       795

Note: PICs = parsimony informative characters.

83 ng (Table 1). The total nuclear sequencing depths ranged from 0.07-0.3x, rDNA sequencing depths from 53-636x, chloroplast sequencing depths from 56-300x, and mitochondrial sequencing depths from 3-21x (Table 1). Nearly complete sequence assemblies were obtained for the plastome and rDNA cistron for all individuals. The percentage of masked bases in the plastome assemblies ranged from 0.01 to 1.63% (median 0.21%). The percentage of masked bases in the rDNA assemblies ranged from 0 to 0.99% (median 0.15%). The mitochondrial sequence assemblies were much less complete due to the low sequencing depth obtained for most individuals. Additional sites not masked for that reason had to be masked due to sequence rearrangements vs. the A. syriaca reference. The percentage of masked bases in the mitochondrial genome assemblies ranged from 1.38 to 83.71% (median 14.41%). Four sequences were removed from the matrix because they contained >30% missing data, considering both gaps and masked bases. With these sequences removed, the median percentage of masked bases was 6.25%. In total, the three matrices contained 159981 characters, 0.5% of which were parsimony informative in the ingroup (Table 2; Appendices S2-S4, see Supplemental Data with the online version of this article).

Phylogenetic inference - Parsimony and maximum likelihood analyses of the plastome matrix resulted in one most parsimonious (length [L] = 2207, consistency index [Ci] = 94, retention index [Ri] = 94) and most likely tree (lnL = -194692.00399) of nearly identical topology and high bootstrap support (Fig. 5A). The analysis of the rDNA matrix resulted in five most parsimonious trees (L = 175, Ci = 90, Ri = 94) and a most likely tree (lnL = -9260.123917), the topology of which did not conflict with the strict consensus of most parsimonious trees (Fig. 5B). The analysis of the mitochondrial DNA matrix resulted in four most parsimonious trees (L = 251, Ci = 93, Ri = 81) and a most likely tree (lnL = -39012.221073) with multiple topological conflicts with the strict consensus of the most parsimonious trees (Fig. 5C). In the strict consensus of most parsimonious trees, the positions of A. cutleri and A. leptopus were reversed and had high bootstrap support, 100% and 98% respectively. The parsimony analysis also indicated strong support (bootstrap 99%) for the sister group relationship between A. albicans 422 and A. subulata 411.

Intragenomic rDNA polymorphism analysis - Every individual analyzed was at least somewhat polymorphic at many positions and highly polymorphic in at least a few positions. Low proportion polymorphisms (proportion of reads differing from the consensus 0.02 < N < 0.10) ranged from 44-361 polymorphic sites per individual, with a mean of 156 polymorphic sites (95% CI 104-208). Highly polymorphic sites (proportion of differing reads >0.10) ranged from 4-84 per individual, with a mean of


15.8 highly polymorphic sites (95% CI 4.9-26.7).
rDNA cistron polymorphic sites were more abunda[…] regions (χ2 = 6.28, df = 1, P = 0.012), with 32.8% of […] the spacer regions and 26.6% of coding positions pol[…] at least one individual. Highly polymorphic position[…] more skewed toward abundance in the spacer regions ([…] df = 1, P < 0.0001), with 9.0% of positions in the sp[…] and 2.9% of coding positions highly polymorphic in […] individual (Fig. 6). Twenty-six positions were highl[…]phic in more than one individual. At 18 of these po[…] individuals that were highly polymorphic were either […] species or are sister species in the rDNA phylogeny […]

Chloroplast and rDNA content of Apocynacea[…] libraries - In 80 Apocynaceae libraries, the averag[…] of cpDNA sequence was 10.4% (SD 6.0) and the a[…]centage of rDNA was 1.0% (SD 0.5); however, a wid[…] individual variation was observed (Fig. 7). The mea[…]age cpDNA (11.4% SD 5.7) and rDNA (1.1% SD 0[…] in the Asclepiadeae (online Appendix S1) are both s[…] greater (cpDNA, t = 7.80 df = 30.53 P < 0.0001; rD[…] df = 49.72 P < 0.0001) than the values observed […] from other Apocynaceae tribes (cpDNA = 3.66% SD […] = 0.43% SD 0.2).

Fig. 5. Phylogenetic relationships among species of the Sonoran Desert clade (SDC) of Asclepias (Apocynaceae). (A) Phylogenetic relationships based on whole plastome sequences. The asterisk indicates the loca[…] the single topological difference between the maximum likelihoo[…] tree presented and the single maximum parsimony (MP) tree, whe[…] branching order of A. cutleri and A. leptopus is reversed. (B) ML […] depicting phylogenetic relationships based on rDNA cistron sequ[…] (C) ML tree depicting phylogenetic relationships based on partial […]chondrial genome sequences. For (A-C), numbers above branches r[…]sent ML/MP bootstrap support >50% for the ML topology. Sca[…] represent the number of substitutions per site.

DISCUSSION

Genome skimming in silico - Effects of sequencing dep[…] and read type on assembly success - This simulation s[…] helped address the relationship between sequencing depth […] assembly success for high copy portions of the genome, r[…]sented by the rDNA cistron and plastome, and whether h[…] positional information from paired-end reads increases th[…]fectiveness of these assemblies. The key value for each of […] combinations is the sequencing depth of the high copy tar[…] rather than the overall sequencing depth, as the latter w[…] variable for every sequencing library. While a near-plate[…] assembly success was reached in all cases between ca. 25[…] sequencing depth (Fig. 2), there were important difference[…]tween rDNA and chloroplast data sets and between single[…] and paired-end treatments in each data set. The rDNA as[…]blies, for instance, required substantially higher sequenc[…] depth in single-end compared to paired-end treatments to […] maximal assembly success, although single-end treatments […] significantly more unambiguous base calls. In contrast, in […] single-end and paired-end cpDNA assemblies the minimu[…]quencing depth for a high quality assembly was around 30[…] paired-end assemblies produced both more positions with […] coverage and more unambiguous base calls. Our results h[…]


Fig. 6. Number of individuals (N = 15) that are polym[…] shaded. Numbers directly below the region labels indica[…] individuals with a proportion of reads differing from t[…] the consensus >0.10.

are generally reflective of those of Li and Homer (2010), whose simulations supported the use of paired-end reads to reduce the number of incorrectly mapped reads and incorrectly called SNPs. Our results further indicate, however, that not only sequencing depth and read type, but the identity of the genomic target, will also influence the success of single-end or paired-end read assemblies.
Contrary to expectation, the largest data subsets for both rDNA and cpDNA suffered a decline in quality measured by a decrease in the amount of unambiguously called bases, which was more pronounced in the paired-end sets. For the rDNA subsets, this is likely due to rDNA polymorphism within or among rDNA arrays being sequenced at high enough depth that it reaches the threshold for ambiguous calls in BWA. Although heterozygosity due to heteroplasmy is not generally expected among plastome copies, the presence of mitochondrial and nuclear pseudogenes of chloroplast origin can cause apparent heteroplasmy. Incorporation of these reads only becomes a problem at higher sequencing depths because it is only then that these pseudogenes have enough sequencing depth themselves to become readily misincorporated into assemblies. This is likely to be a problem in most plant groups as transfer of chloroplast DNA into the mitochondrial and nuclear genomes is a well-documented phenomenon (Arthofer et al., 2010; Knoop et al., 2011). That the effect is larger in the paired-end sets is likely due to reads containing the polymorphisms being mapped because their mates are mapped with high confidence, whereas single-end reads differing too much from the reference remained unmapped (Li and Homer, 2010).

Effects of divergence from the reference on plastome assembly - Plastome sequence assembly from Illumina short read data are especially amenable to reference-guided assembly (e.g., Givnish et al., 2010; Straub et al., 2011). Issues that will affect the applicability of this method for any particular study include availability of reference plastome sequences and divergence from the nearest available reference. The divergent reference test conducted for Asclepias indicates that even with a distantly related reference, Nicotiana (0.108 pairwise sequence divergence), more than 90% of the plastome can be assembled with minimal errors. Use of an Apocynaceae reference (Nerium) or Gentianales reference (Coffea) did not make a significant difference in how much plastome sequence was assembled, even though both performed significantly better than the most divergent reference. As expected, the conspecific reference performed the best with 99.5% of the plastome assembled. Some of the unassembled portion corresponds to insertion/deletion polymorphisms between the two A. syriaca individuals, while the rest corresponds to sequence that cannot be correctly assembled from short read data due to chloroplast-mitochondrial pseudogene interaction or repeats (Straub et al., 2011). These results are similar to those obtained in other studies. Nock et al.


(2011) determined that the rice plastome was a sufficient reference for reference-guided assembly of other grass plastomes. Use of the conspecific and confamilial references did reduce the total number of contigs obtained, which is desirable for ease of downstream analyses of the assemblies. The other measures of assembly quality and success, longest contig and N50, did not show any difference between the divergent references tested. It should also be noted that use of 80-bp reads greatly improved assembly success over that achieved with shorter reads (40 bp) in a previous study (Straub et al., 2011).
Reference availability is not likely to be a problem for most studies. Given the plastome sequences currently available in GenBank, the majority of angiosperm (89%) and gymnosperm (99.9%) species belong to orders with one or more references. Choosing the least divergent reference from the study group will produce the best plastome assemblies (see also Parks et al., 2010). It is noteworthy that the regions most likely to be absent in NGS assemblies from divergent references (e.g., intergenic spacers) are the same highly variable regions that have become very popular in molecular systematics studies (Shaw et al., 2007). The loss of these regions in NGS assemblies is more than compensated by a large increase in sequence data and the avoidance of alignment difficulties often associated with these hypervariable loci. As NGS read lengths continue to increase, the ability to assemble such regions will also likely improve.
The results of this study highlight that the use of divergent references, such as in the Asclepias assembly with the Nicotiana reference, can provide assembly of greater than 90% of the plastome, which should be sufficient for most phylogenetic applications. Regardless of the divergence from the reference, some changes, such as shifting of the inverted repeat boundary, will typically require hand curation. Additionally, it should be noted that reference-guided assembly is not tenable for highly rearranged plastomes (e.g., Guisinger et al., 2011), which require de novo assembly.

Fig. 7. Scatterplot of plastid and rDNA content of Apocynaceae Illumina libraries with a sampling focus on Asclepias and other Asclepiadeae.

Assessing the utility of low coverage genomes f[…]netic marker development - Even at the lowest si[…]quencing depths, reads originating from single-co[…] loci were detected. Increasing sequencing depth in[…] number of loci hit and numbers of loci with more than two hits to facilitate primer or probe design for phylogenetic marker development. Having single-end reads resulted in more genes hit overall, which would be expected due to the close physical association of paired-end reads decreasing the likelihood of mates hitting different genes. This same close physical relationship of the paired-end reads leads to the prediction that paired-end read pools would have more loci with two or more hits per locus due to the short physical distance separating the reads. However, this relationship was not observed. If this effect is present, it may only be important at lower sequencing depths because the pool of targets is finite. Additionally, the differences observed for this metric for the single-end and paired-end read pools were small and with only five data subsamples per sequencing depth, the power for this analysis was too low to detect if these were significant.
Primer or probe design for COSII candidate loci could be considered an added benefit gained from genome skimming for plastome sequences. In the example presented here, targeting of the plastome at ca. 30x depth would also yield the hits required for primer design for ca. 50 COSII candidate loci. If intron sequence is desired, the locations of the read hits could be compared to the intron locations in each candidate gene in Arabidopsis (DC.) Heynh. to design primers to specifically target them (Straub et al., 2011).

Utility of low coverage genome sequencing for phylogenetics: An example from the Sonoran Desert clade of milkweeds (Asclepias) - The SDC consists of four core species of shrubs (Asclepias albicans, A. masonii, A. subaphylla, and A. subulata; Fishbein et al., 2011) endemic to the Sonoran Desert and two species found in peripheral areas on the western slope of the Sierra Madre Occidental and the Colorado Plateau (A. cutleri and A. leptopus; Fig. 8). The clade is of interest because of the striking discordance between molecular phylogenetic results and a classification based primarily on floral morphology (Woodson, 1954), which proposed distant relationships among A. cutleri, A. leptopus, A. subulata, and a taxon including A. albicans, A. masonii, and A. subaphylla.

Molecular phylogenetics of the SDC - Phylogenetic resolution, support, and topology differed across whole plastome, whole rDNA cistron, and concatenated contigs from the mitochondrial genome (Fig. 5). In the plastome phylogeny (Fig. 5A), the strong support for the monophyly of the core SDC is congruent with the result of Fishbein et al. (2011). Within the core SDC, none of the plastome sequences of the three species for which two accessions were sequenced were recovered as sister to each other, making direct comparison to the results of Fishbein et al. (2011), in which only a single accession was sequenced per species, difficult. Nonetheless, incongruence between the studies is apparent. Fishbein et al. (2011) found A. subulata/A. masonii and A. albicans/A. subaphylla to be sister-species pairs, but neither of these pairs was recovered here. Notably, two of four core SDC accessions used by Fishbein et al. (2011) were also sampled in the present study. Thus, conflicting phylogenies of the plastid genome in the core SDC likely derive from (1) nonidentical sampling, (2) nonrepresentative signal in the regions used by Fishbein et al. (2011), and/or (3)


the effect of additional within-species sampling, perhaps involving incomplete lineage sorting. The remaining members of the SDC, A. cutleri and A. leptopus, were resolved by Fishbein et al. (2011) with A. leptopus well supported as sister to the core SDC. This relationship was recovered in the ML phylogeny of whole plastomes with very weak support. The internal branch uniting the core SDC with either A. cutleri or A. leptopus is considerably shorter than any other branch, including within species. This result suggests a rapid diversification at the origin of the SDC and demands an intensive multilocus approach to resolving this key node. The additional nuclear loci that we obtained through our genome skimming approach should be useful to this pursuit in the future.

Fig. 8. Geographic distribution of the Sonoran Desert cla[…]pias. Labeled dots indicate the collection localities for th[…]cluded in this study. Note that the range of A. albican[…] completely within the broader range of A. subulata. The un[…] graphic was produced at the planiglobe map server website […] planiglobe.com and is licensed under a Creative Commons A[…] License.

The rDNA phylogeny (Fig. 5B) is congruent in broad strokes to the plastome phylogeny, with important conflicts within the monophyletic core SDC. The two accessions of A. subaphylla are strongly supported as sisters, strongly conflicting with the cpDNA data and suggesting the existence of either incomplete lineage sorting of plastome variants in the species or plastome introgression. The latter seems more likely because of the small extant populations of A. subaphylla, and the smaller effective population size of the plastome relative to nuclear loci. There is also moderate support for the monophyly of all core species to the exclusion of A. masonii, again strongly conflicting with the plastome phylogeny. Placement of A. masonii as the earliest-diverging lineage in the core SDC could have important implications for the biogeography of the Baja California peninsula, because it would be consistent with the persistence of an early-evolving desertic lineage on barrier islands of the Pacific coast. Parsimony and likelihood analyses both weakly support the sister relationship of A. cutleri to the core SDC, as favored by the MP plastome phylogeny, but in conflict with the ML plastome phylogeny and Fishbein et al. (2011).
Phylogenetic analysis of concatenated mtDNA contigs provides novel results (Fig. 5C); however, these are mostly very weakly supported. Monophyly of the core SDC is again strongly supported, but strong support within the core clade is found only at conflicting nodes between the MP and ML analyses. As with plastome and rDNA data, basal relationships of the SDC are problematic. Parsimony places A. leptopus as sister to the core clade, but A. cutleri is found (bootstrap <50%) in this position in the ML tree. Notably, the monophyly of the SDC is questioned by the mtDNA data (Fig. 5C). This mitochondrial evidence for nonmonophyly of the SDC is intriguing because precise phylogenetic relationships depend on analytic method and there are no morphological data supporting a close relationship of A. macrotis to core SDC species. Additional scrutiny of this sample of the mitochondrial genome for artifactual phylogenetic signal is warranted.
These results demonstrate how NGS can provide large amounts of data from all three plant genomes for phylogenetic studies (Tables 1, 2). However, the amount of incongruence observed in this study and in comparison to previous work (Fishbein et al., 2011) for a small clade (six species), with only limited within-species sampling, focuses attention on the prediction that large amounts of data will not necessarily provide a decisive picture of evolutionary history due to gene tree/species tree conflicts (Degnan and Rosenberg, 2009; Blair and Murphy, 2011). Interlocus incongruence will likely become only more apparent as we further explore the phylogeny of this group through increased sampling of individuals and single copy nuclear genes working toward a species tree for the SDC and the other 115+ species of Asclepias. However, we have not yet taken advantage of approaches to coalescent modeling of phylogenies that may strengthen the inference of species trees by assessing the probabilities of incongruent gene trees (Edwards, 2009).

Intragenomic rDNA polymorphism in the SDC - With the sequencing depth afforded by our approach to genome skimming, we detected considerable polymorphism among rDNA cistron copies within individuals. Because fragments are read independently, rather than collectively as in Sanger sequencing, differences among reads at homologous positions can be easily quantified. These data provide information on mutational and evolutionary processes in high-copy regions and can reveal cryptic, possibly phylogenetically relevant, diversity within and among target taxa. Here, we have quantified polymorphism among the ~1800 copies of rDNA present in Asclepias (Straub et al., 2011). Not only does genome skimming reveal levels and patterns of rDNA polymorphism, it also reinforces the caveats associated with using these sequences for barcoding efforts and phylogenetic analyses (Álvarez and Wendel, 2003; Feliner and Rossello, 2007).
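
The per-site quantification described above can be sketched in a few lines. This is an illustrative Python sketch, not the pipeline used in the study: the thresholds (0.02 and 0.10) are those reported in the Results for low-proportion and highly polymorphic sites, while the function names, the input format (read depth and number of reads matching the consensus at each site), and the treatment of a proportion of exactly 0.10 as "low" are assumptions made here for illustration.

```python
from collections import Counter


def classify_site(depth, consensus_reads, low=0.02, high=0.10):
    """Classify one aligned position by the proportion of reads
    that differ from the consensus base call.

    Returns "high" (proportion > high), "low" (low < proportion <= high),
    or "none". Assigning exactly `high` to "low" is an assumption.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= consensus_reads <= depth:
        raise ValueError("consensus_reads must be between 0 and depth")
    differing = (depth - consensus_reads) / depth
    if differing > high:
        return "high"
    if differing > low:
        return "low"
    return "none"


def tally_sites(sites):
    """Count sites per class for an iterable of (depth, consensus_reads)."""
    return Counter(classify_site(d, c) for d, c in sites)


# Hypothetical per-site counts: (read depth, reads matching the consensus).
example_sites = [(1000, 990), (1000, 950), (1000, 800), (500, 500)]
print(tally_sites(example_sites))
```

Because each read is an independent observation, the proportion is computed per position from raw read counts; at low sequencing depth the same thresholds would be much noisier, which is one reason the high depth of the rDNA cistron matters for this analysis.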


Levels of rDNA polymorphism within lar the to SDC


the rangetaxaof variation
are reported here for Apocynaceae
(0.7-29.5%;
comparable to previous studies in other organisms Fig. 7). It
(Ganley is noteworthy that values below 3.9%
and
werestrongly
Kobayashi, 2007; Stage and Eickbush, 2007). The found onlyuneven in very young leaves (<10% mature size),
distribution of highly polymorphic positions whereas
across leavesthe>50%rDNA of the mature size had >15.4% plastid
DNA (Rauwolf
cistron relative to the much more uniform (though stillet al., 2010). On the other hand, as leaves se-
signifi-
nesce, cpDNA content
cantly skewed) distribution of low-level polymorphisms sug-typically declines (Rowan et al., 2009).
gests that polymorphisms can be tolerated at Thus,
mostto maximize
positions the yield
at of plastid DNA in a sample, healthy
low levels, but highly polymorphic positions recently
within matured
coding leaves are recommended for DNA extraction.
re-
gions may be selectively disadvantageous. Optimizing sequencing protocols for the expected amount of
We also discovered intriguing patterns oftargetrDNA DNA in an Illumina library is complicated because ge-
polymor-
phism among individuals. Most instances of nomic
positionslibraries will were
that include variable amounts of both target
highly polymorphic in multiple individuals DNAwere andfound
adapter/primer
withincontamination dependent on the study
species or in sister taxa. The presence of organism,
homologous the sourcerDNA tissue, and the method and efficiency of
positions that tend to vary within individualslibraryof preparation.
the same Theorproportion of adapter/primer DNA pres-
ent in a library can
closely related species has several possible explanations: they be estimated prior to sequencing through
quantitative electrophoresis
may be the result of retained ancestral polymorphisms; they (Quail et al., 2008). The copy-num-
may indicate positions with inherently higher ber mutation
ratio of target to nontarget
rates in genomic DNA may be quantified
some species relative to others; or they may through
indicate qPCR (Castle et al., 2010; Givnish et al., 2010; Lutz
selectively
neutral mutations that are more likely in someet al., 201 1) and, when
species than combined with estimates of total genome
others, especially at low frequencies within size,
thecould be used to estimate the proportions of target and non-
genome.
target DNA in genomic pools. In cases where qPCR technology
Chloroplast and rDNA content of Apocynaceae is not available Illumina
or is cost-prohibitive, it may be best to conserva-
libraries - Plastid DNA content variation intively estimate the proportion
Apocynaceae en- of target DNA as 5% or less of the
genomic DNA in
compasses the range observed in species representing a library. For example, in this study, 80% of
diverse
taxonomie groups, genome sizes, and growth sampled
forms Apocynaceae
(Table had 3). >5% cpDNA, and 60% of the other
Although no trends are apparent in this small seed plants
sample assessed
of seed had >4% cpDNA. Therefore, when the
plants, our focused sampling of Apocynaceae amount
indicatesof cpDNA is unknown, using 4-5% as a conservative
signifi-
cant differences in plastid (and rDNA) contentestimate in multiplex calculations
between the (discussed later) may be rea-
sonable. A R.Br.,
Asclepiadeae (represented by Asclepias , Calotropis second factorand to consider is the sequencing depth
desired,
Pergularia L.) and 1 1 genera representing seven which tribes
other ultimately of depends on the aims of the project.
Apocynaceae (Fig. 7, online Appendix S1). The reason for the relatively low plastid and rDNA content observed in Apocynaceae outside of Asclepiadeae remains unknown. Determining whether this reflects a phylogenetic pattern, a correlation with growth form (primarily herbaceous vs. woody), ecology (warm temperate-subtropical vs. subtropical-tropical), or some other factor, such as the developmental stage of the tissue at the time of collection, will require much greater sampling.

Next-generation sequencing for genome-enabled systematics studies: Considerations and recommendations - Considerations for genome skimming in plants - As demonstrated here and in other studies (Givnish et al., 2010; Meyers and Liston, 2010; Nock et al., 2011; Steele and Pires, 2011; Straub et al., 2011), information about the high-copy fraction of plant genomes (rDNA, chloroplast and mitochondrial genomes) is obtained relatively easily through Illumina sequencing of genomic DNA library preparations. These regions can be sampled from a large number of species at a low cost using genome skimming in combination with multiplexing strategies for massively parallel sequencing (e.g., Cronn et al., 2008; Meyer and Kircher, 2010; this volume: Cronn et al., 2012). Targeting of the plastome and rDNA, standards of molecular systematics studies for the last 20 years, requires consideration of their proportions in a genomic DNA extraction and subsequent Illumina library. The amount of rDNA present in a sample will be based on the number of rDNA repeats in the sampled genome, a quantity known to vary even at the population level and ranging from hundreds to tens of thousands of copies (Rogers and Bendich, 1987). The amount of plastid DNA per cell is even more highly labile, varying among organs and developmental stages. A recent study of sugar beet (Rauwolf et al., 2010) observed a range of variation (0.4-22.4%) in developing leaves that is very simi-

Although base-call accuracy for next-generation platforms is improving, error rates are still higher than for traditional Sanger sequencing (Shendure and Ji, 2008; Harismendy et al., 2009). Thus, coverage depths of at least 20x are commonly recommended for confidence in SNP calls (Bentley et al., 2008; Dohm et al., 2008; Harismendy et al., 2009; Whittall et al., 2010), although a number of probabilistic base-call algorithms are now available that can more efficiently measure confidence in lower coverage variants (R. Li et al., 2009; Malhis and Jones, 2010; Ratan et al., 2010). Regardless, it is advisable to aim for overall average target coverage depths that are higher than the desired sequencing depth to account for the difficulty of precise genomic library quantification and the unequal sequencing efficiency of different multiplex tags (Craig et al., 2008; Cronn et al., 2008), as well as to minimize missing SNPs due to dips in coverage.

Finally, methodologically based loss of sequencing capacity must also be taken into account. For example, under- or overloaded sequencing surfaces can lead to losses in output through underutilization of sequencing space or decreased read quality due to high cluster density, respectively. Barcode sequences used to distinguish multiplexed samples also consume sequencing capacity (a 6-bp barcode would account for 6% of a 100-bp read), and a certain proportion of reads will not be assignable due to sequencing error in the barcode. It is possible to minimize these losses to a degree, for instance, through quality control of cluster generation (Quail et al., 2008) or the use of barcode indexes placed within one of the adapters, and identified in a separate sequencing reaction (Meyer and Kircher, 2010). Still, it is likely that these types of difficulties will continue to claim significant portions of sequencing capacity. As a result, a conservative estimate is again recommended, such as a total loss of 10-30% of optimum sequencing output.
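The capacity bookkeeping described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `barcode_overhead` and `usable_output` and the 1 Gbp optimum output are hypothetical and chosen for the example, whereas the 6-bp barcode, the 100-bp read, and the 10-30% loss range are taken from the text.

```python
def barcode_overhead(barcode_bp: int, read_bp: int) -> float:
    """Return the fraction of each read consumed by the barcode."""
    if read_bp <= 0 or not 0 <= barcode_bp <= read_bp:
        raise ValueError("need 0 <= barcode_bp <= read_bp and read_bp > 0")
    return barcode_bp / read_bp


def usable_output(optimum_bp: float, total_loss: float) -> float:
    """Return the bases left after a conservative total loss (a fraction)."""
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be a fraction in [0, 1)")
    return optimum_bp * (1.0 - total_loss)


if __name__ == "__main__":
    # A 6-bp barcode on a 100-bp read, as in the text.
    print(f"barcode overhead: {barcode_overhead(6, 100):.0%}")

    # Hypothetical optimum output of 1 Gbp, used only for illustration.
    optimum = 1e9
    for loss in (0.10, 0.30):  # the 10-30% range suggested in the text
        remaining = usable_output(optimum, loss)
        print(f"loss {loss:.0%}: {remaining:.3g} usable bases")
```

Planning with the pessimistic end of the loss range gives the more conservative estimate of how much sequence will actually be available for the targets.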




Taking the discussed factors into consideration, one could then estimate an appropriate multiplex level to maximize the return on sequencing effort on any desired sequencing platform as follows:

$$ML = \frac{LC \times CF \times PTG}{CD \times TaG}$$

where ML = multiplex level possible, LC = lane capacity of sequencing instrument in base pairs (factoring in both the number of sequences generated and read length); CF = correction for tag, adapters, reads not passing quality filtering; PTG = proportion of reads in read pool mapping to genomic target, CD = coverage depth desired (e.g., 20x, 40x, etc.); and TaG = target genome size (bp).

Paired-end sequencing, of course, would allow twice the number of samples to be sequenced in a single lane. For example, paired-end sequencing in a single lane of the Illumina HiSeq 2000 with 100-bp read lengths, assuming an optimal cluster density of 135 million clusters per lane and 20% lost sequencing capacity due to quality filtering and barcode sequence, would theoretically allow at least 135 samples for plastome sequencing (5% cpDNA, 50x depth) to be multiplexed in a single lane. Although high levels of multiplexing would require a significant investment in oligonucleotides, the sequences of over 200 barcode index adapters are available (Meyer and Kircher, 2010).

General recommendations - The following are general recommendations for using the genome skimming approach presented here for plant molecular systematics studies (see also Fig. 1).

Library preparation - Although the standard Illumina library preparation protocol calls for 1-5 μg of input DNA, we have demonstrated that Illumina libraries can be successfully prepared and sequenced using much less DNA (as low as 37 ng, with 24 libraries with less than 200 ng input DNA; online Appendix S1). Note that due to the high-copy nature of our target DNA, we performed half-size reactions, which would not be appropriate for studies with low-copy nuclear regions as the primary targets. Due to our elimination of the size selection step of the Illumina protocol to shorten preparation times and minimize DNA loss, use of the smallest amounts of DNA may be most appropriate for single-end sequencing because downstream applications for analysis of paired-end data may require a narrow range of insert sizes to be effectively used. The success of genomic library preparation with very low input DNA also means that DNA extractions from herbarium specimens are likely of high enough quality for Illumina sequencing (e.g., A. macrosperma Eastw. DNA extracted 21 yr after specimen was collected; online Appendix S1). These results indicate that plant systematists could begin genome skimming in their groups of interest with stocks of DNA they already have on hand. Although we have successfully sequenced hundreds of genomic libraries with barcodes placed at the 3' end of the ligated adapters (Cronn et al., 2008), we have occasionally had lanes that were incorrectly processed by the Illumina base-calling algorithm, apparently due to unbalanced base composition resulting from the multiplex indexes (Kircher et al., 2011). For this reason, we recommend the use of barcode indexes that are internal to the sequencing adapter (Meyer and Kircher, 2010).

Targeting nuclear rDNA - Although we found that 40x sequencing depth was sufficient for high quality rDNA cistron assembly, this was with a full reference sequence. (See Straub et al. [2011] for details on using ITS sequences [hundreds of bp] as the seeds to build cistron-size [thousands of bp] assemblies.) We suggest sequencing rDNA to a minimum 100x final sequencing depth to facilitate assembly and to survey and account for the polymorphism among repeat arrays. We recommend masking sequences at bases with less than 5x depth to prevent unintentional incorporation of sequencing errors into consensus sequences. This last recommendation applies not only to rDNA, but to all assembled targets.

Targeting plastomes - Our results indicate that sequencing the plastome to a minimum 30x final sequencing depth for single-copy regions will yield high quality assemblies. We recommend masking SNPs vs. the reference in reference-guided assemblies with less than 20x depth for the substitution base call, especially when the sequences are meant for downstream phylogenetic analyses. Although quantification of the proportion of plastome target DNA could be useful due to the variability observed in plants for this quantity, we do not generally recommend qPCR. The costs associated with universally applying this technique are likely to be greater than the resequencing costs for libraries that need additional sequencing depth for successful plastome assembly.

Targeting the mitochondrial genome - General recommendations for obtaining mitochondrial genomes are more difficult to make due to the wide variation in their size and complexity observed among plants (Knoop et al., 2011) and the analytical challenges associated with assembling repeated sequences (Schatz et al., 2010). At the least, the highly conserved coding sequences and nonrepetitive flanking regions can be obtained through reference-guided assembly (Straub et al., 2011). Sequence-masking criteria should follow those suggested for rDNA and cpDNA sequences.

Targeting the nuclear genome - As evidenced by the COSII analysis, even very low coverage genomes can be mined for information, although for nuclear genomes the amount of information obtained will be dependent upon genome size, which varies by orders of magnitude in plants (Bennett and Leitch, 2011). Beyond the COSII resource available for asterids used here, studies in other groups of interest could take advantage of putatively single copy nuclear loci identified across a wide range of angiosperms (154 genes, Shulaev et al., 2011; 959 genes, Duarte et al., 2010). In addition to using traditional PCR-based methods to develop these loci for phylogenetic studies in any particular group followed by either traditional Sanger sequencing or amplicon sequencing on NGS platforms (see Cronn et al. [2012] in this volume), the information could instead be used to design probes for solution hybridization or other genome reduction or targeting methods (see Cronn et al., 2012). Microsatellites for population studies are also readily identified from Illumina data (e.g., Jennings et al., 2011).

General strategies - One strategy that could be useful for the types of studies described here is to choose a species from the group of interest and sequence an individual more deeply than what is planned for the study as a whole. Combinations of reference-guided and de novo assembly (e.g., Whittall et al., 2010; Straub et al., 2011) or iterative reference-guided assembly with


successively less-divergent references (Givnish et al., 2010; Straub et al., 2011, this study) can then be used to p[...] plastome and rDNA cistron reference sequences for reference-guided assembly for these targets in more shallowly sequenced species. The more deeply sequenced individual will provide a wealth of information for exploration of the mitochondrial and nuclear genomes and development of [...] nuclear genes for further studies. The multiplex level will depend on the genomic target, sequencing capacity of the sequencer, and the nuclear genome size, if nuclear [...] are desired, for any specific study (Fig. 9).

Fig. 9. Levels of sample multiplexing for different genomic targets and sequencing capacity. Multiplex calculations are based on the median genome size for angiosperms (Bennett and Leitch, 2011). Sequencing capacity is based on Illumina product literature and our own experience for the GAIIx and HiSeq2000 platforms.

In addition to the technical aspects discussed here, bioinformatics and downstream data analysis also present [...] challenges (McPherson, 2009; Steele and Pires, 2011). There are an ever-increasing number of options available for [...] read analysis, a discussion of which is beyond the scope of this paper. However, new applications and services are [...] being developed to bring genome scale data manipulation and phylogenomic analyses into the reach of every lab and budget with both open source (e.g., Galaxy: Blankenberg et al., 2010; Goecks et al., 2010; CIPRES: Miller et al., 2010; iPlant collaborative: Goff et al., 2011) and commercial options (e.g., Geneious: Biomatters Ltd., Auckland, New Zealand; CLC Genomics Workbench: CLC-bio, Cambridge, Massachusetts; Amazon Web Services cloud: http://aws.amazon.com/) for analysis software and hardware access (Schadt et al., 2010).

Conclusions - Researchers targeting the nuclear genome generally consider the plastome and mitochondrial genome as "contamination" (Lutz et al., 2011), and protocols have been developed to minimize the percentage of these organelles in genomic DNA (Gore et al., 2009; Carrier et al., 2011; Lutz et al., 2011). In parallel, transcriptome studies take steps to reduce the abundance of ribosomal RNA in their samples (this volume: Cronn et al., 2012). Ironically, these same plastid and rDNA "contaminants" have historically been the most important targets for comparative sequencing studies in plant systematics. Our simulations and empirical results demonstrate that plant systematists can take advantage of this fact to obtain sufficient sequencing depth for reference-based approaches to assemble nearly complete plastome and ribosomal DNA sequences. The mitochondrial genome has been underutilized in plant systematics, due to the highly conserved nature of its coding loci, coupled with highly divergent noncoding regions and ubiquitous rearrangements. Likewise, repeats other than rDNA are rarely used in plant systematics, due to their rapid rate of sequence evolution and lack of sequence information from most taxa. The approaches described should accelerate the utilization of mtDNA and other classes of repetitive DNA in phylogenetic analyses and increase the spectrum of nuclear loci available for comparative studies across the land plants. Admittedly, the approach presented here is only the tip of the iceberg for what can be done with the incredible capacity of NGS technologies for genomics in nonmodel organisms. In crop plants and a small number of plant models (Arabidopsis, Helianthus L., Mimulus L.), genomics is rapidly advancing the goal of linking the genotype and phenotype (Song and Mitchell-Olds, 2011). Future developments in third-generation sequencing technologies (Munroe and Harris, 2010) and beyond will ultimately lower the cost of sequencing to the point that full-scale comparative genomics, including the evolution of plant morphological characters and physiological traits, will be possible for any plant group of interest.

LITERATURE CITED

Alvarez, I., and J. F. Wendel. 2003. Ribosomal ITS sequences and plant phylogenetic inference. Molecular Phylogenetics and Evolution 29: 417-434.
Arthofer, W., S. Schuler, F. M. Steiner, and B. C. Schlick-Steiner. 2010. Chloroplast DNA-based studies in molecular ecology may be compromised by nuclear-encoded plastid sequence. Molecular Ecology 19: 3853-3856.
Baucom, R. S., J. C. Estill, C. Chaparro, N. Upshaw, A. Jogi, J. M. Deragon, R. P. Westerman, et al. 2009. Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLOS Genetics 5: e1000732.
Bennett, M. D., and I. J. Leitch. 2011. Nuclear DNA amounts in angiosperms: targets, trends and tomorrow. Annals of Botany 107: 467-590.
Bentley, D. R., S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown, K. P. Hall, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59.
Blair, C., and R. W. Murphy. 2011. Recent trends in molecular phylogenetic analysis: Where to next? Journal of Heredity 102: 130-138.
Blankenberg, D., G. V. Küster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, et al. 2010. Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology 89: 19.10.1-19.10.21. doi: 10.1002/0471142727.mb1910s89
Carrier, G., S. Santoni, M. Rodier-Goud, A. Canaguier, A. Kochko, C. Dubreuil-Tranchant, P. This, et al. 2011. An efficient and rapid protocol for plant nuclear DNA preparation suitable for next generation sequencing methods. American Journal of Botany 98: e13-e15.
Castle, J., M. Biery, H. Bouzek, T. Xie, R. Chen, K. Misura, S. Jackson, et al. 2010. DNA copy number, including telomeres and mitochondria, assayed using next-generation sequencing. BMC Genomics 11: 244.
Craig, D. W., J. V. Pearson, S. Szelinger, A. Sekar, M. Redman, J. J. Corneveaux, T. L. Pawlowski, et al. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nature Methods 5: 887-893.
Cronn, R., B. Knaus, A. Liston, P. J. Maughan, M. Parks, J. Syring, and J. Udall. 2012. Targeted enrichment strategies for Next-Generation plant biology. American Journal of Botany 99: 291-312.


Cronn, R., A. Liston, M. Parks, D. S. Gernandt, R. Shen, and T. Mockler. 2008. Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Research 36: e122.
Degnan, J., and N. Rosenberg. 2009. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution 24: 332-340.
Delcher, A. L., A. Phillippy, J. Carlton, and S. L. Salzberg. 2002. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Research 30: 2478-2483.
Dohm, J. C., C. Lottaz, T. Borodina, and H. Himmelbauer. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research 36: e105.
Duarte, J. M., P. K. Wall, P. P. Edger, L. L. Landherr, H. Ma, J. C. Pires, J. Leebens-Mack, et al. 2010. Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evolutionary Biology 10: 61.
Durstenfeld, R. 1964. Algorithm 235: Random permutation. Communications of the ACM 7: 420.
Edwards, S. V. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63: 1-19.
Feliner, G. N., and J. A. Rossello. 2007. Better the devil you know? Guidelines for insightful utilization of nrDNA ITS in species-level evolutionary studies in plants. Molecular Phylogenetics and Evolution 44: 911-919.
Fishbein, M., D. Chuba, C. Ellison, R. J. Mason-Gamer, and S. P. Lynch. 2011. Phylogenetic relationships of Asclepias (Apocynaceae) inferred from non-coding chloroplast DNA sequences. Systematic Botany 36: 1008-1023.
Ganley, A. R. D., and T. Kobayashi. 2007. Highly efficient concerted evolution in the ribosomal DNA repeats: Total rDNA repeat variation revealed by whole-genome shotgun sequence data. Genome Research 17: 184-191.
Givnish, T. J., M. Ames, J. R. McNeal, M. R. McKain, P. R. Steele, C. W. dePamphilis, S. W. Graham, et al. 2010. Assembling the tree of the monocotyledons: Plastome sequence phylogeny and evolution of Poales. Annals of the Missouri Botanical Garden 97: 584-616.
Goecks, J., A. Nekrutenko, J. Taylor, and T. G. Team. 2010. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 11: R86.
Goff, S. A., M. Vaughn, S. McKay, E. Lyons, A. E. Stapleton, D. Gessler, N. Matasci, et al. 2011. The iPlant collaborative: Cyberinfrastructure for plant biology. Frontiers in Plant Science 2: 34.
Goloboff, P. A., J. S. Farris, and K. C. Nixon. 2008. TNT, a tree program for phylogenetic analysis. Cladistics 24: 774-786.
Gordon, A. 2011. FASTX-Toolkit. Computer program distributed by the author, website http://hannonlab.cshl.edu/fastx_toolkit/index.html [accessed 15 July 2011].
Gore, M. A. W., M. H. Ersoz, E. S. Bouffard, P. Szekeres, E. S. Jarvie, T. P. Hurwitz, B. L. Narechania, et al. 2009. Large-scale discovery of gene-enriched SNPs. Plant Genome 2: 121.
Guisinger, M. M., J. V. Kuehl, J. L. Boore, and R. K. Jansen. 2011. Extreme reconfiguration of plastid genomes in the angiosperm family Geraniaceae: Rearrangements, repeats, and codon usage. Molecular Biology and Evolution 28: 583-600.
Hall, T. A. 1999. BioEdit: A user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symposium Series 41: 95-98.
Harismendy, O., P. Ng, R. Strausberg, X. Wang, T. Stockwell, K. Beeson, N. Schork, et al. 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology 10: R32.
Jennings, T. N., B. J. Knaus, T. D. Mullins, S. M. Haig, and R. C. Cronn. 2011. Multiplexed microsatellite recovery using massively parallel sequencing. Molecular Ecology Resources 11: 1060-1067.
Katoh, K., and H. Toh. 2008. Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics 9: 286-298.
Kent, W. J. 2002. BLAT - The BLAST-like alignment tool. Genome Research 12: 656-664.
Kircher, M., P. Heyn, and J. Kelso. 2011. Addressing challenges in the production and analysis of Illumina sequencing data. BMC Genomics 12: 382.
Knaus, B. J. 2010. Short read toolbox. Website http://brianknaus.com/software/srtoolbox/ [accessed 6 June 2011].
Knoop, V., U. Volkmar, J. Hecht, and F. Grewe. 2011. Mitochondrial genome evolution in the plant lineage. In F. Kempken [ed.], Plant mitochondria, vol. 1, Advances in plant biology, 3-29. Springer, New York, New York, USA.
Kurtz, S., A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S. L. Salzberg. 2004. Versatile and open software for comparing large genomes. Genome Biology 5: R12.
Leinonen, R., H. Sugawara, and M. Shumway. 2011. The sequence read archive. Nucleic Acids Research 39: D19-D21.
Lennon, N. J., R. E. Lintner, S. Anderson, P. Alvarez, A. Barry, W. Brockman, R. Daza, et al. 2010. A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biology 11: R15.
Li, H., and R. Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754-1760.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, et al. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079.
Li, H., and N. Homer. 2010. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11: 473-483.
Li, R., Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang. 2009. SNP detection for massively parallel whole-genome resequencing. Genome Research 19: 1124-1132.
Lutz, K., W. Wang, A. Zdepski, and T. Michael. 2011. Isolation and analysis of high quality nuclear DNA with reduced organellar DNA for plant genome sequencing and resequencing. BMC Biotechnology 11: 54.
Malhis, N., and S. J. M. Jones. 2010. High quality SNP calling using Illumina data at shallow coverage. Bioinformatics 26: 1029-1035.
McPherson, J. D. 2009. Next-generation gap. Nature Methods 6: S2-S5.
Meyer, M., and M. Kircher. 2010. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harbor Protocols 2010: pdb.prot5448.
Meyers, S., and A. Liston. 2010. Characterizing the genome of wild relatives of Limnanthes alba (meadowfoam) using massively parallel sequencing. Acta Horticulturae 859: 309-314.
Miller, M. A., W. Pfeiffer, and T. Schwartz. 2010. Creating the CIPRES Science Gateway for inference of large phylogenetic trees. Gateway Computing Environments Workshop (GCE), 2010: 1-8.
Milne, I., M. Bayer, L. Cardle, P. Shaw, G. Stephen, F. Wright, and D. Marshall. 2010. Tablet - Next generation sequence assembly visualization. Bioinformatics 26: 401-402.
Ming, R., S. Hou, Y. Feng, Q. Yu, A. Dionne-Laporte, J. H. Saw, P. Senin, et al. 2008. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452: 991-996.
Moore, M. J., P. S. Soltis, C. D. Bell, J. G. Burleigh, and D. E. Soltis. 2010. Phylogenetic analysis of 83 plastid genes further resolves the early diversification of eudicots. Proceedings of the National Academy of Sciences, USA 107: 4623-4628.
Munroe, D. J., and T. J. R. Harris. 2010. Third-generation sequencing fireworks at Marco Island. Nature Biotechnology 28: 426-428.
Nixon, K. C. 1999-2004. WinClada v. 1.7. Computer program distributed by the author, Ithaca, New York, USA.
Nock, C. J., D. L. Waters, M. A. Edwards, S. G. Bowen, N. Rice, G. M. Cordeiro, and R. J. Henry. 2011. Chloroplast genome sequences from total DNA for plant identification. Plant Biotechnology Journal 9: 328-333.


Parks, M., R. Cronn, and A. Liston. 2009. Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biology 7: 84.
Parks, M., A. Liston, and R. Cronn. 2010. Meeting the challenges of non-referenced genome assembly from short-read sequence data. Acta Horticulturae 859: 323-332.
Quail, M. A., I. Kozarewa, F. Smith, A. Scally, P. J. Stephens, R. Durbin, H. Swerdlow, et al. 2008. A large genome center's improvements to the Illumina sequencing system. Nature Methods 5: 1005-1010.
Rambaut, A. 2006-2009. FigTree v. 1.3.1. Computer program distributed by the author, website http://tree.bio.ed.ac.uk [accessed 06 June 2011].
Ratan, A. 2009. Assembly algorithms for next-generation sequence data. Ph.D. dissertation, Pennsylvania State University, University Park, Pennsylvania, USA.
Ratan, A., Y. Zhang, V. M. Hayes, S. C. Schuster, and W. Miller. 2010. Calling SNPs without a reference sequence. BMC Bioinformatics 11: 130.
Rauwolf, U., H. Golczyk, S. Greiner, and R. G. Herrmann. 2010. Variable amounts of DNA related to the size of chloroplasts III. Biochemical determinations of DNA amounts per organelle. Molecular Genetics and Genomics 283: 35-47.
Rogers, S. O., and A. J. Bendich. 1987. Ribosomal-RNA genes in plants - Variability in copy number and in the intergenic spacer. Plant Molecular Biology 9: 509-520.
Rowan, B., D. Oldenburg, and A. Bendich. 2009. A multiple-method approach reveals a declining amount of chloroplast DNA during development in Arabidopsis. BMC Plant Biology 9: 3.
Schadt, E. E., M. D. Linderman, J. Sorenson, L. Lee, and G. P. Nolan. 2010. Computational solutions to large-scale data management and analysis. Nature Reviews Genetics 11: 647-657.
Schatz, M. C., A. L. Delcher, and S. L. Salzberg. 2010. Assembly of large genomes using second-generation sequencing. Genome Research 20: 1165-1173.
Shaw, J., E. B. Lickey, E. E. Schilling, and R. L. Small. 2007. Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: The tortoise and the hare III. American Journal of Botany 94: 275-288.
Shendure, J., and H. Ji. 2008. Next-generation DNA sequencing. Nature Biotechnology 26: 1135-1145.
Shulaev, V., D. J. Sargent, R. N. Crowhurst, T. C. Mockler, O. Folkerts, A. L. Delcher, P. Jaiswal, et al. 2011. The genome of woodland strawberry (Fragaria vesca). Nature Genetics 43: 109-116.
Song, B.-H., and T. Mitchell-Olds. 2011. Evolutionary and ecological genomics of non-model plants. Journal of Systematics and Evolution 49: 17-24.
Stage, D. E., and T. H. Eickbush. 2007. Sequence variation within the rRNA gene loci of 12 Drosophila species. Genome Research 17: 1888-1897.
Stamatakis, A. 2006. RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.
Stamatakis, A., P. Hoover, and J. Rougemont. 2008. A rapid bootstrap algorithm for the RAxML Web servers. Systematic Biology 57: 758-771.
Steele, P. R., and J. C. Pires. 2011. Biodiversity assessment: State-of-the-art techniques in phylogenomics and species identification. American Journal of Botany 98: 415-425.
Straub, S. C. K., M. Fishbein, T. Livshultz, Z. Foster, M. Parks, K. Weitemier, R. C. Cronn, et al. 2011. Building a model: Developing genomic resources for common milkweed (Asclepias syriaca) with low coverage genome sequencing. BMC Genomics 12: 211.
Swofford, D. W. 2003. PAUP*: Phylogenetic analysis using parsimony (*and other methods), version 4.0b10. Sinauer, Sunderland, Massachusetts, USA.
Tamura, K., D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar. 2011. MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution 28: 2731-2739.
Whittall, J. B., J. Syring, M. Parks, J. Buenrostro, C. Dick, A. Liston, and R. Cronn. 2010. Finding a (pine) needle in a haystack: Chloroplast genome sequence divergence in rare and widespread pines. Molecular Ecology 19: 100-114.
Woodson, R. E. 1954. The North American species of Asclepias L. Annals of the Missouri Botanical Garden 41: 1-211.
Wu, F., L. A. Mueller, D. Crouzillat, V. Petiard, and S. D. Tanksley. 2006. Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: A test case in the euasterid plant clade. Genetics 174: 1407-1420.
