Wiley American Journal of Botany: This Content Downloaded From 181.231.29.59 On Tue, 26 Mar 2019 21:10:07 UTC
Navigating the tip of the genomic iceberg: Next-generation sequencing for plant systematics
Author(s): Shannon C. K. Straub, Matthew Parks, Kevin Weitemier, Mark Fishbein, Richard
C. Cronn and Aaron Liston
Source: American Journal of Botany, Vol. 99, No. 2, Methods and Applications of Next-
Generation Sequencing in Botany (February 2012), pp. 349-364
Published by: Wiley
Stable URL: https://www.jstor.org/stable/41415366
Accessed: 26-03-2019 21:10 UTC
REFERENCES
Linked references are available on JSTOR for this article:
https://www.jstor.org/stable/41415366?seq=1&cid=pdf-reference#references_tab_contents
You may need to log in to JSTOR to access the linked references.
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms
Wiley is collaborating with JSTOR to digitize, preserve and extend access to American
Journal of Botany
This content downloaded from 181.231.29.59 on Tue, 26 Mar 2019 21:10:07 UTC
All use subject to https://about.jstor.org/terms
American Journal of Botany 99(2): 349-364. 2012.
Navigating the tip of the genomic iceberg:
Next-generation sequencing for plant systematics1
Shannon C. K. Straub2,5, Matthew Parks2, Kevin Weitemier2, Mark Fishbein3,
Richard C. Cronn4, and Aaron Liston2
2Department of Botany and Plant Pathology, Oregon State University, 2082 Cordley Hall, Corvallis, Oregon 97331 USA;
3Department of Botany, Oklahoma State University, 104 Life Sciences East, Stillwater, Oklahoma 74078 USA; and
4Pacific Northwest Research Station, USDA Forest Service, 3200 SW Jefferson Way, Corvallis, Oregon 97331 USA
• Premise of the study: Just as Sanger sequencing did more than 20 years ago, next-generation sequencing (NGS) is poised to
revolutionize plant systematics. By combining multiplexing approaches with NGS throughput, systematists may no longer
need to choose between more taxa or more characters. Here we describe a genome skimming (shallow sequencing) approach
for plant systematics.
• Methods: Through simulations, we evaluated optimal sequencing depth and performance of single-end and paired-end short
read sequences for assembly of nuclear ribosomal DNA (rDNA) and plastomes and addressed the effect of divergence on
reference-guided plastome assembly. We also used simulations to identify potential phylogenetic markers from low-copy
nuclear loci at different sequencing depths. We demonstrated the utility of genome skimming through phylogenetic analysis of
the Sonoran Desert clade (SDC) of Asclepias (Apocynaceae).
• Key results: Paired-end reads performed better than single-end reads. Minimum sequencing depths for high quality rDNA and
plastome assemblies were 40x and 30x, respectively. Divergence from the reference significantly affected plastome assembly,
but relatively similar references are available for most seed plants. Deeper rDNA sequencing is necessary to characterize
intragenomic polymorphism. The low-copy fraction of the nuclear genome was readily surveyed, even at low sequencing
depths. Nearly 160000 bp of sequence from three organelles provided evidence of phylogenetic incongruence in the SDC.
• Conclusions: Adoption of NGS will facilitate progress in plant systematics, as whole plastome and rDNA cistrons, partial
mitochondrial genomes, and low-copy nuclear markers can now be efficiently obtained for molecular phylogenetics studies.
Key words: Asclepias; chloroplast content; genome skimming; Illumina; next-generation sequencing; low-copy nuclear
gene; plant systematics; plastome; reference-guided assembly; ribosomal DNA cistron.
For almost two decades, plant systematists have used DNA sequences to characterize and classify plant diversity. These studies have had a profound influence on our understanding of plant phylogenetic relationships and patterns of diversification. This progress is all the more remarkable when one considers that most of these studies used only hundreds to thousands of nucleotides, representing a tiny fraction of the millions to billions of base pairs in a land plant genome. Phylogenetic analyses using kilobase-scale sampling of the chloroplast (e.g., Parks, Cronn, and Liston, 2009; Givnish et al., 2010; Moore et al., 2010) and nuclear genome (e.g., Shulaev et al., 2011) have demonstrated the potential for larger nucleotide data sets to increase the resolution of, and support for, phylogenetic hypotheses. The increasing accessibility and affordability of next-generation sequencing (NGS) provides an impetus for plant systematists to transition from Sanger sequencing of a small number of loci to genome sequencing that provides access to kilobases of data from three organelles (chloroplast, mitochondrion, and nucleus) in a single run (Steele and Pires, 2011). The combination of the immense yield of the currently most cost effective platform (Illumina HiSeq 2000) with multiplexing approaches (Cronn et al., 2008; Meyer and Kircher, 2010) means that systematists may no longer need to choose between more loci or more taxa.

In contrast to the Sanger sequencing traditionally used for plant molecular systematics studies, NGS is quantitative. Thus, in addition to obtaining the primary nucleotide sequence, NGS approaches can provide a count of the number of times each base is sequenced. This concept of sequencing depth is central to the utilization of NGS data. If the target is a complete nuclear genome, sequencing depth can be estimated as the total number of base pairs obtained divided by the genome size. However, this calculation is only accurate for parts of the genome that are present in a single copy. Repetitive sequences comprise a substantial fraction of plant genomes, and the sequencing depth of a repeat will rise in proportion to its copy number in the genome. For example, transposable element content in sequenced plant genomes ranges from 14% in papaya (Ming et al., 2008)

1 Manuscript received 19 July 2011; revision accepted 7 October 2011.
The authors thank W. Phippen (Western Illinois University) and T. Livshultz (Academy of Natural Sciences) for supplying leaf tissue; S. Lynch for providing DNA samples; M. Dasenko, M. Peterson, and C. Sullivan (Oregon State University Center for Genome Research and Biocomputing) for Illumina sequencing and data analysis support; T. Jennings and J. Swanson (USDA Forest Service) and L. Mealy and N. Nasholm (Oregon State University) for laboratory assistance; Z. Foster and K. Hansen (Oregon State University) for data analysis assistance; S. Meyers and T. Mockler (Oregon State University), M. Moore (Oberlin College), and D. Soltis (University of Florida) for access to unpublished data; and B. Knaus (USDA Forest Service) and L. Wilhelm (Oregon State University) for access to perl scripts. The authors acknowledge funding from a U. S. National Science Foundation Systematic Biology Program grant (DEB 0919583) to R.C.C., M.F., and A.L.
5 Author for correspondence (e-mail: [email protected])
doi:10.3732/ajb.1100335
American Journal of Botany 99(2): 349-364, 2012; http://www.amjbot.org/ © 2012 Botanical Society of America
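The depth estimate described in the introduction (total base pairs obtained divided by genome size, with the depth of a repeat rising in proportion to its copy number) can be illustrated with a minimal Python sketch. The function names and every numeric input below are hypothetical and chosen only for illustration; they are not values from this study.

```python
def single_copy_depth(total_bp_sequenced: float, genome_size_bp: float) -> float:
    """Estimated depth for single-copy sequence: total bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bp_sequenced / genome_size_bp


def repeat_depth(depth_single_copy: float, copy_number: float) -> float:
    """Expected depth of a repeat, assumed proportional to its copy number."""
    return depth_single_copy * copy_number


# Hypothetical library: 5 million 80-bp reads from an 800-Mbp genome.
n_reads = 5_000_000
read_length_bp = 80
genome_size_bp = 800_000_000

nuclear_depth = single_copy_depth(n_reads * read_length_bp, genome_size_bp)
# A hypothetical repeat present in 1000 copies per genome.
high_copy_depth = repeat_depth(nuclear_depth, 1000)

print(f"single-copy depth: {nuclear_depth}x")
print(f"1000-copy repeat depth: {high_copy_depth}x")
```

Under these assumed inputs, a library that covers the single-copy nuclear genome at well below 1x can still cover a high-copy target at hundreds of x, which is the premise of shallow sequencing of high-copy targets.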
350 American Journal of Botany [Vol. 99
February 2012] Straub et al. - Genome skimming for plant systematics 35 1
the reference) for the Asclepias reference and "medium" (>85% identity to the reference) for the other three references. For each of the three divergent references, assembled contig sequences were then aligned to the A. syriaca plastome sequence using the alternate-ref option in Alignreads. Consensus plastome sequences were masked at positions with sequencing depth less than five reads and single nucleotide polymorphisms (SNPs) vs. the reference were masked if sequencing depth was not at least 25 with a call proportion of 0.8 (Whittall et al., 2010). Three regions prone to misassembly, two involving mononucleotide repeats and one involving the accD pseudogene (see Straub et al., 2011), were removed from each assembly prior to downstream analyses. Additional sites in the assembly consensus sequence were then masked by hand in the alignment editor BioEdit v. 7.0.5.3 (Hall, 1999) for the following three cases: (1) gaps in coverage of the reference due to gaps between contigs, (2) introduced ambiguity codes due to incorrectly assembled indels or SNPs between overlapping contigs, or (3) long stretches of sequence divergent from the reference present at contig ends. The amount of unmasked consensus sequence, number of contigs, longest contig, and N50 for each assembly were determined using BioEdit and a perl script, contig-stats.pl (available at http://milkweedgenome.org). Differences in these metrics were determined using a one-way ANOVA plus Tukey-Kramer HSD test to correct for multiple comparisons implemented in JMP 9.0.2.

Genome skimming for phylogenetics in the Sonoran Desert clade of milkweeds (Asclepias) - Sampling, DNA extraction, and Illumina sequencing - Eleven individuals from the six species of the Sonoran Desert clade of Asclepias (clade I plus Asclepias cutleri and A. leptopus; Fishbein et al., 2011), including one hybrid individual, and three individuals of two outgroup species (A. coulteri and A. macrotis) were sampled (Table 1). Genomic DNA was extracted and prepared for sequencing on the Illumina GAIIx platform following Straub et al. (2011) with the following modifications: (1) fragmented DNA was ligated to adapters carrying unique 4-6-bp "barcodes" (Craig et al., 2008; Cronn et al., 2008) to allow multiplexing, (2) reactions were done at one half the recommended volume, and (3) in several cases agarose gel-based size selection was not used. In the latter case, we relied on sonication to minimize large molecular weight fragments (>1000 bp) and purification with Agencourt AMPure (Beckman Coulter Genomics, Danvers, Massachusetts, USA) to minimize low molecular weight fragments (<200 bp), including adapter/primer dimer contamination (Lennon et al., 2010). AMPure was used both after ligation of adapters and after PCR enrichment at ratios of 0.7-1.1 : 1 AMPure to reaction volume. The specific ratio varied depending on the fragmentation profile of the library as determined by gel electrophoresis after sonication and after enrichment (i.e., higher ratios were used for relatively high molecular weight libraries, and vice versa). Libraries were quantified using a Qubit fluorometer (Invitrogen by Life Technologies, Carlsbad, California, USA) and multiplexed in equimolar ratios. Eighty base-pair single-end reads were sequenced on an Illumina GAIIx sequencer (Illumina, San Diego, California, USA) at the Center for Genome Research and Biocomputing at Oregon State University. Resulting data were analyzed using Illumina's Real Time Analysis v. 1.6 or 1.8 and CASAVA v. 1.6 or 1.7. Adapter reads were removed and barcoded reads sorted using perl scripts (Knaus, 2010).

The sequencing depths for rDNA and the plastid and mitochondrial genomes were determined using the BLAT assembler (Kent, 2002) with default parameters to determine the number of unique read hits to the A. syriaca plastome (GenBank JF433943) and rDNA (GenBank JF312046, JF312047) sequences and the five longest contigs from the A. syriaca mitochondrial genome assembly (Straub et al., 2011). The approximate sequencing depth for the nuclear genome was determined using the reads not identified as chloroplast or mitochondrial and the genome size of A. syriaca because the genome sizes of the Sonoran Desert clade species are unknown. All were assumed to be diploid because no polyploids have been identified in Asclepias (2n = 2x = 22).

Sequence assembly and masking - Assemblies for each individual were accomplished using the program Alignreads v. 2.25 (Straub et al., 2011). Whole plastomes were assembled using the A. syriaca reference (GenBank JF433943) with only one copy of the inverted repeat present and the same masking parameters as described already for the divergent reference tests. The Alignreads output was checked in BioEdit. Areas with multiple contig alignments indicating insertions or deletions vs. the reference were expanded or contracted to incorporate these differences. Read pileups for areas with many SNPs, introduced IUPAC ambiguity codes, or other features that indicated possible misassembly were checked in the program Tablet v. 1.11.05.03 (Milne et al., 2010). If any such problem was detected, the area was masked. Read pileups were also screened in Tablet for areas of very low sequencing depth (possible misassembly involving mitochondrial or nuclear pseudogenes) or very high sequencing depth (possible misassembled repeats) relative to the surrounding region and problem areas were masked.

An iterative approach was used to obtain the rDNA cistron sequences for each individual of the SDC. First, reads from each individual were aligned against the A. syriaca reference (GenBank JF312046) with the external transcribed spacer and nontranscribed spacer trimmed. A low masking threshold was used to allow the consensus to differ from the reference (Alignreads option -w 1 0.5). The A. subulata 423 assembly resulted in a single contig with few ambiguous base calls, and so was chosen to act as a new SDC reference. Reads from each SDC individual were then aligned to this A. subulata reference in Alignreads with more stringent masking parameters (-w 5 -x 25 0.5).

Partial mitochondrial genomes were assembled using the five longest contigs from the A. syriaca mitochondrial genome assembly (Straub et al., 2011) as references with separate assemblies for each contig. Because of the expected presence of chloroplast DNA in plant mitochondrial genomes (Knoop et al., 2011), these contigs were screened for similarity to chloroplast sequences using BLAT, but none was detected. Because of the low sequencing depth, the mitochondrial assemblies were masked only for positions with less than five covering reads. Consensus sequences were trimmed to match the lengths of the references and masked where there were gaps between the contigs when aligned to the references and places where the contig sequences showed little similarity to the reference, likely indicating rearrangements.

Sequence alignment and phylogenetic inference - Consensus plastome sequences were then aligned using the program MAFFT v. 6.857 with default settings. Alignment for the mitochondrial sequences was trivial because of high sequence similarity and previous alignment to the references with the program NUCmer v. 3.07 (Delcher et al., 2002; Kurtz et al., 2004) as part of the Alignreads pipeline. The rDNA cistron sequences were aligned using local pair alignment with 1000 iterations in MAFFT v. 6.240. One ambiguous area where two contigs poorly overlapped in A. cutleri 382 required manual curation from the read pileup. Each alignment also contained the A. syriaca reference sequence used in the assembly process to serve as an additional outgroup. All alignments were corrected by eye. Areas of likely misassembly (polynucleotide runs, AT-rich repeats, the accD pseudogene), areas with inversions in some sequences, including differences due to the shifting boundary of the inverted repeat, and other difficult to align regions were removed from the plastome matrix prior to phylogenetic analysis. The matrices for the five mitochondrial references were concatenated into a single matrix using the program WinClada v. 1.7 (Nixon, 1999-2004) and sequences with more than 30% missing data were removed. Plastome and rDNA cistron sequences were submitted to GenBank (Table 1). Mitochondrial contig sequences are available at http://milkweedgenome.org. The numbers of variable and informative characters were calculated in MEGA 5.05. Each matrix was then analyzed separately using maximum parsimony (MP) and maximum likelihood (ML) optimality criteria.

Parsimony analyses were conducted in the program TNT v. 1.1 (Goloboff et al., 2008) using implicit enumeration (branch and bound) to find exact solutions and with uninformative characters deactivated. The resulting trees were viewed and characterized using WinClada v. 1.7. The program PAUP* v. 4.0b10 (Swofford, 2003) was used to perform 10000 branch and bound bootstrap replicates with all characters activated. Parsimony bootstrap support was calculated for the most likely tree topology from the ML analyses.

Maximum likelihood analyses were accomplished using the program RAxML v. 7.2.6 (Stamatakis, 2006) estimating sequence evolution using the GTRGAMMA model. Of the models of molecular evolution available in RAxML, we chose to use the full model, rather than those involving approximations of the GTR plus gamma model for increased computational efficiency, because our analysis could be run in a reasonable amount of time due to the small number of terminals. The chloroplast and mitochondrial genome matrices were unpartitioned, but the rDNA matrix was partitioned into ribosomal and spacer region sequence. Support was assessed with 10000 rapid bootstrap replicates (Stamatakis et al., 2008). Trees were viewed in FigTree v. 1.3.1 (Rambaut, 2006-2009).

Intragenomic rDNA polymorphism analysis - Read pools for each individual were filtered to remove reads with an average Phred quality score below 20 using the FASTX-Toolkit (Gordon, 2011). Within the remaining reads, any bases with a score below 20 were converted to 'N'. The quality filtered reads were aligned against their own rDNA reference using BWA 0.5.7. From this output, the proportion of reads at each position differing from the majority of read calls was calculated. Only positions with >2% of reads differing from the majority were considered polymorphic to minimize the impacts of sequencing errors, nuclear or mitochondrial rDNA pseudogenes, and potential f[...]phyte or fungal contaminant sequences (see Straub et al., 2011) [...] >10% differing base calls were considered highly polymorphic.
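The polymorphism thresholds described above (more than 2% of reads differing from the majority call for a polymorphic site, more than 10% for a highly polymorphic site) can be sketched in Python as follows. This is only an illustration under stated assumptions: the pileup is a hypothetical dictionary of base-call strings per position, 'N' calls are assumed to be excluded from the counts, and the label "monomorphic" for sites at or below the 2% threshold is a name introduced here, not one used in the text.

```python
from collections import Counter


def minor_proportion(calls: str) -> float:
    """Proportion of A/C/G/T calls that differ from the majority call.

    Assumption: 'N' (quality-masked) calls are ignored.
    """
    counts = Counter(base for base in calls.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - max(counts.values())) / total


def classify_site(calls: str, low: float = 0.02, high: float = 0.10) -> str:
    """Apply the >2% and >10% thresholds to one alignment position."""
    p = minor_proportion(calls)
    if p > high:
        return "highly polymorphic"
    if p > low:
        return "polymorphic"
    return "monomorphic"  # hypothetical label for sites not counted as polymorphic


# Hypothetical read calls at four positions of an rDNA alignment.
pileup = {
    1: "A" * 100,                    # no differing calls
    2: "A" * 95 + "G" * 5,           # 5% differing calls
    3: "C" * 80 + "T" * 20,          # 20% differing calls
    4: "A" * 99 + "G" + "N" * 10,    # 1% differing calls, N ignored
}
labels = {pos: classify_site(calls) for pos, calls in pileup.items()}
for pos, label in labels.items():
    print(pos, label)
```

The 2% floor plays the role described in the text: a site with only an occasional differing read, as expected from sequencing error, is not counted as polymorphic.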
Chloroplast and rDNA content of Apocynaceae Illumina libraries - [...] proportions of plastid and rDNA in Illumina sequencing libraries from [...] Apocynaceae tribe Asclepiadeae samples (A. Liston, unpublished data) [...] samples from other Apocynaceae tribes (S. Straub and T. Livshultz, unpublished data) were calculated using BLAT as described above (Appendix [...] Supplemental Data with the online version of this article). Libraries were sequenced on the Illumina GAIIx or HiSeq 2000, and produced at least [...] 80-bp or 100-bp sequence reads. To account for divergence from the [...] cpDNA and rDNA references, the estimated number of matching [...] increased by 5% for Apocynaceae, excluding Asclepias and other Asclepiadeae, based on 0.046 pairwise divergence between the A. syriaca and Nerium oleander plastomes. A t-test assuming unequal variances implemented in JMP 9.0.2 was used to test for differences between the values observed for Asclepiadeae and those observed for the rest of Apocynaceae.

Table 1. [Table body not recoverable.]

RESULTS

Genome skimming in silico - Effects of sequencing depth and read type on assembly success - Nuclear genome sequencing depths of 0.01x to 4x resulted in rDNA and plastome sequencing depths also spanning two orders of magnitude (Fig. 2). Sequencing depth had a significant effect on both the number of unambiguously called bases (rDNA single-end, F8,36 = 34.7005, P < 0.0001; rDNA paired-end, F8,36 = 7650.309, P < 0.0001; cpDNA single-end, F8,36 = 57.0139, P < 0.0001; cpDNA paired-end, F8,36 = 205.1690, P < 0.0001) and amount of unmasked sequence based on the presence of five or more reads covering the position in the consensus sequence (rDNA single-end, F8,36 = 149.1947, P < 0.0001; rDNA paired-end, F8,36 = 15.4803, P < 0.0001; cpDNA single-end, F8,36 = 1196.508, P < 0.0001; cpDNA paired-end, F8,36 = 2749.501, P < 0.0001) for both the rDNA and plastome references (Fig. 2). For the rDNA assemblies, ca. 38x sequencing depth was sufficient for obtaining a similar number of unambiguously called bases as the higher sequencing depth assemblies, while only ca. 25x sequencing depth was required to have a similar number of bases with at least 5x sequencing depth as the higher sequencing depth assemblies. For the chloroplast assemblies, ca. 30x sequencing depth provided assemblies with a similar number of unambiguously called bases at 5x minimum sequencing depth as the higher depth data subsamples. Note that duplicate reads were not removed from the data set. While this would have eliminated PCR duplicates, biological duplicates would also have been eliminated. Removal of duplicate reads would slightly increase the number of sites with at least 5x sequencing depth and slightly decrease the number of unambiguously called bases (S. Straub, unpublished data).

When all sequencing depths were considered, paired-end read data sets significantly increased the percentage of the reference sequence assembled for both the rDNA and plastome references as measured by the percentage of unmasked bases (rDNA, S = 370.50, P < 0.0001; cpDNA, S = 359.50 P < [...] sets significantly increased this value for the rDNA reference (S = -374.00, P < 0.0001). For the 0.5x and 1x paired-end subsamples, greater than 100% of the plastome reference was assembled because the paired-end information allowed assembly of small insertions of total length greater than small deletions detected in this individual compared to the reference plastome (Fig. 2D).

Fig. 2. Assembly success for Asclepias syriaca rDNA and plastome reference sequences at different sequencing depths. (A) rDNA reference, single-end read data subsamples. (B) rDNA reference, paired-end read data subsamples. (C) Plastome reference, single-end read data subsamples. (D) Plastome reference, paired-end read data subsamples. Simulated nuclear sequencing depth is given on the horizontal axis. [...] and standard deviation for the five data subsamples at each depth point for each high copy target is given in par[...]ing depth values. Dark gray bars: percentage of the assembled reference consisting of unambiguous base calls. Light gray bars: [...] reference with five or more reads supporting each base call (i.e., percentage of unmasked sequence). Error b[...] among subsets sharing a letter are not statistically significant after correction for multiple comparisons.

Assessing the utility of low coverage genomes for phylogenetic marker development - Significantly more COSII loci overall (single-end, F8,36 = 7669.66, P < 0.0001; paired-end, F8,36 = 5737.84, P < 0.0001) and COSII loci with two or greater hits per locus (single-end, F8,36 = 11034.08, P < 0.0001; paired-end, F8,36 = 15576.86, P < 0.0001) were hit with each increase in sequencing depth (Fig. 3), except in the case of loci with two or more hits per locus for the step between the 0.01x and 0.02x sequencing depths with single-end reads (P = 0.2471). The single-end data sets produced more COSII hits (S = 198.50, P = 0.0112), but there was no difference in the number of loci with two or more hits for single-end vs. paired-end data (S = -7.500, P = 0.9291).

Fig. 3. Number of nuclear conserved orthologous set II (COSII) loci detected among Asclepias syriaca Illumina microreads at different sequencing depths. (A) Single-end read data subsamples. (B) Paired-end read data subsamples. Dark gray bars: total number of loci hit. Light gray bars: number of loci with two or more microread hits. Error bars: standard deviations.

Effects of divergence from the reference on plastome assembly - Pairwise divergences between Asclepias and Nerium, Coffea, and Nicotiana were 0.046, 0.098, and 0.108, respectively. Divergence from the reference had a significant effect on assembly success in terms of the percentage of the plastome assembled and number of contigs in the assembly, but the lengths of the longest contig and N50 were unaffected by divergence beyond the conspecific comparison (Fig. 4). Regions that were often absent in consensus sequences derived from divergent references corresponded to the most quickly evolving regions of the chloroplast genome including intergenic spacers (e.g., trnH(GUG)-psbA, rpl32-trnL(UAG), rps16-trnQ(UUG), ndhC-trnV(UAC), ndhF-rpl32), introns (e.g., rpl16, ndhA), and ycf1 (a putative pseudogene in Asclepias). Apart from unassembled regions, few differences were observed among the assembly consensus sequences, nearly all of which were length differences of 2-4 bp associated with mononucleotide repeats.

Genome skimming for phylogenetics in the Sonoran Desert clade of milkweeds (Asclepias) - Illumina sequencing, sequence assembly, and alignment - Sequencing was successful for all 14 genomic libraries with input DNA amounts as low as 83 ng (Table 1). The total nuclear sequencing depths ranged from 0.07-0.3x, rDNA sequencing depths from 53-636x, chloroplast sequencing depths from 56-300x, and mitochondrial sequencing depths from 3-21x (Table 1). Nearly complete sequence assemblies were obtained for the plastome and rDNA cistron for all individuals. The percentage of masked bases in the plastome assemblies ranged from 0.01 to 1.63% (median 0.21%). The percentage of masked bases in the rDNA assemblies ranged from 0 to 0.99% (median 0.15%). The mitochondrial sequence assemblies were much less complete due to the low sequencing depth obtained for most individuals. Additional sites not masked for that reason had to be masked due to sequence rearrangements vs. the A. syriaca reference. The percentage of masked bases in the mitochondrial genome assemblies ranged from 1.38 to 83.71% (median 14.41%). Four sequences were removed from the matrix because they contained >30% missing data, considering both gaps and masked bases. With these sequences removed, the median percentage of masked bases was 6.25%. In total, the three matrices contained 159981 characters, 0.5% of which were parsimony informative in the ingroup (Table 2; Appendices S2-S4, see Supplemental Data with the online version of this article).

Phylogenetic inference - Parsimony and maximum likelihood analyses of the plastome matrix resulted in one most parsimonious (length [L] = 2207, consistency index [Ci] = 94, retention index [Ri] = 94) and most likely tree (lnL = -194692.00399) of nearly identical topology and high bootstrap support (Fig. 5A). The analysis of the rDNA matrix resulted in five most parsimonious trees (L = 175, Ci = 90, Ri = 94) and a most likely tree (lnL = -9260.123917), the topology of which did not conflict with the strict consensus of most parsimonious trees (Fig. 5B). The analysis of the mitochondrial DNA matrix resulted in four most parsimonious trees (L = 251, Ci = 93, Ri = 81) and a most likely tree (lnL = -39012.221073) with multiple topological conflicts with the strict consensus of the most parsimonious trees (Fig. 5C). In the strict consensus of most parsimonious trees, the positions of A. cutleri and A. leptopus were reversed and had high bootstrap support, 100% and 98% respectively. The parsimony analysis also indicated strong support (bootstrap 99%) for the sister group relationship between A. albicans 422 and A. subulata 411.

Intragenomic rDNA polymorphism analysis - Every individual analyzed was at least somewhat polymorphic at many positions and highly polymorphic in at least a few positions. Low proportion polymorphisms (proportion of reads differing from the consensus 0.02 < N < 0.10) ranged from 44-361 polymorphic sites per individual, with a mean of 156 polymorphic sites (95% CI 104-208). Highly polymorphic sites (proportion of differing reads >0.10) ranged from 4-84 per individual, with a mean of

Table 2. Sequence obtained from and variability observed in the three genomes of the Sonoran Desert clade of Asclepias.

                                                  All samples        Ingroup only
Genome            No. of sequences   Characters   Variable   PICs    Variable   PICs
Nuclear (rDNA)    15                 5878         150        106     72         50
Chloroplast       15                 130440       2069       1157    1045       730
Mitochondrial     11                 23663        232        50      71         15
Totals                               159981       2541       1313    1188       795

Note: PICs = parsimony informative characters.
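The arithmetic behind the statement that 0.5% of the 159981 characters were parsimony informative in the ingroup can be checked from the rows of Table 2. The short Python sketch below uses the per-genome values as they appear in this text; the dictionary keys are names introduced here for illustration. One caveat: the per-genome values for variable characters across all samples sum to 2451, whereas the Totals row in this text reads 2541, and which of the two reflects the original table cannot be determined from this copy.

```python
# Per-genome values transcribed from Table 2 (key names are illustrative).
table2 = {
    "Nuclear (rDNA)": {"characters": 5878, "variable_all": 150, "pics_all": 106,
                       "variable_ingroup": 72, "pics_ingroup": 50},
    "Chloroplast": {"characters": 130440, "variable_all": 2069, "pics_all": 1157,
                    "variable_ingroup": 1045, "pics_ingroup": 730},
    "Mitochondrial": {"characters": 23663, "variable_all": 232, "pics_all": 50,
                      "variable_ingroup": 71, "pics_ingroup": 15},
}


def column_total(column: str) -> int:
    """Sum one column of Table 2 across the three genomes."""
    return sum(row[column] for row in table2.values())


total_characters = column_total("characters")
ingroup_pics = column_total("pics_ingroup")
percent_informative = 100 * ingroup_pics / total_characters

print("total characters:", total_characters)
print("ingroup PICs:", ingroup_pics)
print(f"percent parsimony informative (ingroup): {percent_informative:.2f}%")
```

Computed this way, the ingroup figure is just under half a percent, consistent with the 0.5% reported in the text.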
Fig. 5. Phylogenetic relationships among species of the Sonoran Desert clade (SDC) of Asclepias (Apocynaceae). (A) Phylogenetic relationships based on whole plastome sequences. The asterisk indicates the location of the single topological difference between the maximum likelihood (ML) tree presented and the single maximum parsimony (MP) tree, where the branching order of A. cutleri and A. leptopus is reversed. (B) ML tree depicting phylogenetic relationships based on rDNA cistron sequences. (C) ML tree depicting phylogenetic relationships based on partial mitochondrial genome sequences. For (A-C), numbers above branches represent ML/MP bootstrap support >50% for the ML topology. Scale bars represent the number of substitutions per site.
Conclusions - Researchers targeting the nuclear genome generally consider the plastome and mitochondrial genome as "contamination" (Lutz et al., 2011), and protocols have been developed to minimize the percentage of these organelles in genomic DNA (Gore et al., 2009; Carrier et al., 2011; Lutz et al., 2011). In parallel, transcriptome studies take steps to reduce the abundance of ribosomal RNA in their samples (this volume: Cronn et al., 2012). Ironically, these same plastid and rDNA "contaminants" have historically been the most important targets for comparative sequencing studies in plant systematics. Our simulations and empirical results demonstrate that plant systematists can take advantage of this fact to obtain sufficient

Fig. 9. Levels of sample multiplexing for different genomic targets and sequencing capacity. Multiplex calculations are based on the median genome size for angiosperms (Bennett and Leitch, 2011). Sequencing capacity is based on Illumina product literature and our own experience for the GAIIx and HiSeq2000 platforms.

Alvarez, I., and J. F. Wendel. 2003. Ribosomal ITS sequences and plant phylogenetic inference. Molecular Phylogenetics and Evolution 29: 417-434.
Arthofer, W., S. Schuler, F. M. Steiner, and B. C. Schlick-Steiner. 2010. Chloroplast DNA-based studies in molecular ecology may be compromised by nuclear-encoded plastid sequence. Molecular Ecology 19: 3853-3856.
Baucom, R. S., J. C. Estill, C. Chaparro, N. Upshaw, A. Jogi, J. M. Deragon, R. P. Westerman, et al. 2009. Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLOS Genetics 5: e1000732.
Bennett, M. D., and I. J. Leitch. 2011. Nuclear DNA amounts in angiosperms: targets, trends and tomorrow. Annals of Botany 107: 467-590.
Bentley, D. R., S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown, K. P. Hall, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59.
Blair, C., and R. W. Murphy. 2011. Recent trends in molecular phylogenetic analysis: Where to next? Journal of Heredity 102: 130-138.
Blankenberg, D., G. V. Küster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, et al. 2010. Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology 89: 19.10.1-19.10.21. doi: 10.1002/0471142727.mb1910s89
Carrier, G., S. Santoni, M. Rodier-Goud, A. Canaguier, A. Kochko, C. Dubreuil-Tranchant, P. This, et al. 2011. An efficient and rapid protocol for plant nuclear DNA preparation suitable for next generation sequencing methods. American Journal of Botany 98: e13-e15.
Castle, J., M. Biery, H. Bouzek, T. Xie, R. Chen, K. Misura, S. Jackson, et al. 2010. DNA copy number, including telomeres and mitochondria, assayed using next-generation sequencing. BMC Genomics 11: 244.
Craig, D. W., J. V. Pearson, S. Szelinger, A. Sekar, M. Redman, J. J. Corneveaux, T. L. Pawlowski, et al. 2008. Identification of genetic variants using bar-coded multiplexed sequencing. Nature Methods 5: 887-893.
Cronn, R., B. Knaus, A. Liston, P. J. Maughan, M. Parks, J. Syring, and J. Udall. 2012. Targeted enrichment strategies for Next-Generation plant biology. American Journal of Botany 99: 291-312.