Developing the Protocol Infrastructure for DNA Sequencing Natural History Collections
doi: 10.3897/BDJ.11.e102317
Methods
Citation: Ferrari G, Esselens L, Hart ML, Janssens S, Kidner C, Mascarello M, Peñalba JV, Pezzini F, von
Rintelen T, Sonet G, Vangestel C, Virgilio M, Hollingsworth PM (2023) Developing the Protocol Infrastructure for
DNA Sequencing Natural History Collections. Biodiversity Data Journal 11: e102317.
https://doi.org/10.3897/BDJ.11.e102317
Abstract
Intentionally preserved biological material in natural history collections represents a vast
repository of biodiversity. Advances in laboratory and sequencing technologies have made
these specimens increasingly accessible for genomic analyses, offering a window into the
genetic past of species and often permitting access to information that can no longer be
sampled in the wild. Due to their age, preparation and storage conditions, DNA retrieved
from museum and herbarium specimens is often poor in yield, heavily fragmented and
biochemically modified. This not only poses methodological challenges in recovering
nucleotide sequences, but also makes such investigations susceptible to environmental
and laboratory contamination. In this paper, we review the practical challenges associated
with making the recovery of DNA sequence data from museum collections more routine.
We first review key operational principles and issues to address, to guide the decision-
making process and dialogue between researchers and curators about when and how to
sample museum specimens for genomic analyses. We then outline the range of steps that
can be taken to reduce the likelihood of contamination including laboratory set-ups,
workflows and working practices. We finish by presenting a series of case studies, each
focusing on protocol practicalities for the application of different mainstream methodologies
to museum specimens including: (i) shotgun sequencing of insect mitogenomes, (ii) whole
genome sequencing of insects, (iii) genome skimming to recover plant plastid genomes
from herbarium specimens, (iv) target capture of multi-locus nuclear sequences from
herbarium specimens, (v) RAD-sequencing of bird specimens and (vi) shotgun sequencing
of ancient bovid bone samples.
© Ferrari G et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY
4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are
credited.
Keywords
Museomics, hDNA, biodiversity genomics, natural history collection sequencing
Introduction
Natural history collections as a resource for genomic science
There are more than one billion specimens representing ca. two million species stored in
natural history collections worldwide (Wheeler et al. 2012, Yeates et al. 2016, Johnson et
al. 2023). These collections span a wide geographical and temporal range and represent a
globally distributed biorepository. They house biological specimens representing the
world’s known species, along with many specimens representing undescribed species
awaiting taxonomic recognition and formal taxonomic descriptions (Bebber et al. 2010).
First and foremost, these natural history collections were established to support
understanding of species diversity and distributions (Miller et al. 2020) and the vast
majority of specimens housed in these repositories were collected to preserve their
appearance and morphological features; most specimens were not collected with DNA
sequencing in mind (Roycroft et al. 2022).
Until recently, the recovery of DNA sequences from museum specimens was challenging
and prone to very high rates of failure, or required laborious protocols for the successful
recovery of minimal quantities of nucleotide sequence data (Lalueza-Fox 2022). However,
with the development of improved sequencing technologies and protocols, there is now a
rapid surge of interest in the field of museomics (Card et al. 2021, Raxworthy and Smith
2021). Considerable attention is being given to unlocking DNA data at a large scale,
capitalising on the centuries of effort that have gone into the acquisition of biological
specimens for natural history collections (Hebert et al. 2013, Folk et al. 2021). At a very
practical level, natural history collections provide access to easy-to-retrieve and often well-
identified specimens. This contrasts with the considerable challenges and costs associated
with obtaining freshly-collected material for DNA analyses, such as field collecting costs,
the cost of preparing voucher specimens and the difficulties of accessing taxonomic
expertise to ensure accurate biosample identifications (Hebert et al. 2013, Camacho et al.
2018). These challenges are exacerbated for taxa occurring in remote and/or poorly-
studied locations (Wandeler et al. 2007) or areas that are difficult to access because of
political instability or conflict (Burrell et al. 2015). Furthermore, where taxa or populations
have been lost in the wild, natural history collections are often the only genetic resource for
extinct and endangered species (Wandeler et al. 2007, Clewing et al. 2022). Beyond these
practical benefits of sampling museum specimens for DNA, there are also the unique
scientific opportunities that come from being able to undertake time-series analyses
capitalising on the temporal component of natural history collections; DNA sequencing of
these collections can provide direct windows into evolutionary processes and patterns of
adaptation and evolutionary change (Holmes et al. 2016) and the trajectory of species of
conservation concern (Nakahama 2021, Jensen et al. 2022). Likewise, sequencing
specimens from natural history collections can also provide insights into the dynamics of
associated organisms, such as pathogens, parasites and other intimately-connected
species residing in or on museum specimens (Bieker et al. 2020, Ferrari et al. 2020,
Ristaino 2020, Raxworthy and Smith 2021, Speer et al. 2022).
The DNA within natural history museum specimens has properties distinct from those of DNA in
freshly-collected material, which has practical implications for recovering sequence data.
From a biochemical perspective, DNA isolated from natural history collection specimens
shares many similarities with ancient DNA (aDNA). Characteristically, aDNA is highly
fragmented and biochemically damaged, often present in small quantities and subject to
contamination from the environment and human handling. In the absence of the enzymatic
repair mechanisms of living cells, DNA is subject to hydrolysis, oxidation and cross-linking
(Dabney et al. 2013a), processes that can be accelerated by high temperatures, extreme
environmental pH, humidity and the presence of microorganisms (Willerslev and Cooper
2005, Willerslev et al. 2007). Hydrolysis and oxidation can lead to depurination, which
results in DNA strand breakage (Lindahl 1993). As a consequence, aDNA is typically no
longer than 150 bp (Green et al. 2009).
These degradation processes for aDNA also occur to greater or lesser degrees in museum
specimens. Various studies of DNA degradation in natural history collections have shown
that DNA fragmentation can occur rapidly after death (Sawyer et al. 2012), with a wide
range of reported fragment lengths, including frequent reports of fragment lengths < 100 bp
(McDonough et al. 2018, Canales et al. 2022, Mullin et al. 2023). There is an imperfect
relationship between specimen age and levels of fragmentation, with some studies
showing a correlation between specimen age and fragment length (McCormack et al. 2016,
Weiß et al. 2016, Mullin et al. 2023), whereas others did not (Sawyer et al. 2012). The
factors affecting the rate of fragmentation are complex (Kistler et al. 2017) and include
differences between genomes (e.g. mtDNA sequences showing slower degradation than
nuclear sequences (Heintzman et al. 2014)), differences between tissue types (Kistler et al.
2017, Andreeva et al. 2022) and differences due to the environmental conditions at the site
of specimen preparation and different storage environments and preservation methods
(Brewer et al. 2019, Card et al. 2021, Mullin et al. 2023). For instance, museum storage
conditions may impact on biomolecule degradation, with temperature and humidity
influencing levels of preservation (Kistler et al. 2017, Brewer et al. 2019). Likewise different
preservation methods themselves impact on the preservation and recoverability of
nucleotide sequences: several preparation techniques involve heat, which accelerates
DNA hydrolysis resulting in fragmentation (Lindahl 1993, Willerslev and Cooper 2005).
Formalin fixation is a commonly-used technique for wet-mounted specimens and,
especially if unbuffered, can cause a number of reactions, including DNA fragmentation via
acid-driven hydrolysis and DNA-protein cross-linking that results in PCR inhibition (Brutlag
et al. 1969, Gilbert et al. 2007b). Treatment of bones with ammonium solutions and various
tanning agents has also been reported to reduce DNA yields from museum specimens
(Vuissoz et al. 2007, Tarbet Hust and Snow 2021). Finally, pest-control treatments of
collection specimens can also impact the recovery of DNA (Espeland et al. 2010, Töpfer et
al. 2011).
There are several consequences of this fragmentation and loss of DNA. The first is that
experimental effort may be expended which ultimately leads to a failure to recover DNA
sequence data due to low endogenous DNA content. The second is that DNA sequence
data may be recovered, but be misleading due to contamination. The potential for
contamination in sequence data from museum specimens is substantial. The low
concentrations of fragmented endogenous DNA in museum specimens represent an initial
low signal-to-noise ratio and a high potential for contamination from a wide variety of
sources, including:
2. High concentrations of DNA from fresh samples and their amplification products
processed in the same facility which represent an important source of
contamination when handling degraded DNA. Such contaminant DNA may be
present in higher concentrations than the DNA in historic samples and this is
exacerbated by subsequent PCR being biased towards higher-quality DNA;
At best, contamination reduces sequencing efficiency for endogenous DNA and requires
greater sequencing efforts at higher costs. What is more problematic is the generation of
erroneous data where misleading biological inferences are made from undetected
contamination (Yeates et al. 2016).
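The cost implication described above can be made concrete with some simple arithmetic. The following is a minimal sketch, assuming that reads are of uniform length and that only a fixed fraction of them derive from the endogenous target; the function name and the example figures are illustrative and are not taken from the case studies.

```python
import math


def reads_required(target_size_bp, target_depth, read_length_bp, on_target_fraction):
    """Approximate number of reads needed to reach a mean depth on a target.

    Assumes uniform read length and that only `on_target_fraction` of all
    reads derive from the endogenous target (the remainder being
    contaminant or otherwise unusable reads).
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    usable_bp_per_read = read_length_bp * on_target_fraction
    return math.ceil(target_size_bp * target_depth / usable_bp_per_read)


# Hypothetical example: a 1 Mb target at 10x mean depth with 50 bp reads.
clean = reads_required(1_000_000, 10, 50, 1.0)
contaminated = reads_required(1_000_000, 10, 50, 0.25)
print(clean, contaminated)
```

Under these assumptions, the sequencing effort scales inversely with the endogenous fraction, so a library in which only a quarter of the reads are endogenous needs four times as many reads for the same depth.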
A more generic source of error, but one to which museum-derived sequences are
particularly susceptible, is problems stemming from low coverage of sequence reads due
to low DNA concentration. This can result in misleading inference; for example, failure to
recover both alleles in diploid heterozygotes leading to an overestimation of homozygosity
at some loci and in some specimens (Ewart et al. 2019) or, more generally, the introduction
of noise due to miscalls which may simply override any weakly-resolved genuine signal in
the data.
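The effect of low coverage on apparent homozygosity can be illustrated with a simple binomial model. The sketch below assumes that each read covering a heterozygous site samples either allele independently with equal probability, and it ignores sequencing error, mapping bias and post-mortem damage; the function name is illustrative.

```python
def prob_allele_dropout(depth):
    """Probability that only one allele is observed at a heterozygous site.

    Assumes each of `depth` reads samples either allele with probability
    0.5, independently of the others.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Either every read shows the first allele or every read shows the second.
    return 2 * 0.5 ** depth


for depth in (1, 2, 5, 10, 20):
    print(f"depth {depth:>2}: P(dropout) = {prob_allele_dropout(depth):.6f}")
```

Under this model, half of all heterozygous sites covered by only two reads would appear homozygous, which is why genotype calls from low-coverage data need to be treated with caution.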
Above and beyond the challenges of recovering reliable sequence data from low
concentrations of fragmented endogenous DNA is the possibility of post-mortem
modifications leading to artefactual substitutions in the recovered DNA sequences. Post-
mortem hydrolytic deamination causes base modifications, primarily affecting cytosine
(Orlando et al. 2021). Uracil, the deamination product of cytosine, causes the
misincorporation of adenine during DNA amplification. This results in C to T substitutions in
the deaminated strand and G to A substitutions in the complementary strand of DNA
(Briggs et al. 2007, Brotherton et al. 2007). This deamination of cytosine resulting in C to T
and G to A substitutions during amplification can, if unchecked, lead to a systematic
misleading signal in the data. Sequences from different biological samples may share
nucleotide changes due to these miscoding lesions which may be misinterpreted as
genuine biological similarities. The accumulation of miscoding lesions at the end of DNA
molecules is a feature of aDNA (and used as an important parameter for the validation of
aDNA data authenticity (Briggs et al. 2007, Green et al. 2009)); however, they do tend to
be less common in studies focusing on natural history collections. DNA deamination
correlates with specimen age and there is an expectation for more recently collected
museum specimens to show limited impacts of deamination-related substitutions,
Maximising the use of museum specimens, including for genomic analyses, while
minimising unnecessary destruction of precious samples thus reflects a balancing act. This
is made particularly difficult, as often the greatest scientific returns will come from the
specimens that are most valuable. For instance, type specimens will almost always have
significant constraints on their use, which may act as a barrier to inclusion in genomic
studies. On the other hand, effective minimally destructive sequencing of type specimens
provides a direct connection between genomic data and the application of a species name
and, hence, represents a significant scientific benefit, particularly for taxonomic and
systematic studies. More generally, while sampling a museum specimen for
genomic analysis usually removes material from the specimen to obtain DNA,
it can also add something extremely useful: critically important genomic
data which may add considerable value to the specimen (i.e.
the concept of the extended specimen; Webster (2017)).
Guidance on best practice standards and processes for the access and transfer of samples
for genomic analysis is given by the Consortium of European Taxonomic Facilities (2015)
and de Mestier et al. (2022). However, from a perspective of the impacts on the specimens
themselves, it is not surprising, given the rapidly-evolving state of the field and the
complexity of choices regarding different collections and different approaches, that there
are no community standards to guide when it is appropriate to destructively sample
specimens. There are various policy documents to guide decision-making at institutional
levels and several useful more general perspectives (e.g. Freedman et al. (2018), Austin et
al. (2019), Pálsdóttir et al. (2019)). To further facilitate the navigation of ‘when and how’ to
sample, we outline ten key principles which can usefully be followed by researchers and
assessed by curators in guiding when to undertake destructive sampling of specimens for
genomic analyses:
1. Assess the scientific merit of the planned genomic project; ensure there is a clear
likely benefit prior to commencing destructive sampling and that the resulting data
will be informative and of sufficient resolution to tackle the question at hand;
The DNA in the majority of natural history museum specimens sits at the interface of aDNA
and non-degraded DNA samples and is classed as historic DNA (hDNA), more formally
defined as DNA from specimens archived in museum collections that were not originally
intended as genetic resources (Billerman and Walsh 2019, Raxworthy and Smith 2021,
Irestedt et al. 2022). This classification recognises that museum specimens, typically
collected over the last 250 years, have different properties from both specimens recently collected
and preserved for DNA analyses and ancient specimens deposited in nature over millennia
(Raxworthy and Smith 2021, Wandeler et al. 2007). This distinction between ancient and
historic DNA is useful, although it should be noted that the line between archaeological
specimens, natural history collections and even biobank material is a blurred one. As noted
previously, several factors other than age influence DNA preservation, such as
temperature, substrate, taphonomic conditions and specimen preparation and storage.
Thus, a permafrost-preserved archaeological sample may be a better DNA source than a
heat-dried or chemically-treated museum voucher. Because of the lack of a priori
information regarding the magnitude of DNA degradation in historical collection material, a
pragmatic working assumption for hDNA material is to assume damage and fragmentation,
as well as environmental contamination (Latorre et al. 2020).
During the preparation of this paper, discussions amongst the authors and an informal
survey of colleagues working in a range of organisations involved in sequencing museum
specimens revealed a wide range of operational practices. These ranged from processing
samples in the same laboratories as fresh tissue, through to dedicated hDNA (or low-copy)
laboratories, through to only ever using fully-equipped aDNA facilities for processing
museum specimens. A multitude of factors were articulated as underlying the decision-
making of which facilities to use for processing hDNA samples, including:
• When both aDNA and hDNA samples have to be processed in the same institution,
the aDNA laboratory being used exclusively for aDNA samples, with museum
specimens processed elsewhere due to concerns that the higher concentrations of
DNA from museum specimens may lead to contamination problems in the aDNA
lab.
The most critical component of setting up an aDNA laboratory (Gilbert et al. 2005, Fulton
2012, Knapp et al. 2012, Llamas et al. 2017) is a strict separation of pre- and post-PCR
areas. This includes no movement of equipment, reagents, consumables or samples from
post- to pre-PCR. Similarly, scientists should not move from post- to pre-PCR areas without
showering and changing clothes. The dedicated aDNA pre-PCR facilities are physically
isolated from any post-PCR area and are used for sample processing, DNA isolation and
setting up of sequencing library and PCR reactions. These reactions are moved to post-
PCR facilities at the first DNA amplification step and post-amplification products can be
handled normally in shared laboratory facilities. Amplification products, fresh biological
material and modern DNA samples should never be introduced into the aDNA facilities.
Additionally, the dedicated aDNA laboratory should be fitted with a positive pressure,
HEPA-filtered air system and UV lights for daily sterilisation of all surfaces. Equipment,
tools and working surfaces should be cleaned daily or after every use with a 1-2% sodium
hypochlorite solution or a surface decontaminant such as DNA Away™ or DNA Exitus™.
Plastic consumables should be UV-sterilised and only filter tips should be used. Everything
that is introduced into the laboratory should be decontaminated. The aDNA laboratory
should be accessed through an antechamber where incoming reagents and consumables
are sterilised (and ideally introduced through a dedicated UV-hatch) and PPE donned
(overalls with hood, hairnets, facemasks, face shields, double gloves and shoe covers or
dedicated shoes). Destructive sampling and sample powdering should take place in a
separate room inside a PCR cabinet or dead-air box. Samples and DNA extracts should be
handled exclusively in HEPA-filtered laminar flow cabinets, equipped with UV lamps (and
should be cleaned and sterilised after every use). A separate cabinet for DNA-free
applications (aliquoting reagents, preparing master mixes) is needed. Additional good
working practices include processing samples in small batches and the inclusion of several
non-template negative controls for DNA isolation and library preparation, as well as dividing
reagents into smaller aliquots.
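Preparing the 1-2% sodium hypochlorite working solution mentioned above is a standard dilution calculation (C1V1 = C2V2). The sketch below is illustrative only: the stock concentration is a hypothetical value and should be read from the label of the product actually used.

```python
def stock_volume_ml(stock_percent, target_percent, final_volume_ml):
    """Volume of stock solution needed for a dilution, from C1*V1 = C2*V2."""
    if not 0 < target_percent <= stock_percent:
        raise ValueError("target must be positive and no stronger than the stock")
    return target_percent * final_volume_ml / stock_percent


# Hypothetical example: 500 mL of 1.5% working solution from a 10% stock.
stock = stock_volume_ml(stock_percent=10.0, target_percent=1.5, final_volume_ml=500.0)
water = 500.0 - stock
print(f"{stock:.1f} mL stock + {water:.1f} mL water")
```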
Natural history collection material can be handled in aDNA laboratories. However, aDNA
laboratories and their upkeep can be prohibitive in cost; therefore, institutions working
exclusively on natural history collections may choose a less stringent set-up for a
dedicated hDNA pre-PCR facility. This would usually be located in existing rooms rather
than a purpose-built laboratory. Not requiring a positive pressure air system and a
laboratory antechamber allows for more flexibility in the choice of location (e.g. repurposing
of laboratory spaces) and significantly reduces the costs. It should be noted that if an
hDNA facility is being established by repurposing an existing laboratory, thorough cleaning
with sodium hypochlorite of all surfaces is essential and new, dedicated equipment should
be bought. Similarly to aDNA facilities, all work should take place inside UV-fitted PCR
cabinets, and destructive sampling and sample powdering should take place in a separate
room or at least in a separate cabinet. Additional UV lamps for surface decontamination
may be fitted; however, repeated UV exposure is damaging to laboratory equipment,
increasing upkeep costs. Cleaning routines and good practices as described for aDNA
facilities should be implemented or adapted as best as possible, most importantly the
separation of pre- and post-PCR working areas. In Suppl. material 1, we outline a relatively
inexpensive (ca. €86K) and pragmatic equipment list for establishing a dedicated low-copy
facility for hDNA processing, that is flexible enough to be installed without extensive
building works.
Due to space or financial constraints or because of the need for higher throughput, it
remains the case that many institutions may decide against an aDNA or hDNA facility and
process natural history collection material in existing laboratories alongside fresh biological
material. Fresh material, and especially its amplification products, represents a common
source of contamination for historic material. Thus, the separation of pre- and post-PCR
areas remains essential, although the routes to achieving this can be varied. At the very
least, thermocyclers should always be located in the post-PCR area and movement of
samples, reagents, consumables and equipment between pre- and post-PCR should be
avoided or limited as best as possible. Additionally, pre-PCR work on collection material
should be carried out in dedicated laminar flow hoods (to be cleaned and UV-sterilised
regularly) with dedicated tools and reagents.
Ordering samples to avoid closely-related taxa being in adjacent wells: The most difficult
contamination to spot is from closely-related samples, as even detailed analysis and
comparisons with reference samples may not flag contaminants. Where there is a mixture
of closely- and more distantly-related taxa being processed, a simple option is to order
samples to maximise the likelihood of adjacent well contamination being detectable, by
minimising the presence of closely-related specimens in adjacent wells.
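One way to implement such an ordering is sketched below. This is a simple greedy heuristic rather than an established protocol: it assumes each sample carries a grouping label (e.g. genus) that stands in for relatedness, fills the plate row by row and checks only the wells immediately to the left and above. The function names and the example samples are hypothetical.

```python
def order_samples(samples, n_cols=12):
    """Greedy row-major plate layout avoiding same-group neighbours.

    `samples` is a list of (sample_id, group) tuples, where `group` labels
    close relatives. Returns the samples in well order (row by row).
    """
    remaining = {}
    for sample_id, group in samples:
        remaining.setdefault(group, []).append(sample_id)
    layout = []
    while remaining:
        pos = len(layout)
        left = layout[pos - 1][1] if pos % n_cols else None
        above = layout[pos - n_cols][1] if pos >= n_cols else None
        # Prefer the largest remaining group that differs from both neighbours;
        # fall back to the largest group if no such group is left.
        groups = sorted(remaining, key=lambda g: (-len(remaining[g]), str(g)))
        choice = next((g for g in groups if g not in (left, above)), groups[0])
        layout.append((remaining[choice].pop(0), choice))
        if not remaining[choice]:
            del remaining[choice]
    return layout


def adjacent_conflicts(layout, n_cols=12):
    """Count horizontally or vertically adjacent wells sharing a group."""
    conflicts = 0
    for pos, (_, group) in enumerate(layout):
        if pos % n_cols and layout[pos - 1][1] == group:
            conflicts += 1
        if pos >= n_cols and layout[pos - n_cols][1] == group:
            conflicts += 1
    return conflicts


# Hypothetical samples from three groups on a plate with three columns.
samples = [("a1", "A"), ("a2", "A"), ("a3", "A"),
           ("b1", "B"), ("b2", "B"), ("c1", "C")]
layout = order_samples(samples, n_cols=3)
print(layout, adjacent_conflicts(layout, n_cols=3))
```

The heuristic cannot always avoid conflicts (for example, when one group dominates the batch), so counting the remaining conflicts with `adjacent_conflicts` indicates which wells deserve closer scrutiny downstream.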
Negative controls: The inclusion of non-template negative controls at the DNA isolation and
library preparation/PCR step is essential for ensuring that the pre-PCR facilities and
reagents are sufficiently clean. Negative controls should also be taken through all post-
amplification steps, sequenced and included in the data analysis.
Sequencing strategies: Jumping PCR (Kircher et al. 2012) and index switching during
cluster generation on the Illumina sequencing platform (van der Valk et al. 2020) can result
in reads potentially being assigned to the wrong sample. This normally does not pose a
problem for libraries generated from high-quality DNA, but can introduce an artefactual
contamination into degraded DNA libraries. This can be overcome by unique dual-indexing
of libraries, a common practice in aDNA experiments. Using unique indices in both library
adapters allows the detection and removal of these chimeric PCR products. Unique dual-
indexing, if not repeated within the same laboratory, also allows monitoring of potential
cross-contamination between projects.
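The logic behind detecting switched indices with unique dual-indexing can be sketched as follows. This is a toy illustration, not the behaviour of any particular demultiplexing software: index sequences and sample names are hypothetical, and exact matching is assumed (no tolerance for sequencing errors in the index reads).

```python
from collections import Counter


def classify_index_pairs(observed_pairs, expected_pairs):
    """Classify reads by their (i7, i5) index pair.

    `expected_pairs` maps (i7, i5) -> sample name for a uniquely
    dual-indexed pool. A read whose two indices are both known but do not
    belong together is counted as a likely switch or chimeric product.
    Returns (reads per sample, number switched, number unknown).
    """
    known_i7 = {i7 for i7, _ in expected_pairs}
    known_i5 = {i5 for _, i5 in expected_pairs}
    per_sample = Counter()
    switched = unknown = 0
    for i7, i5 in observed_pairs:
        if (i7, i5) in expected_pairs:
            per_sample[expected_pairs[(i7, i5)]] += 1
        elif i7 in known_i7 and i5 in known_i5:
            switched += 1
        else:
            unknown += 1
    return per_sample, switched, unknown


# Hypothetical pool of two libraries and four observed reads.
expected = {("AAAA", "CCCC"): "sample1", ("GGGG", "TTTT"): "sample2"}
observed = [("AAAA", "CCCC"), ("AAAA", "TTTT"), ("GGGG", "TTTT"), ("ACGT", "TTTT")]
print(classify_index_pairs(observed, expected))
```

With a single index, or with combinatorial dual indices, the second read in this example would be indistinguishable from a genuine read and would be assigned to a sample; with unique pairs it can be identified and discarded.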
Data authentication and validation: In addition to checks against reference libraries, various
bioinformatic pipelines and workflows support the verification and authentication of
degraded DNA sequences. These include estimating contamination by screening for
unexpected results such as 'heterozygosity' of haploid genomes (mitogenomes, plastomes)
(Krause et al. 2010, Renaud et al. 2019). These approaches, based on deviations of
expected ploidy, require sufficiently high sequencing coverage, but do not necessitate any
a priori knowledge of the origin of the contamination (Peyrégne and Prüfer 2020). Likewise,
testing for the presence of sequencing artefacts due to post-mortem DNA damage (C to T
and G to A misincorporations) at the read ends can be undertaken using specific software
such as mapDamage2 (Jónsson et al. 2013) and PMDtools (Skoglund et al. 2014). This is
particularly relevant to older samples.
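The kind of signal that such tools look for can be illustrated with a toy calculation of C to T mismatch frequency by distance from the 5' end of the read. This sketch is not the implementation used by mapDamage2 or PMDtools; it assumes gap-free read/reference pairs of equal length that are already oriented along the read, and the example sequences are invented.

```python
def c_to_t_by_position(alignments, n_positions=5):
    """Fraction of reference C bases read as T, by distance from the 5' end.

    `alignments` is an iterable of (reference, read) string pairs of equal
    length, gap-free and oriented 5'->3' along the read. Positions with no
    reference C are reported as None.
    """
    c_total = [0] * n_positions
    c_to_t = [0] * n_positions
    for reference, read in alignments:
        for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
            if pos >= n_positions:
                break
            if ref_base == "C":
                c_total[pos] += 1
                if read_base == "T":
                    c_to_t[pos] += 1
    return [t / n if n else None for t, n in zip(c_to_t, c_total)]


# Invented alignments for illustration only.
alignments = [("CACGT", "TACGT"), ("CCGTA", "CTGTA"), ("CAGTC", "CAGTC")]
print(c_to_t_by_position(alignments, n_positions=3))
```

In authentic degraded DNA, this frequency is expected to be elevated at the first few positions and to decline towards the interior of the read, with the complementary G to A pattern at the 3' end.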
Table 1.
Selected papers outlining recent progress, breakthroughs and protocol developments that support
the more routine recovery of genomic data from museum specimens
Taxon | Specimen type | Specimen age (years) | Method | Summary | Reference
Plants | Dried herbarium specimens | most < 20 yrs, 165 samples > 50 yrs; oldest 153 yrs | Shotgun sequencing | Large-scale genome skimming study of 2051 herbarium specimens recovering plastome and rDNA sequences including standard plant barcodes | Alsos et al. (2020)
Fungi | Fungarium specimens | Less than 20 | Whole genome sequencing | Generation of draft genome assemblies possible and of value for enhancing resolution of fungal phylogeny | Dentinger et al. (2016)
Birds | Avian skins | up to ca. 150 | Whole genome sequencing | Step-by-step guide to workflow and protocols, including steps taken to minimise risks of contamination | Irestedt et al. (2022)
Mammals (Cricetidae, rodents, deer mouse) | Frozen liver tissue | 17 to 41 | Whole genome sequencing | Linked-read or “synthetic long-read” sequencing technologies provide a cost-effective alternative solution to assemble higher quality de novo genomes from degraded tissue samples | Colella et al. (2020)
A general challenge for the effective recovery of endogenous DNA from museum
specimens is the frequent low complexity of libraries caused by PCR and cleaning steps
modifying the relative abundances of the original DNA fragments during library preparation
(Casbon et al. 2011). This leads to the formation of artefactual PCR duplicates that may
bias sequencing results, decrease final coverage and increase sequencing costs (Rochette
et al. 2023). PCR duplicates can be removed during bioinformatic analysis (Marx 2017);
however, given the low concentration and quality of DNA from museum specimens, it may
be valuable to enhance the library complexity prior to sequencing. To this end, amounts of
starting material can be increased where possible (Fu et al. 2018) and single-tube library
preparation methods can be used (e.g. Carøe et al. (2018), Kapp et al. (2021)). PCR
conditions can also be adapted, for example, by selecting polymerases that are suited to
copy degraded DNA templates with good fidelity and without a severe tendency to
preferentially amplify DNA templates that are shorter or with higher GC content (Aird et al.
2011, Dabney and Meyer 2012, Seguin-Orlando et al. 2015). Protocols for archival
specimens generally perform amplifications in several independent PCR reactions with
minimal numbers of cycles (van der Valk et al. 2021, Irestedt et al. 2022) and with
sequencing efforts proportional to library complexity (Daley and Smith 2014). Finally,
sequencing by synthesis with 50 to 150 cycles format and single-end mode is usually more
cost-effective with short degraded fragments (Raxworthy and Smith 2021); however,
choosing more cycles or a paired-end mode may be valuable if fragment length
distributions are used to filter out contaminants.
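The relationship between library complexity, sequencing effort and duplication can be sketched with a simple sampling model. The example below assumes that reads are drawn uniformly at random, with replacement, from a fixed number of distinct library molecules (a Poisson approximation); real libraries deviate from this because PCR amplifies molecules unevenly. The function names and figures are illustrative.

```python
import math


def expected_unique_reads(library_size, reads_sequenced):
    """Expected number of distinct molecules observed under random sampling.

    Assumes reads are drawn uniformly, with replacement, from
    `library_size` distinct molecules.
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return library_size * (1.0 - math.exp(-reads_sequenced / library_size))


def expected_duplication_rate(library_size, reads_sequenced):
    """Expected fraction of reads that are duplicates under the same model."""
    if reads_sequenced <= 0:
        return 0.0
    return 1.0 - expected_unique_reads(library_size, reads_sequenced) / reads_sequenced


# Hypothetical library of one million distinct molecules.
for n_reads in (100_000, 1_000_000, 5_000_000):
    rate = expected_duplication_rate(1_000_000, n_reads)
    print(f"{n_reads:>9} reads: expected duplication rate {rate:.3f}")
```

Under this model, the returns from deeper sequencing diminish quickly once the number of reads approaches the number of distinct molecules, which is the rationale for keeping sequencing effort proportional to library complexity.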
This potential has received increasing attention in recent years (Raxworthy and Smith
2021) and several studies have looked at the recovery of DNA with a specific focus on
DNA isolation protocol optimisation to increase recovery of sequence data and minimise
damage to specimens for NGS sequencing (e.g. Patzold et al. (2020), Korlević et al.
(2021)). In many instances, even a limited amount of mtDNA data, such as short DNA
barcodes, will be sufficient for resolving taxonomic issues and characterising patterns of
species diversity in highly-diverse insect taxa (Yeo et al. 2020). Here, we evaluate the
performance of shotgun sequencing (low-coverage genome sequencing) of museum
specimens with a wide age range from three key insect taxa to explore the utility of a fast,
generic and inexpensive approach to obtain mtDNA data from natural history collections to
support DNA-based biodiversity inventories and biomonitoring.
strongly amongst samples. All samples with a completeness below 50% (7-33%, n = 8; Fig.
1c) were collected before 1950, while no sample collected after 1950 had less than 70%
completeness. The pattern for coverage was different, as both a relatively low coverage of
< 50 and a high coverage > 100 were found throughout the entire age range of samples
(Fig. 1d). An extremely high coverage of > 1000 was only found in Sarcophaga.
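For readers unfamiliar with these two metrics, the sketch below shows one common way of deriving them from per-position read depths along a reference mitogenome. The definitions used here (completeness as the percentage of positions covered by at least one read, coverage as the mean depth across all positions) are assumptions made for illustration and may differ from those used in this case study; the depth values are invented.

```python
def completeness_and_coverage(depths, min_depth=1):
    """Return (completeness in %, mean depth) from per-position read depths.

    Completeness is taken as the percentage of reference positions with at
    least `min_depth` reads; coverage is the mean depth over all positions.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    completeness = 100.0 * covered / len(depths)
    mean_depth = sum(depths) / len(depths)
    return completeness, mean_depth


# Invented depths for a ten-position reference.
print(completeness_and_coverage([0, 0, 10, 20, 30, 0, 40, 0, 0, 0]))
```

The example shows why the two metrics can behave differently: a few deeply covered regions can yield a respectable mean depth even when most of the reference is not recovered.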
Figure 1.
Scatter plots of DNA and sequence recovery from pinned insect specimens by age and taxon
(blue - Eudonia/Scoparia, grey - Sarcophaga, orange - Xylocopa). Specimen age (in years) is
on the x axis in all panels. A Total DNA yield (ng). B Number of sequencing reads. C
Completeness of the mitogenomes (%). D Coverage (n).
(2012) and Kistler et al. (2017). The completeness and coverage of recovered
mitogenomes showed a relationship with sample age, albeit with high levels of variation.
Thus, although all samples with low completeness were older than 80 years, there was
substantial variation in the completeness of mitogenomes amongst the oldest samples
(collected before 1920).
Overall, no sample failed entirely as measured by obtaining reads from mtDNA. One
sample did not yield any reads at all after the booster PCR, but the same sample worked
(albeit with only ca. 81,000 reads) for the library prepared without additional PCR
amplification.
In contrast to age, taxon does not seem to affect success rate once
variation amongst collecting dates is taken into account. Almost all samples of Xylocopa,
for example, were collected before 1940 and about 75% in 1912 or before, and the
apparently lower success rate for Xylocopa could be explained by age alone. Similarly, the
species and consequently specimens used in this study differed in body size, which was
not controlled for here and, as the total amount of DNA recovered is expected to be
dependent on tissue input, that value cannot be directly compared meaningfully between
specimens of different taxonomic groups.
Interestingly, to date, a limited number of studies have used shotgun sequencing for
museomics in insects and only one has targeted mtDNA in particular for taxonomy in a
specimen-based approach. All types of Australian prionine longhorn beetles were shotgun
sequenced by Jin et al. (2020), leading to a major revision of their taxonomy. The
methodological study by Timmermans et al. (2016) aimed to show the applicability of
shotgun sequencing in museum specimens, but by a metagenomic approach that
combined DNA extracts without individual sample indexing. Cong et al. (2021) used a
shotgun approach for taxonomic purposes in North American butterflies on a large number
of specimens, but targeted mainly nuclear genes and their study relied on creating a
reference genome from a modern sample first. Other shotgun sequencing studies have
aimed to place enigmatic taxa on the tree of life (Twort et al. 2021) or explore population
genetic structure (Cridland et al. 2018) or species conservation (Mikheyev et al. 2017) by
focussing on deeper sequencing of a few (2-29) museum specimens. While these studies
focussed on aspects of species biology, Mullin et al. (2023) examined > 100 specimens of
one species of bumble bee from the UK to study DNA preservation in museum specimens.
Their results also show that there is great potential in the use of museum specimens of
pinned insects for biodiversity genomics research.
For both low-coverage shotgun sequencing and target capture, there is a trade-off between
costs, time and sequencing success (measured by the completeness of the target
sequence(s)), which is likely to tilt towards bait capture when the aim is to sequence a
large number of closely-related samples. However, if the aim is to target a larger range of
taxa at the genus level and beyond for taxonomic purposes, such as DNA barcoding, then
shotgun sequencing has the edge in our opinion due to its relative ease and the
universality of approach. As sequencing costs are still on a downward trajectory, the cost
balance is likely to be tilted further in its favour in the future.
• Older samples will often require more sequencing effort to obtain the same amount
of data as more recent specimens. If the main aim is the generation of DNA
barcodes for taxonomic purposes, this should not be overly relevant in practice;
Data availability: The raw sequencing data from this case study has been deposited at the
European Nucleotide Archive (ENA) under project PRJEB59182 with accession IDs
ERS14475133 - ERS14475206.
The performance of commercial DNA extraction kits (see below) was compared in a pilot
study targeting the Royal Museum of Central Africa (RMCA) collections of “true” fruit flies
(Tephritidae, Diptera) and African hoverflies (Syrphidae, Diptera). We selected three to six
specimens from seven sample groups dating from 2008 to 2016. These included three
Tephritidae (Zeugodacus cucurbitae Coquillett, Bactrocera dorsalis Hendel and Dacus
bivittatus Bigot) and two Syrphidae (Eumerus sp. and Ischiodon aegyptius Wiedemann)
species; all specimens were stored in 100% ethanol at -20°C, except Ischiodon aegyptius
which was pinned and preserved at room temperature. Digestions in lysis buffers were
implemented on whole bodies for all specimens. For comparative purposes, we also
processed forelegs of a separate set of specimens. The lysates obtained from each
specimen were divided into four aliquots and the DNA purified using spin columns from the
DNA extraction kits listed in Table 2 following the manufacturers' instructions. The
experimental design was based on 30 whole specimens and 18 legs (two negative controls
were also included); these samples were processed through 200 spin columns from four
different extraction kits. The concentration of each DNA extract was measured using a
Qubit 3 fluorometer (HS DNA Kit, Thermo Fisher Scientific) and the total amount of DNA
was inferred from the final elution volume, which in all cases was 100 µl.
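The bookkeeping behind this design can be checked with a short sketch. It assumes, as the text implies, that every lysate (including the two negative controls) was split across all four kits, and it converts a Qubit concentration into a total yield using the fixed 100 µl elution volume. The function and variable names are illustrative only, and the example concentration is made up.

```python
# Illustrative bookkeeping for the pilot design; names are hypothetical.
WHOLE_SPECIMENS = 30
LEGS = 18
NEGATIVE_CONTROLS = 2
KITS = 4                 # one aliquot of each lysate per extraction kit
ELUTION_VOLUME_UL = 100  # final elution volume stated in the text


def spin_columns_needed() -> int:
    """Each lysate is split into one aliquot per kit, i.e. one column per aliquot."""
    lysates = WHOLE_SPECIMENS + LEGS + NEGATIVE_CONTROLS
    return lysates * KITS


def total_dna_ng(concentration_ng_per_ul: float) -> float:
    """Total DNA in an extract = Qubit concentration x elution volume."""
    return concentration_ng_per_ul * ELUTION_VOLUME_UL


print(spin_columns_needed())  # 200
print(total_dna_ng(0.5))      # 50.0 (made-up concentration of 0.5 ng/µl)
```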
Table 2.
An overview of the DNA extraction kits tested on fruit flies and hoverflies.
QIAGEN kit (50 samples) | Spin column | Range of DNA fragment sizes (according to manufacturer’s instructions) | Expected DNA yield (according to manufacturer’s instructions)
To assess the relationship between WGS performance and (a) voucher age and
preservation and (b) DNA quality and quantity, we targeted a total of 732 insect vouchers
archived in the collections of RMCA collected from 1997 to 2022 (Fig. 2) and preserved
either in ethanol at -20°C (n = 651), pinned at room temperature (n = 14) or dried DNA
stored at room temperature (n = 67). All DNA extractions were performed using the
DNeasy Blood and Tissue Kit (Qiagen catalogue nr. 0148945380). We quantified the
amount of DNA extracted as measured by a Qubit 4 fluorometer (HS DNA Kit, Thermo
Fisher Scientific) and the quality of DNA via DNA fragment size distributions as measured
using the DNF-930 dsDNA Reagent Kit (75 bp – 20000 bp) on the fragment analyser of the
Genomics Core (Leuven, Belgium). These vouchers were considered as suboptimal for
genomic analyses as preliminary pilot tests (results not shown) revealed generally lower
concentrations of DNA and higher levels of DNA fragmentation compared to freshly-
processed specimens.
Based on DNA concentrations (above or below 7.0 ng/µl) and DNA fragmentation (> 350
bp or highly fragmented defined as < 350 bp), samples were submitted to Berry Genomics
(n = 563) for standard library preparation or to Novogene (n = 81) for low input DNA library
preparation, respectively. All samples were sequenced at 10x coverage on an Illumina
NovaSeq platform (150 PE reads, 6 Gb raw data output/sample). Quality parameters of
extracted DNA and WGS data of specimens originating from five insect genera and more
than 70 different species were collected (see Table 3 and Suppl. material 2).
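The triage described above can be sketched as a simple decision rule. The two thresholds come from the text, but how the two criteria were combined is not stated, so this sketch assumes that a sample goes to standard library preparation only when it passes both, and that fragmentation is summarised by one representative fragment size per extract. The class, field and function names are hypothetical.

```python
from dataclasses import dataclass

CONC_THRESHOLD_NG_UL = 7.0   # from the text
FRAGMENT_THRESHOLD_BP = 350  # from the text


@dataclass
class Extract:
    voucher_id: str
    conc_ng_ul: float
    fragment_bp: float  # assumed: one representative fragment size per extract


def library_route(extract: Extract) -> str:
    """Return 'standard' or 'low_input' for a DNA extract.

    Assumption: failing either criterion sends the sample to low-input
    library preparation; the source does not spell out the combination rule.
    """
    passes_conc = extract.conc_ng_ul >= CONC_THRESHOLD_NG_UL
    passes_size = extract.fragment_bp > FRAGMENT_THRESHOLD_BP
    return "standard" if (passes_conc and passes_size) else "low_input"


# Made-up extracts for illustration.
print(library_route(Extract("V001", 12.3, 800)))  # standard
print(library_route(Extract("V002", 2.1, 800)))   # low_input
print(library_route(Extract("V003", 12.3, 180)))  # low_input
```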
Figure 2.
Age distribution of processed specimens of fruit flies and hoverflies.
Table 3.
Number of processed collection vouchers from three Tephritidae and two Syrphidae genera.
Family       Genus        Number of vouchers
Tephritidae  Bactrocera   16
Syrphidae    Eristalinus  83
Syrphidae    Melanostoma  25
Our results show a general trend of decreasing recovery of DNA from older specimens
compared to younger specimens (Fig. 4a). In contrast, our assessment of DNA quality as
estimated by fragment lengths of the DNA extracts and Phred score (Q > 30) of raw
sequences lacks any clear temporal trend, with degraded DNA with short fragment lengths
and low quality reads recovered across the range of specimen ages (Fig. 4b and c).
Figure 3.
Boxplots of DNA yields from replicated elutions of Tephritidae and Syrphidae of A whole body
digestions and B leg digestions per DNA extraction method (Qiagen, DNeasy Blood and
Tissue Kit, cat. 69506; QIAamp DNA Mini, cat. 51304; QIAamp DNA Micro, cat. 56304;
MinElute PCR Purification, cat. 28004).
Figure 4.
Boxplots per collection year for Tephritidae and Syrphidae specimens extracted with the
DNeasy Blood and Tissue Kit (Qiagen): A DNA quantities (calculated from concentration
measured with Qubit 4.0); B the proportion of DNA fragments between 35 and 350 bp
(measured with Fragment Analyzer (DNF-930 dsDNA Reagent kit)); C proportion of
sequenced reads with Q > 30.
Sub-optimal or low-quality DNA from museum specimens is often not directly suitable for
genetic/genomic analyses (Besnard et al. 2016). However, our results suggest that
standard DNA extraction, based on commercially available kits followed by WGS at 10x
genome coverage, represents a cost/time-effective, pragmatic approach to the routine,
large-scale genotyping of insect vouchers collected over the past two decades. The
majority of samples processed in this analysis were of material stored in ethanol. The
samples that were pinned at room temperature (n = 14) or stored as dried DNA at room
temperature (n = 67) showed DNA quantity and quality similar to those of DNA from
specimens stored in ethanol.
The DNA of these diverse and heterogeneously collected samples, even if generally
suboptimal in terms of concentration, fragmentation and contamination, still allowed
recovery of substantial amounts of quality reads (Q > 30) of potential use for genomic
research. This general approach needs to be complemented with more specialised and
time/cost-demanding procedures for highly-degraded DNA from older specimens. A two-step
approach, in which the commercial kits and methods outlined here are used for rapid
screening of younger specimens and the more intensive protocols (including aDNA
methodologies) are reserved for older specimens, represents a pragmatic, cost-effective
route to the routine genotyping of our insect collections.
• The DNeasy Blood and Tissue Kit (Qiagen) provided a cost-effective method of
extracting DNA from museum specimens aged one to 25 years;
• For older material, the use of low input library preparation for highly-fragmented
and low concentration DNA extracts is recommended.
Data availability: The data and meta-data from this case study are documented in Suppl.
material 2.
genetic reference databases, especially in areas where field expeditions are not feasible
anymore due to political instability or increased inaccessibility. Here, we demonstrate the
usefulness of genome skimming by shotgun sequencing to mine herbarium specimens for
the assembly of their plastomes to support DNA-based identification of trade timber
species.
Figure 5.
Transportation of large tree trunks from the forest to enter the international timber trade.
The quality and quantity of DNA in herbarium specimens is strongly reliant on collection
and storage conditions and, in general, herbarium DNA can be highly fragmented (< 150
bp) and only available in very low amounts (< 5 ng/µl). Interestingly, the techniques
optimised for historical herbarium specimens can also be applied to heartwood specimens
(i.e. old degenerated material) of processed wood. By jointly analysing herbarium material
and silica-dried leaf samples, a clear comparison can be made on the feasibility of
historical material for genome skimming purposes, with the aim of yielding full plastid
genomes of selected species that are under strong pressure due to illegal logging activities
in Central Africa.
Tree species were selected, based on the following criteria: providing highly valuable
timber, becoming potentially important for national and international timber trade or for
being reported in agreements on global biodiversity conservation (e.g. CITES, IUCN). In
the case of herbarium samples, there was a careful selection based on prior knowledge
about the specimens. We avoided material that was likely to have been: (a) dried with
alcohol, (b) treated with preservatives after collection (e.g. HgCl2) or (c) collected in
remote areas where it was difficult to properly dry the specimens in the field. When
sampling from herbarium specimens, we aimed to: (d) collect the greenest leaf tissue, (e)
avoid the central leaf vein and (f) avoid leaves with potential markings of insect herbivory.
Total genomic DNA of both silica and herbarium material was extracted using a combined
and modified version of the CTAB protocol (Doyle and Doyle 1987) in which an initial
washing step with 0.35 M d-sorbitol was included. The lysis buffer contained 2% CTAB and
2% PVP-40. A chloroform-isoamyl alcohol (24/1 v/v) extraction was carried out twice. After
a cold isopropanol precipitation and subsequent centrifugation, the pellet was washed with
70% ethanol and air-dried. The DNA pellet was eluted with 1X TE buffer. All herbarium
specimen DNA extractions were carried out under a laminar flow hood with positive air
pressure and UV disinfection.
The purity of the resulting DNA was assessed from the absorbance ratios OD 260/280
and OD 260/230 using a NanoDrop 2000 (Thermo Fisher Scientific, US). DNA
concentration (ng/μl) and fragment size distribution were measured by capillary
electrophoresis using a Fragment Analyzer (Agilent, US). Library preparation of the silica-
dried leaf material was initiated via an enzymatic DNA fragmentation step with the aim to
retain DNA fragments with a size between 200 and 450 bp after which an end repair step
took place. This step was conducted using the NEBNext Ultra™ II FS DNA Library Prep
Kit for Illumina (New England Biolabs, US). Due to the presence of already degraded DNA
in the herbarium specimens, the enzymatic DNA fragmentation step was not carried out for
the herbarium material. Adapter ligation was conducted using the NEBNext Adaptor kit for
Illumina and U-excision was carried out using the USER Enzyme kit (New England
Biolabs, US). Size selection (320–470 bp) was conducted following the SPRIselect
protocol (Beckman Coulter, US). With the NEBNext Ultra II Q5 Master Mix, adaptor-ligated
DNA was indexed, then PCR-enriched with the NEBNext Multiplex Oligos for Illumina (New
England Biolabs, US). For the latter, the following thermocycler programme was used: initial
denaturation at 98°C for 30 s; 3–4 cycles of denaturation at 98°C for 10 s and
annealing/extension at 65°C for 75 s; and a final extension at 65°C for 5 min. In
the last step of the library preparation, a DNA-library purification was conducted using
SPRIselect (Beckman Coulter, US). The final fragment size distribution and molarity (nM)
were examined with a Fragment Analyzer (Agilent, US). Indexed libraries were
subsequently pooled (on average 25 samples per lane) in equimolar ratios. Sequencing of
the DNA libraries (low coverage paired-end; 10X, 150 bp) was done on a HiSeq 3000, a
HiSeq 4000 and NovaSeq 6000 (Illumina, US). At the time of analysis, between autumn
2019 and spring 2021, library preparation and sequencing costs were estimated at a total
of €45 - 50 per sample.
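As an illustration of the equimolar pooling step, the sketch below converts library molarities (nM, as reported by the Fragment Analyzer) into the volume of each library needed to contribute the same molar amount to a pool. It relies only on the unit identity 1 nM = 1 fmol/µl. The function name, the target amount per library and the example molarities are assumptions for illustration and are not taken from the study.

```python
# Minimal sketch of equimolar pooling; all names and the target amount are
# illustrative assumptions, not values from the study.

def pooling_volumes_ul(molarity_nm: dict[str, float], target_fmol: float) -> dict[str, float]:
    """Volume of each library that contributes `target_fmol` to the pool.

    1 nM equals 1 fmol/µl, so volume (µl) = amount (fmol) / molarity (nM).
    """
    volumes = {}
    for library, nm in molarity_nm.items():
        if nm <= 0:
            raise ValueError(f"library {library!r} has non-positive molarity")
        volumes[library] = target_fmol / nm
    return volumes


# Hypothetical molarities (nM) for three indexed libraries.
example = {"lib_A": 10.0, "lib_B": 20.0, "lib_C": 5.0}
print(pooling_volumes_ul(example, target_fmol=20.0))
# {'lib_A': 2.0, 'lib_B': 1.0, 'lib_C': 4.0}
```

In practice the target amount would be chosen so that the most dilute library still fits within the available volume.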
Data analysis
The quality of the raw reads was evaluated with FastQC (Andrews 2010). Plastomes were
assembled de novo with the GetOrganelle pipeline (Jin et al. 2020), with k-mer values set
as “-k 21, 45, 65, 85, 105” and 15 extension rounds for all species. The pipeline
was initiated by recruiting plastid-like reads with Bowtie 2 (Langmead and
Salzberg 2012). During the assembly process, reads were trimmed and contigs
reconstructed with SPAdes 3.13 (Bankevich et al. 2012). In addition, plastid-like contigs
were filtered by comparing them against the BLAST nucleotide database using the
NCBI BLAST+ tool (Camacho et al. 2009). In the next step, reconstructed plastomes were
aligned against a reference genome with MAFFT v.7 (Katoh et al. 2002), thereby aiming for
the most closely-related taxon for comparison that could be found on GenBank. In the case
of unsuccessfully assembled plastome regions using the GetOrganelle pipeline, raw reads
were then mapped to target regions of closely-related species using Bowtie 2 (Guibourtia
pellegriniana and G. tessmannii). Annotation was conducted with the web-based software
CpGAVAS2 (Shi et al. 2019), after which the annotation results were verified in
Geneious Prime (Kearse et al. 2012) by comparison with a reference plastome obtained
from GenBank.
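For readers who want to reproduce the assembly settings, the sketch below builds (but does not run) a GetOrganelle command line with the k-mer values and the 15 extension rounds given above. The file names are placeholders, and the seed database label `embplant_pt` is an assumption: the text does not state which GetOrganelle database was selected. The helper function is hypothetical.

```python
import shlex


def getorganelle_command(fq1: str, fq2: str, out_dir: str,
                         seed_db: str = "embplant_pt") -> list[str]:
    """Assemble an argument list for get_organelle_from_reads.py.

    seed_db is an assumed value; the k-mers and round limit follow the text.
    """
    return [
        "get_organelle_from_reads.py",
        "-1", fq1,                  # forward reads
        "-2", fq2,                  # reverse reads
        "-o", out_dir,              # output directory
        "-F", seed_db,              # organelle type (assumption)
        "-R", "15",                 # maximum extension rounds
        "-k", "21,45,65,85,105",    # SPAdes k-mer values
    ]


# Placeholder file names; pass the list to subprocess.run() to execute it.
cmd = getorganelle_command("sample_R1.fq.gz", "sample_R2.fq.gz", "sample_plastome")
print(shlex.join(cmd))
```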
The results obtained in this case study corroborate those of some recently-published
studies on the use of genome skimming for the retrieval of full plastomes of land plants
(Bakker et al. 2016, Zeng et al. 2018, Alsos et al. 2020, Nevill et al. 2020). Each of those
studies indicates the scalable potential of genome skimming to obtain plastome sequences
from herbarium specimens. Using this approach, a high level of success has been
achieved across a range of ages of herbarium specimens (Alsos et al. 2020, Nevill et al.
2020) and even with small amounts of tissue, an effective plastome assembly can be
generated (Bakker et al. 2016). This collective body of studies shows that genome
skimming represents an inexpensive, pragmatic approach for recovery of plastome
sequences that can be applied to small amounts of herbarium material.
• Genome skimming of herbarium specimens has shown high success rates across
multiple independent studies;
• Despite the often lower number of reads retrieved from herbarium specimens
compared to fresh tissue, it is becoming increasingly routine to generate complete
(or almost complete) plastomes from herbarium material using genome skimming;
• Since one of the most important steps in the genome skimming protocol is to
downsize the DNA fragment length, the often highly-degraded DNA of herbarium
specimens allows the sonication step to be bypassed in the library preparation
protocol.
Data availability: The data in this case study are available under the following GenBank
numbers: MZ274087, MZ274092, MZ274094-MZ274096, MZ274099, MZ274102-
MZ274107, MZ274110, MZ274113, MZ274116-MZ274122, MZ274124, MZ274127-
MZ274129, MZ274132, MZ274135, MZ274137, MZ274143, MZ274145, MZ274147,
MZ274148 (see Mascarello et al. (2021)).
The preservation and quality of DNA in herbarium material are highly variable. It has been
suggested that DNA decays at a faster rate in plant remains compared to animals (Allentoft
et al. 2012, Weiß et al. 2016) and the techniques used for herbarium sheet preparation and
the storage conditions of specimens have been shown to affect DNA recovery and
fragmentation rates (Särkinen et al. 2012, Forrest et al. 2019). Studies comparing recently
prepared and older herbarium specimens do not reach a consensus on whether DNA
fragmentation and damage happen mainly at specimen preparation (e.g. Staats et al.
(2011)) or accumulate gradually over time (e.g. Weiß et al. (2016)). These discrepancies
are likely the result of different preparation techniques, ranging from gentle drying in non-
acidic paper to high heat and chemical treatments. For this reason, for herbarium
specimens, it is common to see both laboratory protocols aimed at recovering and
sequencing low-concentration, fragmented DNA (e.g. Latorre et al. (2020)) and protocols
commonly used for higher-quality DNA sources (e.g. Forrest et al. (2019)).
In this study, we sampled specimens from the herbarium at the Royal Botanic Garden
Edinburgh (RBGE) that were collected 12-50 years ago from cultivated individuals of
Rhododendron javanicum. These cultivated individuals are still present as live plants in the
living collection at RBGE and allowed us to assess the reliability of sequences recovered
from herbarium material compared to freshly-collected samples from the same genetic
individuals. The chosen sequencing approach was hybridisation capture (also known as
target capture or target DNA enrichment) which is an effective sequencing approach for
studies utilising degraded DNA sources because it enables recovery of sequence data
from low concentrations of endogenous DNA (Carpenter et al. 2013).
DNA from herbarium specimens was extracted as described in Latorre et al. (2020), using
the Basic Protocol 1, following standard anti-contamination precautions (Gilbert et al. 2005,
Llamas et al. 2017), including parallel non-template controls. DNA fragment size
distribution for extracts was inspected with the gDNA Kit on the Agilent Femto Pulse
System. Sequencing library preparation protocol was selected based on DNA fragment
size (Table 4).
Table 4.
Details of Rhododendron herbarium samples used in this study including collection and accession
numbers, as well as library protocols used. RHD002 and RHD007 herbarium specimens relate to
the same single individual in the living collection, as do RHD016 and RHD018, respectively. Two
samples (RHD011 and RHD018) had sequencing libraries prepared using two different protocols.
Fresh samples from the living collection were also collected for all individuals. DNA fragment size
distribution: size as stated, except bimodal which means one peak of < 1000 bp and one peak of
approximately 1-20 kbp. ssDNA = single-stranded DNA library, NEB = NEBNext Ultra II library with
sonicated DNA. *All sequencing libraries for the living collection were prepared using NEBNext
Ultra II kits with sonicated DNA.
RHD018  R. javanicum javanicum  E01016322  1972  19680840  < 500 bp (+tail)  ssDNA, NEB
DNA extracts with fragments shorter than 500 bp (n = 5) were converted into single-
stranded DNA (ssDNA) libraries following Kapp et al. (2021) with tier four adapter dilutions
and unique dual indexes. All steps up to indexing PCR reactions were carried out in
dedicated ancient DNA facilities at the University of Oslo, Norway. Sequencing library
quality and concentration were assessed by qPCR following Meyer and Kircher (2010) and
with the Ultra Sensitivity NGS Kit on the Agilent Femto Pulse System. Libraries were then
re-amplified in three 25 μl reactions with Herculase II Fusion DNA polymerase (Agilent).
One sample (RHD011) with longer DNA fragments was also included in this library
preparation batch.
short fragments, but with a tail of longer fragments. Aliquots of these extracts were
subjected to 8-12 sonication cycles of 30 s on, 90 s off, using a Diagenode Bioruptor
sonicator, for a target fragment size of 200-400 bp. Libraries were generated with the
NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (New England Biolabs) and indexed
with NEBNext® Multiplex Oligos for Illumina® (Unique Dual Index Primer Pairs). Because
these libraries were produced in non-dedicated facilities where fresh plant material is
regularly processed — posing a risk for contamination — the following anti-contamination
precautions were taken: pre-amplification steps were carried out inside a dedicated laminar
flow hood in a pre-PCR room with dedicated reagents and consumables and negative non-
template controls were included.
Approximately 150 mg of leaf material was harvested into 7.6 ml FluidX tubes and placed
immediately into liquid nitrogen. DNA was extracted using a protocol developed for
extracting high molecular weight DNA (Nishii et al. 2023). This protocol, which includes a
sorbitol wash prior to using the Qiagen Genomic Tip Kit, was used due to the high quantity
of secondary metabolic compounds present in Rhododendron. The DNA extracts were
sonicated for 7-11 cycles of 30 s on, 90 s off, using a Diagenode Bioruptor sonicator, for a
target fragment size of 200-400 bp. Libraries were generated with the NEBNext® Ultra™ II
DNA Library Prep Kit for Illumina (New England Biolabs) and indexed with NEBNext®
Multiplex Oligos for Illumina® (Unique Dual Index Primer Pairs). These libraries were
generated in non-dedicated facilities with pre-amplification steps carried out inside a
dedicated laminar flow hood with dedicated reagents and consumables.
Hybridisation capture was performed on all libraries. The assay was designed using two
published Rhododendron genomes from NCBI: R. delavayi (Zhang et al. 2017) and R.
williamsianum (Soza et al. 2019) and a transcriptome from the mature leaf of R.
scopulorum from the 1000 Plants (1KP) project (Matasci et al. 2014). The bait set contains
492 target loci, including 298 loci orthologous to the Angiosperm353 universal probe set
(Johnson et al. 2019). The remaining 194 loci were picked from genes related to cold
tolerance, flowering pathway, key developmental regulators of meristem function, organ
development and trichome development. Baits were synthesised by MyBaits (Arbor
Biosciences) with 3X bait tiling to be optimal for degraded DNA.
Libraries were pooled according to material and library construction protocol. The samples
were processed with a wider set of samples than are presented here, such that each pool
contained 10-14 libraries. Negative controls were pooled separately. Hybridisation capture
was performed following the MyBaits (Arbor Biosciences) protocol v.5.02 with the high
sensitivity version and the hybridisation and wash temperatures set to 63°C for herbarium
samples (the second round of enrichment was omitted) and with the standard version and
hybridisation and wash temperatures set to 65°C for living samples. Pools were re-
amplified post-capture in two 50 μl reactions with Herculase II Fusion DNA polymerase
(Agilent) for 14 cycles. Captured libraries for living and herbarium samples were
sequenced on separate Illumina MiSeq lanes with no index repetition, with 150 bp PE v.2
runs (4.5 - 5.0 Gb) at the University of Exeter sequencing facilities. The target region of the
bait set is 621,078 bp, which translates to an average coverage of 720X/sample for the 10
living samples and 600X for the 12 herbarium samples.
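The coverage figures quoted above can be approximated with a back-of-the-envelope calculation. This sketch assumes the lower bound of the stated run output (4.5 Gb), an equal split between samples and every sequenced base falling on target, so it gives a theoretical upper bound rather than a realised depth. The function name is illustrative.

```python
TARGET_BP = 621_078  # size of the bait-set target region, from the text


def mean_target_coverage(run_output_bp: float, n_samples: int,
                         target_bp: int = TARGET_BP) -> float:
    """Theoretical mean depth if output is split evenly and fully on target."""
    return run_output_bp / n_samples / target_bp


living = mean_target_coverage(4.5e9, 10)
herbarium = mean_target_coverage(4.5e9, 12)
print(f"living: ~{living:.0f}X, herbarium: ~{herbarium:.0f}X")
```

Both values fall close to the rounded figures of 720X and 600X given in the text.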
Data analysis
Herbarium reads were processed with the PALEOMIX v.1.3.7 BAM pipeline (Schubert et al.
2014). Paired-end reads were trimmed, filtered, and collapsed with AdapterRemoval v.
2.3.3 (Lindgreen 2012), discarding reads shorter than 25 bp. Collapsed reads were aligned
to the target loci used for probe design with BWA v.0.7.17 (Li and Durbin 2009), using the
backtrack algorithm with disabled seeding and a minimum quality score of 25. mapDamage
v.2.2.1 (Jónsson et al. 2013) was used to assess aDNA deamination patterns and rescale
BAM file quality scores. Living collection reads were processed as described for herbarium
reads, but without read collapsing and retaining only reads longer than 50 bp. The BWA MEM
algorithm was used for read alignments to the same references.
Quantity and quality of the SNPs called for the herbarium samples were assessed by
comparison to the sequence from their respective paired living sample. First, a new
reference for each individual was generated using sequence data from only the living
sample of that individual. BAM files from the initial run of PALEOMIX (above) were filtered
using strict settings on bcftools v.1.16 (filter SNPs by QUAL > 160 and DP > 10) and
consensus fasta files were generated to be used as a new reference (Forrest et al. 2019).
The new reference was used to run the PALEOMIX BAM pipeline for a second round for
the same living and their respective herbarium sample pairs, this time using the individual
new references rather than the original target sequences. New VCF files were generated
from the output BAM files and bcftools stats was used to compare SNPs called from the
living and from the herbarium material. We identified SNPs that were shared by living and
herbarium samples but absent from the new reference (likely heterozygous sites) and those
exclusive to the herbarium samples (likely erroneous SNPs). The code used to analyse the
data and make figures is available at:
https://github.com/rbgedinburgh/dna_sequencing_herbaria.
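The logic of the living-versus-herbarium comparison can be illustrated with plain set operations. This is a simplified stand-in for the bcftools-based comparison described above, not the authors' code: it assumes that SNP calls have already been parsed from the VCF files into (locus, position, alternate allele) records, and the example calls are made up.

```python
from typing import Iterable, NamedTuple


class Snp(NamedTuple):
    locus: str  # target locus name
    pos: int    # 1-based position on that locus
    alt: str    # alternate allele called against the individual's reference


def classify_snps(living: Iterable[Snp],
                  herbarium: Iterable[Snp]) -> dict[str, set[Snp]]:
    """Split SNP calls for one individual into shared and exclusive sets."""
    living_set, herbarium_set = set(living), set(herbarium)
    return {
        "shared": living_set & herbarium_set,          # likely heterozygous sites
        "living_only": living_set - herbarium_set,
        "herbarium_only": herbarium_set - living_set,  # likely erroneous calls
    }


# Made-up calls for one living-herbarium pair.
living_calls = [Snp("locus1", 120, "T"), Snp("locus2", 45, "G")]
herbarium_calls = [Snp("locus1", 120, "T"), Snp("locus3", 9, "A")]
result = classify_snps(living_calls, herbarium_calls)
print({category: len(snps) for category, snps in result.items()})
# {'shared': 1, 'living_only': 1, 'herbarium_only': 1}
```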
Without any prior assumption of DNA fragmentation rates in the herbarium samples
processed in this study, our approach consisted of isolating DNA in dedicated clean
facilities. Following an assessment of DNA fragment size, we decided to separate samples
into two categories. Fragmented DNA extracts were treated, following aDNA protocols in
dedicated facilities, whereas extracts containing longer DNA fragments were taken to non-
dedicated facilities for DNA shearing and library preparation using commercially available
kits. We assessed coverage of targeted loci (Fig. 6A) and library complexity, using read
clonality (Fig. 6B), by mapping reads to the target loci used for probe design. As
expected, we obtained higher coverage of targeted loci from freshly-collected samples for
similar amounts of sequencing effort. For herbarium samples, libraries generated from
sheared DNA using the NEB kit had higher complexity and higher coverage of targeted loci
than ssDNA libraries that were generated from highly-degraded DNA. This is to be
expected given the difference in quality of input DNA. Detailed mapping statistics are
available in Suppl. material 5.
Figure 6.
Sequencing coverage of targeted loci and library complexity for Rhododendron specimens. A
Coverage of targeted nuclear loci. B Proportion of PCR duplicates. LIV = libraries generated
from living collection samples, ssDNA = single-stranded DNA libraries made from degraded
herbarium DNA, NEB = double-stranded DNA libraries made from sheared herbarium DNA
using a commercial kit. C DNA deamination patterns of read data obtained from NEB (red,
blue) and ssDNA (dark red, dark blue) herbarium libraries with mapDamage v.2.2.1. First base
was removed for visualisation.
We did not detect any amplification products in the negative controls of libraries generated
in the non-dedicated facilities following anti-contamination measures, indicating that it was
possible to process herbarium DNA extracts of sufficient DNA concentration and fragment
size under these conditions and with the necessary precautions. In addition to the ssDNA
library construction protocol (Kapp et al. 2021), we also tested a dsDNA protocol optimised
for aDNA (Meyer and Kircher 2010, Kircher et al. 2012) (data not shown), but we observed
high levels of sequence clonality, possibly caused by PCR inhibition. DNA isolation and
sequencing library preparation for plant material can be complicated by secondary
compounds, such as polysaccharides and polyphenols that can bind to and co-precipitate
with DNA, resulting in PCR-inhibition (Souza et al. 2012). Rhododendron is rich in
secondary metabolic compounds (which also led to difficulties in extracting DNA from fresh
material) and it is possible that the initial DNA denaturation step in the ssDNA library
preparation protocol had a beneficial effect on breaking crosslinks between DNA and
secondary compounds (compared to the dsDNA protocol). We only tested a small number
of samples, but the efficacy of this comparatively fast and cheap ssDNA protocol is
promising and further testing on short degraded DNA isolated from herbarium material
would be worthwhile.
Finally, we observed mild deamination patterns in reads recovered from herbarium material
(Fig. 6C) compatible with historic DNA damage (Jónsson et al. 2013), although the
magnitude of this was very small (ca. 3% first base misincorporation) compared to levels
often observed in older material. Interestingly, we observed similar deamination patterns in
libraries generated with the ssDNA protocol and the NEB kit, indicating that despite DNA
shearing, and the NEB library preparation including a USER enzyme hairpin loop adaptor
cleavage step, enough base deaminations at DNA overhangs were retained to show
evidence of post-mortem DNA damage (Jónsson et al. 2013).
We took advantage of cultivated plants present in the RBGE living collection, from which
herbarium vouchers were created 12-50 years ago, to investigate whether sequences
recovered from the herbarium samples were an accurate biological replicate of the living
material, or if low starting templates and base modifications, both features that accumulate
in degrading DNA over time, resulted in erroneously-called bases (Briggs et al. 2007).
Using only sequences recovered from living material, we assembled a strict consensus
sequence for each individual. These were used as a new reference for mapping and SNP
calling. We assigned SNPs as being exclusive to living samples, exclusive to herbarium
samples or shared between a living-herbarium pair of the same individual. SNPs exclusive
to living samples might be caused by ambiguous calls at heterozygous sites, while SNPs
exclusive to herbarium samples can be interpreted as erroneous SNPs, likely due to low
SNP quality, low coverage or base modifications in degraded DNA. In contrast, shared
SNPs between living and herbarium tissue can be interpreted as true.
We typically observed 75-108 SNPs per individual, of which 45-87 were shared between
living and herbarium samples (Fig. 7A) and 12-36 were found in herbarium specimens
only. We did not observe a correlation between specimen age and proportion of these
likely erroneous SNPs. We also inspected the quality and sequencing depth of SNPs and
found that SNPs exclusive to herbarium samples were of much lower quality than those
present in both herbarium and living material (Fig. 7B and C). The quality and depth of
these erroneous SNPs unique to herbarium specimens are, however, above standard
SNP-filtering thresholds. In our study, the use of more stringent filtering criteria for
herbarium SNPs is required to give a better representation of ‘true’ sequence variants (e.g.
those also recovered from non-degraded tissues).
observed spike in SNP abundance is due to these specimens having higher levels of
heterozygosity due to hybridity. With a greater number of variable sites, there is an
associated increased possibility of detecting both genuine (shared) SNPs with respect to
the reference, as well as a corresponding increase in erroneous SNPs due to poor
coverage of these sites in herbarium material.
Figure 7.
Comparison of SNPs recovered from herbarium and living collection samples of the same
individuals of Rhododendron species. A Number of SNPs called, categorised into exclusive to
living samples (light blue, likely caused by ambiguous calls at heterozygous sites), exclusive to
herbarium samples (dark blue, likely caused by sequencing errors due to degraded DNA) or
shared (yellow). B Depth and C quality of shared and herbarium-exclusive SNPs. ssDNA =
single-stranded DNA libraries made from degraded herbarium DNA, NEB = double-stranded
DNA libraries made from sheared herbarium DNA using a commercial kit.
Multiple studies have now been undertaken exploring the potential of hybridisation capture
for the recovery of sequence data from herbarium specimens. These have included
exploratory studies assessing the feasibility of the approach for recovering sequence data
from plant specimens with a range of different ages (Hart et al. 2016), studies evaluating
the impacts of different treatment methods on sequencing success (Brewer et al. 2019,
Forrest et al. 2019) and those exploring the practicalities of scaling hybridisation capture in
plants, including how the characteristics of specimen origin and condition influence
sequence recovery (Folk et al. 2021, Kates et al. 2021). Collectively, these and other
studies have provided clear evidence that the recovery of large amounts of nuclear
sequence data is feasible for herbarium specimens with a wide range of ages and across
different taxonomic groups. In the current case study, we have shown that erroneous base-
calls can be made due to low starting templates and modified bases in herbarium
specimens. However, with stringent filtering for quality and depth, these erroneous SNPs
can be excluded, such that the remaining SNPs represent a more accurate reflection of the
individual’s genotype.
• Studies recovering DNA from herbarium specimens should take place in dedicated
clean or low-copy facilities. Once DNA fragment length distribution is known,
sequencing library preparation can take place according to DNA size;
• DNA extracts that show a bimodal fragment size distribution with the majority of
fragments > 1 kb can be sheared, prior to library preparation with a commercially-
available kit. If this takes place in non-dedicated facilities, we recommend the
following contamination-limiting precautions:
◦ Dedicated laminar flow hood for all pre-PCR steps (to be regularly
decontaminated);
Data availability:
The raw sequencing data have been deposited at the European Nucleotide Archive (ENA)
under project PRJEB61704 with accession IDs ERS15567903 - ERS15567924.
proved very effective when working with high-quality DNA (Nadeau et al. 2014, Van
Belleghem et al. 2018), its implementation in museum studies has been hampered by the
unpredictable outcomes due to DNA degradation of museum specimens (Graham et al.
2015, Souza et al. 2017, Lang et al. 2020). DNA degradation at restriction sites causes
failure or bias in RRS due to inefficient or failed restriction digests, while random shearing
lowers the number of fragments being flanked by both restriction sites and therefore
prevents adapter ligation (Puritz et al. 2014, Graham et al. 2015). A study on artificially-
induced DNA degradation illustrated a significant decrease in the number of RADtags per
individual, the number of variable sites and the percentage of identical RADtags retained
(Graham et al. 2015). These difficulties have dissuaded scientists from using RRS as a tool
to obtain museum population-level data. However, when large collections are available, a
careful screening assessment prior to library preparation could aid in the selection of
samples that are most likely to yield successful results. Therefore, we here assess: (i) to
what extent DNA degradation affects the success rate of RRS in a long-term time series of
avian museum specimens and (ii) whether we can predict a priori the success rate of RRS
on museum samples using easy-to-obtain DNA quality metrics.
We sampled 96 barn owls (Tyto alba alba) comprising both historical as well as
contemporary specimens. Historical samples were obtained from collections stored at the
Royal Belgian Institute of Natural Sciences and covered two distinct periods in time, mainly
from the 1930s (1929-1943, n = 15) and mainly from the 1970s (1966-1979, n = 22).
Contemporary specimens (n = 59) comprised road kills which were brought to bird
sanctuaries and stored in freezers immediately upon arrival. We collected toe pads of all
historical specimens to minimise voucher damage and liver or breast muscle tissue of the
contemporary specimens.
DNA of all specimens was extracted using the NucleoSpin Tissue Kit (Macherey-Nagel
GmbH). Concentrations were quantified by the Qubit fluorometer (Invitrogen) and a
fragment analysis of historical samples was conducted on a 2100 Bioanalyzer (Agilent).
While numerous variations on reduced-representation genome sequencing exist (Puritz et
al. 2014), we here focused on double-digest restriction site-associated DNA sequencing
(ddRAD) because of the simplified wet-lab workflow, low cost and highly-homogenous
coverage of sites across samples (Peterson et al. 2012). ddRAD libraries were constructed
following the protocol of Peterson et al. (2012). Briefly, we digested DNA samples using
two restriction enzymes, i.e. SbfI and MseI. Starting volumes of DNA were adjusted
according to sample specific DNA concentrations (18 µl, 12 µl or 6 µl of DNA when
concentrations were respectively lower than 20 ng/µl, between 20 and 32 ng/µl or higher
than 32 ng/µl). Barcoded SbfI and universal MseI-compatible adapters were subsequently
ligated to the digested genome, followed by a size selection of fragments of 270 bp
(“narrow peak” setting) on a BluePippin (Sage Science). Lastly, fragments were PCR
amplified using a barcoded reverse primer to obtain dual-indexed ddRAD libraries, which
were subsequently paired-end sequenced on an Illumina Novaseq6000 platform. Raw data
were demultiplexed using the process_radtags module in Stacks v.2.50 (Catchen et al.
2011). Trimmomatic v.0.39 (Bolger et al. 2014) was used to remove adapters and a sliding
window approach was applied to trim reads when quality fell below 20. Paired reads were
mapped to a reference genome (GCA_000687205.1_ASM68720v1) with BWA mem (Li
and Durbin 2009) under default settings and only properly paired reads with a quality > 30
were retained using SAMtools v.1.11 (Li et al. 2009). SNPs were subsequently called using
GATK’s HaplotypeCaller tool (McKenna et al. 2010).
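Two of the rules in this workflow are simple enough to express directly: the choice of starting volume from DNA concentration and the read-retention criterion applied after mapping. The sketch below is illustrative only. The function names are hypothetical, and it assumes that "quality > 30" refers to the mapping quality (MAPQ) field of a SAM record and that "properly paired" corresponds to SAM flag bit 0x2. It operates on plain SAM text lines and is not a substitute for SAMtools.

```python
PROPER_PAIR = 0x2  # SAM flag bit: each segment properly aligned (proper pair)


def starting_volume_ul(conc_ng_per_ul):
    """Starting DNA volume (µl) for a given concentration, as stated in the text."""
    if conc_ng_per_ul < 20:
        return 18
    if conc_ng_per_ul <= 32:
        return 12
    return 6


def keep_alignment(sam_line, mapq_threshold=30):
    """True if the record is properly paired and its MAPQ exceeds the threshold."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])  # SAM columns 2 (FLAG) and 5 (MAPQ)
    return bool(flag & PROPER_PAIR) and mapq > mapq_threshold


def sam_record(name, flag, mapq):
    """Build a minimal made-up SAM record for the example."""
    return "\t".join([name, str(flag), "chr1", "100", str(mapq),
                      "4M", "=", "200", "104", "ACGT", "FFFF"])


records = [
    sam_record("read1", 99, 60),  # proper pair, high MAPQ: kept
    sam_record("read2", 65, 60),  # paired but not a proper pair: dropped
    sam_record("read3", 99, 30),  # MAPQ not above 30: dropped
]
kept = [keep_alignment(r) for r in records]
volumes = [starting_volume_ul(c) for c in (15, 20, 32, 40)]
print(kept, volumes)
```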
Contamination assessment
In order to avoid any bias in downstream analyses arising from contaminated historical
specimens, we first assembled a stringently-filtered vcf based exclusively on recent
samples. Specimens showing more than 20% missing data were discarded and only
biallelic SNPs (--max-alleles 2) with a minimal SNP quality (--minQ) of 40 and an individual
genotype (--minGQ) quality of 30, present in at least 50% of the individuals (--max-missing)
and a minimum allele frequency (--maf) of 0.01 were retained with VCFtools (Danecek et
al. 2011). This resulted in a dataset of 31,012 SNPs covering 10,349 RADtags and 0.3% of
the genome. These reference SNPs were then extracted from all individuals,
i.e. both historical and contemporary specimens, to limit the erroneous
inclusion of exogenous DNA sequences from historical samples. As the SNP discovery
protocol is exclusively applied on recent samples, this could, however, result in a SNP
ascertainment bias and concomitant underestimation of genetic diversity in historical
populations or erroneous measures of genetic differentiation (Lachance and Tishkoff 2013).
To eliminate such bias, one should identify a sufficient number of high-quality historical
samples with minimal missing data and repopulate the SNP discovery pipeline with this
extended dataset.
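The site-level criteria listed above can be restated as a single predicate. The sketch below is a simplified in-memory illustration, not a reimplementation of VCFtools. The dictionary layout of a site is hypothetical, genotypes below the genotype-quality threshold are treated as missing, and whether each threshold is inclusive is an assumption of this example.

```python
def passes_filters(site, min_q=40, min_gq=30, min_call_rate=0.5, min_maf=0.01):
    """Apply the filters from the text to one site.

    site = {"alleles": [...], "qual": float,
            "genotypes": [(allele_indices or None, GQ), ...]}
    This layout is hypothetical and used for illustration only.
    """
    if len(site["alleles"]) > 2 or site["qual"] < min_q:
        return False
    # Genotypes below the GQ threshold are treated as missing
    called = [gt for gt, gq in site["genotypes"] if gt is not None and gq >= min_gq]
    if not called or len(called) / len(site["genotypes"]) < min_call_rate:
        return False
    alleles = [a for gt in called for a in gt]
    alt_freq = sum(1 for a in alleles if a != 0) / len(alleles)
    return min(alt_freq, 1 - alt_freq) >= min_maf


# Made-up sites illustrating each reason for exclusion
good = {"alleles": ["A", "G"], "qual": 60,
        "genotypes": [((0, 1), 50), ((0, 0), 45), ((1, 1), 35), (None, 0)]}
low_qual = dict(good, qual=20)
multiallelic = dict(good, alleles=["A", "G", "T"])
sparse = dict(good, genotypes=[((0, 1), 50), ((0, 0), 10), (None, 0), (None, 0)])
monomorphic = dict(good, genotypes=[((0, 0), 50), ((0, 0), 45)])

results = [passes_filters(s) for s in (good, low_qual, multiallelic, sparse, monomorphic)]
print(results)
```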
Statistical analysis
We ran a one-way ANOVA to test for differences in mean number of missing SNPs
amongst the three time periods and allowed for period-specific variances to account for
heteroscedasticity using the R package ‘nlme’ (Pinheiro et al. 2022). To predict the success
rate of ddRAD in museum samples, we applied generalised additive models (GAM) to
relate the percentage of missing SNPs per individual to either DNA concentration or
fragmentation using the R package ‘mgcv’ (Wood 2011). All statistical analyses were
performed in R 4.1.2 (R Core Team 2021). DNA fragmentation was
assessed from Bioanalyzer profiles by calculating the percentage of the area under the
curve in four distinct bins, i.e. bins that contain fragments ranging from, respectively,
35-200 bp, 200-400 bp, 400-700 bp or 700-10,380 bp.
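The fragmentation metric can be sketched as follows, assuming the Bioanalyzer profile has been exported as (fragment size in bp, signal) pairs. The trapezoidal integration and the assignment of each segment to a bin by its midpoint are simplifications made for this example; the instrument software may integrate regions differently.

```python
BINS_BP = [(35, 200), (200, 400), (400, 700), (700, 10380)]


def bin_percentages(trace, bins=BINS_BP):
    """Percentage of the area under the curve falling in each size bin.

    trace: list of (fragment_size_bp, signal) pairs sorted by size.
    Each segment between consecutive points is integrated with the
    trapezoidal rule and assigned to the bin containing its midpoint.
    """
    areas = [0.0] * len(bins)
    for (x0, y0), (x1, y1) in zip(trace, trace[1:]):
        area = 0.5 * (y0 + y1) * (x1 - x0)
        mid = 0.5 * (x0 + x1)
        for i, (lo, hi) in enumerate(bins):
            if lo <= mid < hi:
                areas[i] += area
                break
    total = sum(areas)
    return [100.0 * a / total for a in areas] if total else [0.0] * len(bins)


# Made-up trace with points placed on the bin boundaries
trace = [(35, 0), (200, 10), (400, 10), (700, 2), (10380, 0)]
pct = bin_percentages(trace)
print([round(p, 1) for p in pct])
```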
specimens, 43.4% for specimens sampled around the 1970s and 85.4% for specimens
originating from around the 1930s. The variance in missing data varied significantly
between time periods (Breusch-Pagan test, chi-square = 52.1, p < 0.001). Recent samples
showed consistently few missing SNPs, while the success rate in samples from the 1930s
varied slightly more. In contrast, samples from the 1970s showed large variation in missing
data, ranging from highly-successful samples to those that failed almost completely,
complicating the utility of age of the sample as a suitable predictor for success of RRS of
museum specimens.
Figure 8.
Percentage of missing SNP data per individual of Tyto alba alba in museum specimens of
different ages and recently collected material.
Mean DNA concentrations in historical and recent samples were, respectively, 20.2 ng/µl ±
12.4 (SD) and 30.6 ng/µl ± 13.9 (SD). A simple linear regression indicated that the number of
missing SNPs was not related to DNA concentration in recent samples (F(1, 57) = 0.016, p =
0.90). In contrast, a GAM indicated that DNA concentration was inversely related to the amount
of missing data in historical samples (F(1, 3.2) = 15.97, p < 0.001) and explained 66.8% of the
deviance (Fig. 9).
GAMs relating the amount of missing data to the percentage of fragments between 35-200
bp, 200-400 bp, 400-700 bp and 700-10,380 bp explained, respectively, 74.8%, 20.7%,
39.7% and 78.4% of the model deviance. The proportion of fragments in the lowest bin range
was strongly positively associated with the levels of missing data (F(1, 2.3) = 32.99, p <
0.001), while that in the highest bin range showed a clear negative association (F(1, 2.4) =
37.63, p < 0.001) (Fig. 10). Based on the latter model, the predicted amount of missing
data when 1%, 5%, 10%, 20%, 30% or 50% of fragments ranged between 700 bp and
10,380 bp was, respectively, 88%, 77%, 65%, 42%, 23% and 4%, clearly illustrating the
association between missing data and fragmentation.
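The predicted values reported above can serve as a rough look-up table when triaging samples. The sketch below interpolates linearly between the six reported points. This is a convenience approximation and not the fitted GAM; the sample names and the 50% cut-off are hypothetical.

```python
# (% of fragments in the 700-10,380 bp bin, predicted % missing SNPs), from the text
REPORTED = [(1, 88), (5, 77), (10, 65), (20, 42), (30, 23), (50, 4)]


def approx_missing_pct(long_fragment_pct, table=REPORTED):
    """Linear interpolation between the reported predictions (not the GAM itself)."""
    if not table[0][0] <= long_fragment_pct <= table[-1][0]:
        raise ValueError("outside the range of the reported predictions")
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= long_fragment_pct <= x1:
            return y0 + (y1 - y0) * (long_fragment_pct - x0) / (x1 - x0)


# Hypothetical samples with their percentage of long fragments
samples = {"owl_A": 3, "owl_B": 25, "owl_C": 45}
predicted = {name: approx_missing_pct(p) for name, p in samples.items()}

# Hypothetical cut-off: prioritise samples predicted to miss at most 50% of SNPs
selected = sorted((n for n, m in predicted.items() if m <= 50), key=predicted.get)
print(predicted, selected)
```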
Figure 9.
The association between DNA concentration and percentage of missing SNPs in historical and
contemporary samples of Tyto alba alba. The dashed line represents the predicted values
according to the fitted GAM (historical samples) and linear regression (recent samples).
To date, few studies have assessed whether RRS on museum collections is feasible, and if
so, how to optimise approaches. In a previous study using ddRAD target enriched
sequencing, an inclusion threshold for DNA concentration of 30 ng/µl was suggested (as
determined from the A260 values) (Souza et al. 2017). A similar finding emerges from our
study, as the percentage of missing data was notably lower from samples with DNA extract
concentrations above 30 ng/µl (Fig. 9). However, overall we note that DNA fragmentation
was a better predictor for the percentage of missing SNPs and successful sequencing,
compared to DNA concentration. DNA concentration was not always perfectly inversely
associated with DNA fragmentation as some samples with low DNA concentration also
showed low levels of DNA fragmentation, or conversely, some samples with high DNA
concentration were highly fragmented. Furthermore, DNA concentration of problematic
samples can be increased by eluting in smaller volumes or lysing more tissue during DNA
extractions, yet, fragmentation profiles will still remain unaffected. Lastly, unlike
fragmentation profiles, sample DNA concentrations are species and tissue dependent,
making it difficult to set a universal threshold.
Figure 10.
Inverse association between the percentage of DNA fragments in the highest bin range
(700-10380 bp) and percentage of missing SNPs in historical Tyto alba alba samples. The
dashed line represents the predicted values according to the fitted GAM.
ddRAD appears unsuitable to obtain sequence data from highly-fragmented samples (in
our case, the older museum samples dating from the 1930s and some more recently
collected material from the 1970s). More advanced target-capture-based technologies
such as HyRAD and HyRAD-X should be considered as an alternative (Suchan et al. 2016,
Schmid et al. 2017), although these technologies do require additional steps and higher
costs. However, obtaining population-level genomic data of museum specimens using
ddRAD may still remain feasible when sufficiently large museum collections are available.
Prioritising samples based on fragmentation profiles enables the targeting of effort on the
most promising samples, enabling production of high-quality data in a cost-efficient
manner.
• However, despite the challenges of using ddRAD on degraded DNA, we were able
to obtain ddRAD seq data from avian samples up to ca. 50 years old, and
screening the fragment profiles of the genomic DNA gave good predictions of levels
of missing data;
Data availability: The raw sequencing data from this case study have been deposited at the
European Nucleotide Archive (ENA) under project PRJEB59169 with accession IDs
ERS14470037 - ERS14470133.
2005). UV disinfection was applied before and after each experiment. Clean lab coats,
masks, shoe covers and hair caps were worn for each experiment. Gloves were changed
after each tube opening. Contact with other DNA labs was prohibited (only sterile material
was used and access to other labs was not permitted before or during the aDNA analysis).
Extraction negatives (samples treated like all others, but without any bone powder inside)
were included in all experiments. For tissue sampling, the outer layer of the bone was
removed by scraping off its surface using a structured tooth tungsten carbide cutter
attached to a hand rotary tool (8100 8v Max Rotary Tool). After 10 min of exposure to UV,
40-75 mg of bone powder was collected by drilling inside the bone fragment using the hand
rotary tool at 5000 rpm, with an engraving cutter (1.6 mm). DNA was extracted from the
bone powder following the protocol of Dabney et al. (2013b) and was eluted twice in 45 μl
of Tris-EDTA buffer with Tween-20. For one specimen (LAST9), four separate extractions
were performed. DNA extracts were then evaluated using fluorometry on a Qubit for total
double stranded DNA quantification and a Bioanalyzer for fragment-size profiling.
Table 5.
Bovid bone sample information, isolated DNA concentrations and proportions of reads mapped to
target and possible contaminant full genomes. ID: Tissue sample identification; conc: DNA
concentration in the DNA extract measured using Qubit; mapped: percentages of the deduplicated
paired-end reads mapping to the reference genomes of Bos taurus, Homo sapiens and Mus
musculus (separated by “/”); short: percentages of mapped reads smaller than 100 bp; long:
percentages of mapped reads longer than 300 bp (with insert between the paired reads); Neg1 and
Neg2: negative DNA extracts processed for each library.
A total of 7 to 51 ng of genomic DNA of each specimen was used as starting material for
the ‘Ultra’ single-tube DNA library preparation method described in Carøe et al. (2018).
DNA was not sheared. The protocol consists of an end repair step, a ligation of adapters
P3 and P5 (final concentration of 0.05 µM each), a fill-in reaction and purification using the
MinElute kit (Qiagen). qPCR was used to evaluate DNA quantities available for each
specimen for the indexing PCR. Ct values of 11.5 to 15.2 were measured by the qPCR.
Based on these Ct values, 10 to 13 cycles were applied to the indexing PCR in order to
perform an enrichment that would minimise impacts on the complexity of the DNA libraries
(Carøe et al. 2018). The samples were multiplexed with samples from other projects and
extraction negative controls in two different libraries of six samples each and sent to
Novogene (UK) Company Limited to be sequenced on an Illumina NovaSeq 6000 using a
paired-end mode and 150 cycles and to produce 30 Gbp of raw data per library (Suppl.
material 6). A total of 12 hours split across three days was necessary for the library
preparations.
In total, 225.52 million reads (33.828 Gbp) were generated for the seven specimens and
the two controls (Suppl. material 6). Illumina adapters and low-quality bases were removed
using AdapterRemoval and options --trimns --trimqualities --minquality 2 (Schubert et al.
2016). Trimmed reads were assembled using PEAR (Zhang et al. 2014), mapped to the
nuclear reference genomes of Bos taurus, Homo sapiens and Mus musculus (ARS-
UCD1.2, GRCh38.p13 and GRCm39, respectively) and to the mitochondrial genome of an
aurochs (isolate CPC98, GenBank accession number GU985279). Duplicated reads were
removed using MarkDuplicates (http://broadinstitute.github.io/picard/). Finally, post-mortem
damage was evaluated for data authentication using mapDamage2 (Jónsson et al. 2013)
and PMDtools (Skoglund et al. 2014). With access to a supercomputer, the bioinformatic
analyses were completed in two days.
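One signal commonly inspected when assessing post-mortem damage is an excess of C-to-T mismatches near the 5' ends of reads. The sketch below computes only the raw per-position frequency from gap-free (reference, read) pairs oriented from the 5' end of the read. It is an illustration of the idea under these simplifying assumptions, not a reimplementation of mapDamage2 or PMDtools, and the sequences are made up.

```python
def c_to_t_by_position(pairs, n_positions=5):
    """Fraction of reference C bases read as T at each of the first positions.

    pairs: iterable of (reference_seq, read_seq) strings of equal length,
    aligned without gaps and oriented 5' to 3' along the read.
    Returns None for positions where no reference C was observed.
    """
    ref_c = [0] * n_positions
    c_to_t = [0] * n_positions
    for ref, read in pairs:
        for i in range(min(n_positions, len(ref), len(read))):
            if ref[i] == "C":
                ref_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    return [t / c if c else None for t, c in zip(c_to_t, ref_c)]


# Made-up alignments: two of three reference C bases at position 0 are read as T
pairs = [
    ("CACGT", "TACGT"),
    ("CCGTA", "TCGTA"),
    ("CGCTA", "CGCTA"),
    ("ACCGT", "ACCGT"),
]
freqs = c_to_t_by_position(pairs)
print(freqs)
```

A frequency that is elevated at the first position and falls towards the read interior is the pattern expected from degraded DNA, whereas recent contaminant reads would be expected to show a flat profile.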
Data authentication
The single-tube library preparation protocol applied here (Carøe et al. 2018) provided DNA
reads for all seven specimens tested, with varying proportions of DNA that mapped to the
reference genome of Bos taurus. The authentication of these reads is critical for
downstream analysis and should test both the ancient and endogenous origin of the reads
filtered for analysis. This includes checking signatures of degradation including nucleotide
alterations and DNA size profiles (Hofreiter et al. 2001, Renaud et al. 2019) and
comparisons with reads from negative controls and reads mapped to other genomes. Here,
most reads mapped to the bovine genome were smaller than 100 bp and in the same
range as those obtained by Carøe et al. (2018) from eight historic grey wolf skins of 90 to
146 years old (40-180 bp with an average of around 60 bp). Estimating insert sizes when
mapping paired reads is useful to evaluate the size distribution of DNA fragments in the
library. Thus, even though it is more expensive, generating paired-end reads and reads
larger than the average short read lengths revealed by the Bioanalyzer (i.e. 100 cycles or
more) enables the exclusion of reads obtained from longer DNA fragments, which may
correspond to recent contamination. It is also important to filter out contaminant reads that
still map to the target genome. Indeed, short contaminant fragments can map to
evolutionarily more conserved regions of divergent genomes. Thus, removing reads that
map both to the target genome and other divergent genomes is a useful precaution.
Competitive mapping can address this by mapping raw sequencing data to a concatenated
reference composed of the target species' genome and other possible contaminant
genomes, such as the human genome. The sequences aligned only to the target part of
the concatenated reference genome can be kept for downstream analyses (Feuerborn et
al. 2020). Further authentication would include a completely independent analysis (from
extraction to sequencing) to check the congruence of the results (Andreeva et al. 2022).
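The read-selection step of competitive mapping, combined with the exclusion of long fragments, can be sketched as below. The sketch assumes that contigs in the concatenated reference carry a species prefix (here "bos_" and "human_") and that alignments are available as (read ID, contig, fragment length) tuples. These conventions, the function name and the example values are hypothetical; the 300 bp limit in the usage example mirrors the "long" category of Table 5 but is used here only for illustration.

```python
def reads_unique_to_target(alignments, target_prefix="bos_", max_fragment_bp=None):
    """IDs of reads aligned only to target contigs, optionally length-limited.

    alignments: iterable of (read_id, contig_name, fragment_length_bp).
    A read with any alignment outside the target part of the concatenated
    reference is discarded.
    """
    contigs_by_read = {}
    length_by_read = {}
    for read_id, contig, fragment_bp in alignments:
        contigs_by_read.setdefault(read_id, set()).add(contig)
        length_by_read[read_id] = fragment_bp
    kept = []
    for read_id, contigs in contigs_by_read.items():
        on_target_only = all(c.startswith(target_prefix) for c in contigs)
        short_enough = max_fragment_bp is None or length_by_read[read_id] <= max_fragment_bp
        if on_target_only and short_enough:
            kept.append(read_id)
    return sorted(kept)


# Made-up alignments against a concatenated cattle + human reference
alignments = [
    ("r1", "bos_chr1", 62),
    ("r2", "bos_chr5", 80),
    ("r2", "human_chr3", 80),   # also aligns to human: discarded
    ("r3", "human_chr1", 150),  # contaminant only: discarded
    ("r4", "bos_chr2", 410),    # target only, but a long fragment
]
target_only = reads_unique_to_target(alignments)
target_short = reads_unique_to_target(alignments, max_fragment_bp=300)
print(target_only, target_short)
```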
• DNA fragment-size profiles of the DNA extracts are indicative of the presence of
degraded DNA, but sequencing is necessary to evaluate percentages of
endogenous DNA.
Data availability: The raw sequencing data for this case study have been deposited at the
European Nucleotide Archive (ENA) under project PRJEB59185 with accession IDs
ERS14471070 - ERS14471078.
Concluding remarks
The continually evolving landscape of sequencing platforms and chemistries is resulting in
an ever-expanding set of opportunities for unlocking the genomic resources held in natural
history collections and there is a general increase in the feasibility of museum specimen
sequencing. With the rapid expansion in the field of museomics comes a pressing need for
the ongoing development, sharing and adoption of best practices. Areas of particular
importance include establishment of appropriate facilities, workflows and data verification
steps to minimise risks of contamination and sampling guidance which supports optimal
utilisation of museum specimens for genomic research. Another area of general
importance is attention to ethical issues associated with the use of specimens for genomic
science; many collections pre-date contemporary permit conditions or restrictions.
Guidelines for ethical issues associated with sampling specimens for genomic analysis are
mostly developed for human tissues and archaeofaunal remains (Prendergast and
Sawchuk 2018, Pálsdóttir et al. 2019); further dialogue (e.g. Canales et al. (2022)) and
policy development regarding best practice for genomic sampling of wider natural history
collections are needed.
Acknowledgements
We are grateful to Sean Prosser and Evegeny Zakharov (University of Guelph), Ben Price
(Natural History Museum London), Ryan Folk (Mississippi State University) and Chris
Raxworthy (American Museum of Natural History) for advice; to five reviewers and the
subject editor for their comments on the paper; to Suzanne Cubey at the Royal Botanic
Garden Edinburgh for facilitating access to the herbarium collection for tissue sampling; to
the Department of Biosciences at the University of Oslo for access to the ancient DNA
laboratory; to Bea De Cupere, Quentin Goffette, Mietje Germonpré and Annelise Folie from
the Royal Belgian Institute of Natural Sciences, Belgium for designing the project “LAST”,
authorising tissue sampling, giving access to RBINS specimens and to their associated
information; to the Kerkuilenwerkgroep Vlaanderen for providing tissue samples; to Lejia
Zhang (MfN Berlin) for isolating DNA and preparing libraries and Susan Mbedi and Sarah
Sparman (Berlin Center for Genomics in Biodiversity Research) for conducting quality
checks and sequencing.
Funding program
This research received support from the SYNTHESYS+ Project
(http://www.synthesys.info/), which was financed by the European Community Research Infrastructure
Action under the H2020 Integrating Activities Programme, Project number 823827. The
Royal Botanic Garden Edinburgh acknowledges support from the Scottish Government's
Rural and Environment Science and Analytical Services Division (RESAS). Analyses at
RBGE were run on the Crop Diversity Bioinformatics Resource which is funded by BBSRC
BB/S019669/1. The Joint Experimental Molecular Unit (JEMU) of the Royal Belgian
Institute of Natural Sciences (RBINS, Brussels) and the Royal Museum for Central Africa
(RMCA, Tervuren) acknowledge funding support from the Belgian Science Policy (Belspo).
Grant title
This work was funded as part of the SYNTHESYS+ Project (http://www.synthesys.info).
Author contributions
GF, LE, MLH, SJ, TvR, GS, CV, MV and PMH conceived and designed the study; GF and
PMH produced the initial draft of the paper and all authors contributed to augmenting, refining
and revising the manuscript. Case study 1 (insect mitogenomics) was led by TvR and JP;
Case study 2 (WGS of insects) by LE and MV; Case study 3 (genome skimming of plants)
by SJ and MM; Case study 4 (target capture of plants) by GF, MLH, CK, FP and PMH;
Case study 5 (RAD seq of birds) by CV; Case study 6 (bovid bone sequencing) by GS.
Conflicts of interest
The authors have declared that no competing interests exist.
References
• Austin R, Sholts S, Williams L, Kistler L, Hofman C (2019) To curate the molecular past,
museums need a carefully considered set of best practices. Proceedings of the National
Academy of Sciences of the United States of America 116 (5): 1471‑1474.
https://doi.org/10.1073/pnas.1822038116
• Bakker F, Lei D, Yu J, Mohammadin S, Wei Z, van de Kerke S, Gravendeel B,
Nieuwenhuis M, Staats M, Alquezar-Planas D, Holmer R (2016) Herbarium genomics:
plastome sequence assembly from a range of herbarium specimens using an Iterative
Organelle Genome Assembly pipeline. Biological Journal of the Linnean Society 117
(1): 33‑43. https://doi.org/10.1111/bij.12642
• Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov A, Lesin V, Nikolenko
S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev M,
Pevzner P (2012) SPAdes: a new genome assembly algorithm and its applications to
single-cell sequencing. Journal of Computational Biology 19 (5): 455‑477.
https://doi.org/10.1089/cmb.2012.0021
• Bebber D, Carine M, Wood JI, Wortley A, Harris D, Prance G, Davidse G, Paige J,
Pennington T, Robson NB, Scotland R (2010) Herbaria are a major frontier for species
discovery. Proceedings of the National Academy of Sciences of the United States of
America 107 (51): 22169‑22171. https://doi.org/10.1073/pnas.1011841108
• Besnard G, Bertrand JM, Delahaie B, Bourgeois YC, Lhuillier E, Thébaud C (2016)
Valuing museum specimens: high-throughput DNA sequencing on historical collections
of New Guinea crowned pigeons (Goura). Biological Journal of the Linnean Society 117
(1): 71‑82. https://doi.org/10.1111/bij.12494
• Besnard G, Gaudeul M, Lavergne S, Muller S, Rouhan G, Sukhorukov A,
Vanderpoorten A, Jabbour F (2018) Herbarium-based science in the twenty-first
century. Botany Letters 165 (3-4): 323‑327.
https://doi.org/10.1080/23818107.2018.1482783
• Bieker V, Sánchez Barreiro F, Rasmussen J, Brunier M, Wales N, Martin M (2020)
Metagenomic analysis of historical herbarium specimens reveals a postmortem
microbial community. Molecular Ecology Resources 20 (5): 1206‑1219.
https://doi.org/10.1111/1755-0998.13174
• Billerman S, Walsh J (2019) Historical DNA as a tool to address key questions in avian
biology and evolution: A review of methods, challenges, applications, and future
directions. Molecular Ecology Resources 19 (5): 1115‑1130.
https://doi.org/10.1111/1755-0998.13066
• Blaimer B, Lloyd M, Guillory W, Brady S (2016) Sequence capture and phylogenetic
utility of genomic ultraconserved elements obtained from pinned insect specimens.
PLOS One 11 (8): e0161531. https://doi.org/10.1371/journal.pone.0161531
• Bolger A, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina
sequence data. Bioinformatics 30 (15): 2114‑2120.
https://doi.org/10.1093/bioinformatics/btu170
• Bradshaw M, Carey J, Liu M, Bartholomew H, Jurick W, Hambleton S, Hendricks D,
Schnittler M, Scholler M (2023) Genetic time traveling: sequencing old herbarium
specimens, including the oldest herbarium specimen sequenced from kingdom Fungi,
reveals the population structure of an agriculturally significant rust. New Phytologist 237
(4): 1463‑1473. https://doi.org/10.1111/nph.18622
• Brewer G, Clarkson J, Maurin O, Zuntini A, Barber V, Bellot S, Biggs N, Cowan R,
Davies NJ, Dodsworth S, Edwards S, Eiserhardt W, Epitawalage N, Frisby S, Grall A,
yields but consistent sequencing success. Frontiers in Plant Science 12: 669064.
https://doi.org/10.3389/fpls.2021.669064
• Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Research 30 (14):
3059‑3066. https://doi.org/10.1093/nar/gkf436
• Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper
A, Markowitz S, Duran C, Thierer T, Ashton B, Meintjes P, Drummond A (2012)
Geneious Basic: an integrated and extendable desktop software platform for the
organization and analysis of sequence data. Bioinformatics 28 (12): 1647‑1649.
https://doi.org/10.1093/bioinformatics/bts199
• Kircher M, Sawyer S, Meyer M (2012) Double indexing overcomes inaccuracies in
multiplex sequencing on the Illumina platform. Nucleic Acids Research 40 (1): e3.
https://doi.org/10.1093/nar/gkr771
• Kistler L, Ware R, Smith O, Collins M, Allaby R (2017) A new model for ancient DNA
decay based on paleogenomic meta-analysis. Nucleic Acids Research 45 (11):
6310‑6320. https://doi.org/10.1093/nar/gkx361
• Kjær K, Winther Pedersen M, De Sanctis B, De Cahsan B, Korneliussen T, Michelsen
C, Sand K, Jelavić S, Ruter A, Schmidt AA, Kjeldsen K, Tesakov A, Snowball I, Gosse J,
Alsos I, Wang Y, Dockter C, Rasmussen M, Jørgensen M, Skadhauge B, Prohaska A,
Kristensen JÅ, Bjerager M, Allentoft M, Coissac E, PhyloNorway C, Rouillard A,
Simakova A, Fernandez-Guerra A, Bowler C, Macias-Fauria M, Vinner L, Welch J, Hidy
A, Sikora M, Collins M, Durbin R, Larsen N, Willerslev E (2022) A 2-million-year-old
ecosystem in Greenland uncovered by environmental DNA. Nature 612 (7939):
283‑291. https://doi.org/10.1038/s41586-022-05453-y
• Knapp M, Clarke A, Horsburgh KA, Matisoo-Smith E (2012) Setting the stage - building
and working in an ancient DNA laboratory. Annals of Anatomy 194 (1): 3‑6.
https://doi.org/10.1016/j.aanat.2011.03.008
• Knyshov A, Gordon EL, Weirauch C (2019) Cost‐efficient high throughput capture of
museum arthropod specimen DNA using PCR‐generated baits. Methods in Ecology and
Evolution 10 (6): 841‑852. https://doi.org/10.1111/2041-210x.13169
• Korlević P, McAlister E, Mayho M, Makunin A, Flicek P, Lawniczak MN (2021) A
minimally morphologically destructive approach for DNA retrieval and whole genome
shotgun sequencing of pinned historic Dipteran vector species. Genome Biology and
Evolution 13 (10): evab226. https://doi.org/10.1093/gbe/evab226
• Krause J, Briggs A, Kircher M, Maricic T, Zwyns N, Derevianko A, Pääbo S (2010) A
complete mtDNA genome of an early modern human from Kostenki, Russia. Current
Biology 20 (3): 231‑236. https://doi.org/10.1016/j.cub.2009.11.068
• Lachance J, Tishkoff S (2013) SNP ascertainment bias in population genetic analyses:
why it is important, and how to correct it. Bioessays 35 (9): 780‑786.
https://doi.org/10.1002/bies.201300014
• Lalueza-Fox C (2022) Museomics. Current Biology 32 (21): R1214‑R1215.
https://doi.org/10.1016/j.cub.2022.09.019
• Langmead B, Salzberg S (2012) Fast gapped-read alignment with Bowtie 2. Nature
Methods 9 (4): 357‑359. https://doi.org/10.1038/nmeth.1923
• Lang PM, Weiß C, Kersten S, Latorre S, Nagel S, Nickel B, Meyer M, Burbano H (2020)
Hybridization ddRAD-sequencing for population genomics of nonmodel plants using
• Marx V (2017) How to deduplicate PCR. Nature Methods 14 (5): 473‑476.
https://doi.org/10.1038/nmeth.4268
• Mascarello M, Amalfi M, Asselman P, Smets E, Hardy O, Beeckman H, Janssens S
(2021) Genome skimming reveals novel plastid markers for the molecular identification
of illegally logged African timber species. PLOS One 16 (6): e0251655.
https://doi.org/10.1371/journal.pone.0251655
• Matasci N, Hung L, Yan Z, Carpenter E, Wickett N, Mirarab S, Nguyen N, Warnow T,
Ayyampalayam S, Barker M, Burleigh JG, Gitzendanner M, Wafula E, Der J,
dePamphilis C, Roure B, Philippe H, Ruhfel B, Miles N, Graham S, Mathews S, Surek
B, Melkonian M, Soltis D, Soltis P, Rothfels C, Pokorny L, Shaw J, DeGironimo L,
Stevenson D, Villarreal JC, Chen T, Kutchan T, Rolf M, Baucom R, Deyholos M,
Samudrala R, Tian Z, Wu X, Sun X, Zhang Y, Wang J, Leebens-Mack J, Wong GK
(2014) Data access for the 1,000 Plants (1KP) project. GigaScience 3 (1):
2047-217X-3-17. https://doi.org/10.1186/2047-217X-3-17
• Mayer C, Dietz L, Call E, Kukowka S, Martin S, Espeland M (2021) Adding leaves to the
Lepidoptera tree: capturing hundreds of nuclear genes from old museum specimens.
Systematic Entomology 46 (3): 649‑671. https://doi.org/10.1111/syen.12481
• McCormack J, Tsai WE, Faircloth B (2016) Sequence capture of ultraconserved
elements from bird museum specimens. Molecular Ecology Resources 16 (5):
1189‑1203. https://doi.org/10.1111/1755-0998.12466
• McDonough M, Parker L, Rotzel McInerney N, Campana M, Maldonado J (2018)
Performance of commonly requested destructive museum samples for mammalian
genomic studies. Journal of Mammalogy 99 (4): 789‑802.
https://doi.org/10.1093/jmammal/gyy080
• McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K,
Altshuler D, Gabriel S, Daly M, DePristo M (2010) The Genome Analysis Toolkit: a
MapReduce framework for analyzing next-generation DNA sequencing data. Genome
Research 20 (9): 1297‑1303. https://doi.org/10.1101/gr.107524.110
• Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly
multiplexed target capture and sequencing. Cold Spring Harbor Protocols 2010 (6):
pdb‑prot5448. https://doi.org/10.1101/pdb.prot5448
• Mikheyev A, Zwick A, Magrath ML, Grau M, Qiu L, Su YN, Yeates D (2017) Museum
genomics confirms that the Lord Howe Island stick insect survived extinction. Current
Biology 27 (20): 3157‑3161. https://doi.org/10.1016/j.cub.2017.08.058
• Miller S, Barrow L, Ehlman S, Goodheart J, Greiman S, Lutz H, Misiewicz T, Smith S,
Tan M, Thawley C, Cook J, Light J (2020) Building natural history collections for the
twenty-first century and beyond. BioScience 70 (8): 674‑687.
https://doi.org/10.1093/biosci/biaa069
• Mullin V, Stephen W, Arce A, Nash W, Raine C, Notton D, Whiffin A, Blagderov V,
Gharbi K, Hogan J, Hunter T, Irish N, Jackson S, Judd S, Watkins C, Haerty W, Ollerton
J, Brace S, Gill R, Barnes I (2023) First large‐scale quantification study of DNA
preservation in insects from natural history collections using genome‐wide sequencing.
Methods in Ecology and Evolution 14 (2): 360‑371.
https://doi.org/10.1111/2041-210x.13945
• Nadeau N, Ruiz M, Salazar P, Counterman B, Medina JA, Ortiz-Zuazaga H, Morrison A,
McMillan WO, Jiggins C, Papa R (2014) Population genomics of parallel hybrid zones in
• Raxworthy C, Smith BT (2021) Mining museums for historical DNA: advances and
challenges in museomics. Trends in Ecology & Evolution 36 (11): 1049‑1060.
https://doi.org/10.1016/j.tree.2021.07.009
• Rayo E, Ferrari G, Neukamm J, Akgül G, Breidenstein A, Cooke M, Phillips C,
Bouwman A, Rühli F, Schuenemann V (2022) Non-destructive extraction of DNA from
preserved tissues in medical collections. BioTechniques 72 (2): 60‑64.
https://doi.org/10.2144/btn-2021-0014
• Renaud G, Schubert M, Sawyer S, Orlando L (2019) Authentication and assessment of
contamination in ancient DNA. Methods in Molecular Biology 1963: 163‑194.
https://doi.org/10.1007/978-1-4939-9176-1_17
• Ristaino J (2020) The importance of mycological and plant herbaria in tracking plant
killers. Frontiers in Ecology and Evolution 7: 521.
https://doi.org/10.3389/fevo.2019.00521
• Rochette N, Rivera-Colón A, Walsh J, Sanger T, Campbell-Staton S, Catchen J (2023)
On the causes, consequences, and avoidance of PCR duplicates: towards a theory of
library complexity. Molecular Ecology Resources 23: 1299‑1318.
https://doi.org/10.1111/1755-0998.13800
• Rohland N, Siedel H, Hofreiter M (2004) Nondestructive DNA extraction method for
mitochondrial DNA analyses of museum specimens. BioTechniques 36 (5): 814‑821.
https://doi.org/10.2144/04365ST05
• Roycroft E, Moritz C, Rowe K, Moussalli A, Eldridge MB, Portela Miguez R, Piggott M,
Potter S (2022) Sequence capture from historical museum specimens: maximizing
value for population and phylogenomic studies. Frontiers in Ecology and Evolution 10:
931644. https://doi.org/10.3389/fevo.2022.931644
• Ruiz-Gartzia I, Lizano E, Marques-Bonet T, Kelley J (2022) Recovering the genomes
hidden in museum wet collections. Molecular Ecology Resources 22 (6): 2127‑2129.
https://doi.org/10.1111/1755-0998.13631
• Sánchez Barreiro F, Vieira F, Martin M, Haile J, Gilbert MTP, Wales N (2017)
Characterizing restriction enzyme-associated loci in historic ragweed (Ambrosia
artemisiifolia) voucher specimens using custom-designed RNA probes. Molecular
Ecology Resources 17 (2): 209‑220. https://doi.org/10.1111/1755-0998.12610
• Santos B, Miller M, Miklasevskaja M, McKeown JA, Redmond N, Coddington J, Bird J,
Miller S, Smith A, Brady S, Buffington M, Chamorro ML, Dikow T, Gates M, Goldstein P,
Konstantinov A, Kula R, Silverson N, Solis MA, deWaard S, Naik S, Nikolova N,
Pentinsaari M, Prosser SJ, Sones J, Zakharov E, deWaard J (2023) Enhancing DNA
barcode reference libraries by harvesting terrestrial arthropods at the National Museum
of Natural History. Biodiversity Data Journal 11: e100904.
https://doi.org/10.3897/BDJ.11.e100904
• Särkinen T, Staats M, Richardson J, Cowan R, Bakker F (2012) How to open the
treasure chest? Optimising DNA extraction from herbarium specimens. PLOS One 7
(8): e43808. https://doi.org/10.1371/journal.pone.0043808
• Sawyer S, Krause J, Guschanski K, Savolainen V, Pääbo S (2012) Temporal patterns of
nucleotide misincorporations and DNA fragmentation in ancient DNA. PLOS One 7 (3):
e34131. https://doi.org/10.1371/journal.pone.0034131
• Schmid S, Genevest R, Gobet E, Suchan T, Sperisen C, Tinner W, Alvarez N (2017)
HyRAD-X, a versatile method combining exome capture and RAD sequencing to extract
genomic information from ancient DNA. Methods in Ecology and Evolution 8 (10):
1374‑1388. https://doi.org/10.1111/2041-210x.12785
• Schubert M, Ermini L, Der Sarkissian C, Jónsson H, Ginolhac A, Schaefer R, Martin M,
Fernández R, Kircher M, McCue M, Willerslev E, Orlando L (2014) Characterization of
ancient and modern genomes by SNP detection and phylogenomic and metagenomic
analysis using PALEOMIX. Nature Protocols 9 (5): 1056‑1082.
https://doi.org/10.1038/nprot.2014.063
• Schubert M, Lindgreen S, Orlando L (2016) AdapterRemoval v2: rapid adapter
trimming, identification, and read merging. BMC Research Notes 9: 88.
https://doi.org/10.1186/s13104-016-1900-2
• Seguin-Orlando A, Hoover C, Vasiliev S, Ovodov N, Shapiro B, Cooper A, Rubin E,
Willerslev E, Orlando L (2015) Amplification of TruSeq ancient DNA libraries with
AccuPrime Pfx: consequences on nucleotide misincorporation and methylation patterns.
STAR: Science & Technology of Archaeological Research 1 (1): 1‑9.
https://doi.org/10.1179/2054892315Y.0000000005
• Shepherd L (2017) A non-destructive DNA sampling technique for herbarium
specimens. PLOS One 12 (8): e0183555. https://doi.org/10.1371/journal.pone.0183555
• Shi L, Chen H, Jiang M, Wang L, Wu X, Huang L, Liu C (2019) CPGAVAS2, an
integrated plastome sequence annotator and analyzer. Nucleic Acids Research 47
(W1): W65‑W73. https://doi.org/10.1093/nar/gkz345
• Short A, Dikow T, Moreau C (2018) Entomological collections in the age of big data.
Annual Review of Entomology 63: 513‑530.
https://doi.org/10.1146/annurev-ento-031616-035536
• Skoglund P, Northoff B, Shunkov M, Derevianko A, Pääbo S, Krause J, Jakobsson M
(2014) Separating endogenous ancient DNA from modern day contamination in a
Siberian Neandertal. Proceedings of the National Academy of Sciences of the United
States of America 111 (6): 2229‑2234. https://doi.org/10.1073/pnas.1318934111
• Souza C, Murphy N, Villacorta-Rath C, Woodings L, Ilyushkina I, Hernandez C, Green
B, Bell J, Strugnell J (2017) Efficiency of ddRAD target enriched sequencing across
spiny rock lobster species (Palinuridae: Jasus). Scientific Reports 7: 6781.
https://doi.org/10.1038/s41598-017-06582-5
• Souza HAV, Muller LAC, Brandão RL, Lovato MB (2012) Isolation of high quality and
polysaccharide-free DNA from leaves of Dimorphandra mollis (Leguminosae), a tree
from the Brazilian Cerrado. Genetics and Molecular Research 11 (1): 756‑764.
https://doi.org/10.4238/2012.March.22.6
• Soza V, Lindsley D, Waalkes A, Ramage E, Patwardhan R, Burton J, Adey A, Kumar A,
Qiu R, Shendure J, Hall B (2019) The Rhododendron genome and chromosomal
organization provide insight into shared whole-genome duplications across the heath
family (Ericaceae). Genome Biology and Evolution 11 (12): 3353‑3371.
https://doi.org/10.1093/gbe/evz245
• Speer K, Hawkins MR, Flores M, McGowen M, Fleischer R, Maldonado J, Campana M,
Muletz-Wolz C (2022) A comparative study of RNA yields from museum specimens,
including an optimized protocol for extracting RNA from formalin-fixed specimens.
Frontiers in Ecology and Evolution 10: 953131.
https://doi.org/10.3389/fevo.2022.953131
Supplementary materials
Suppl. material 6: Case study 6 sample, DNA and read data description