An Overview of Next-Generation Sequencing
Article Published: March 17, 2021 | Athina Gkazi, PhD
Over the last 56 years, researchers have been developing methods and
technologies to assist in the determination of nucleic acid sequences in biological
samples. Our ability to sequence DNA and RNA accurately has had a great impact
in numerous research fields. This article discusses what next-generation
sequencing (NGS) is, advances in the technology and its applications.
The structure of DNA was determined in 1953 by Watson and Crick based on the
fundamental DNA crystallography and X-ray diffraction work of Rosalind
Franklin.1,2 However, the first molecule to be sequenced was actually RNA: a tRNA,
sequenced in 1965 by Robert Holley, followed later by the RNA of bacteriophage MS2.3,4
Various research groups then began adapting these methods to sequence DNA, with a
breakthrough coming in 1977 when Frederick Sanger and colleagues developed the
chain-termination method.5 By 1986, the first automated DNA sequencing method
had been developed.6,7 This was the beginning of a golden era for the
development and refinement of sequencing platforms, including the pivotal
capillary DNA sequencer.
The key principles behind Sanger sequencing and second-generation (2G) NGS share some
similarities.11,12 In 2G NGS, the genetic material (DNA or RNA) is fragmented and
oligonucleotides of known sequence are attached to the fragments in a step known as
adapter ligation, enabling the fragments to interact with the chosen sequencing
system. The bases of each fragment are then identified by their emitted signals.
The main difference between Sanger sequencing and 2G NGS is sequencing volume:
NGS processes millions of reactions in parallel, resulting in higher throughput,
sensitivity and speed at reduced cost. Many genome sequencing projects that took
years with Sanger sequencing methods can now be completed within hours using NGS.
There are two main approaches in NGS technology, short-read and long-read
sequencing, each with its own advantages and limitations (Table 1).13 The main
motivation for investing in the development of NGS is its wide applicability in both
clinical and research settings. In clinical settings, NGS is used to diagnose various
disorders via identification of germline or somatic mutations.14,15 The shift
towards NGS in clinical practice is justified by the power of the technique paired
with its continually declining costs. NGS is also a valuable tool in metagenomic
studies and is used for infectious disease diagnostics, monitoring and
management.16,17 In 2020, NGS methods were pivotal in characterizing the SARS-CoV-2
genome and they continue to contribute to monitoring the COVID-19
pandemic.18,19
Figure 1: The evolution of sequencing methodologies.
The term NGS is often taken to mean 2G technologies; however, third (3G) and
fourth (4G) generation technologies, which work on different underlying
principles, have since evolved.
Proton detection sequencing relies on counting hydrogen ions released during the
polymerization of DNA. Unlike other techniques, it does not use fluorescence and
does not use modified nucleotides or optics. Instead, pH changes are detected by
semiconductor sensor chips and converted to digital information.20
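As a rough illustration of this principle, the sketch below simulates an idealized, noise-free run in which nucleotides are flowed over the chip one at a time in a fixed cycle. The signal for each flow is taken to be the number of bases incorporated, since a homopolymer run releases proportionally more hydrogen ions. The function names, the flow order "TACG" and the absence of noise are assumptions made for illustration, not details of any particular instrument.

```python
from itertools import cycle


def simulate_flowgram(new_strand, flow_order="TACG", max_flows=200):
    """Return (flowed_base, incorporations) pairs for an idealized run.

    `new_strand` is the strand being synthesized, so each flow incorporates
    as many consecutive bases as match the flowed nucleotide.
    """
    flows = []
    pos = 0
    for n, base in enumerate(cycle(flow_order)):
        # Stop once the strand is complete or the flow budget is used up.
        if pos >= len(new_strand) or n >= max_flows:
            break
        count = 0
        while pos < len(new_strand) and new_strand[pos] == base:
            count += 1  # one hydrogen ion released per incorporated base
            pos += 1
        flows.append((base, count))
    return flows


def call_bases(flows):
    """Convert per-flow incorporation counts back into a base sequence."""
    return "".join(base * count for base, count in flows)


flows = simulate_flowgram("TTAGGGC")
print(flows)
print(call_bases(flows))
```

In real data the per-flow signal is continuous and noisy, which is why long homopolymer runs are harder to count accurately than this idealized sketch suggests.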
By far the most popular sequencing by synthesis (SBS) method is reversible terminator
sequencing, which utilizes “bridge amplification”. During the synthesis reactions, the
fragments bind to oligonucleotides on the flow cell, creating a bridge from one side of
the sequence (P5 oligo on flow cell) to the other (P7), which is then amplified. The
added fluorescently labeled nucleotides are detected using direct imaging.23
Unlike SBS, sequencing by ligation does not use DNA polymerase to create a
second strand. The sensitivity of DNA ligase to base-pairing mismatches is utilized
instead, with the fluorescence produced used to determine the target sequence.
Digital images taken after each reaction are then used for analysis. DNA nanoball
sequencing is a form of sequencing by ligation that exploits rolling circle
replication. Concatenated DNA copies are compacted into DNA nanoballs and
bound to sequencing slides in a dense grid of spots ready for ligation-based
sequencing reactions.24,25 Whilst the nanoball technique reduces running costs,
the short sequences produced can be problematic for read mapping.
Nevertheless, high costs, high error rates, large quantities of sequencing data and
low read depth can be problematic.32,33
Nucleic acids (DNA or RNA) are extracted from the selected samples (blood,
sputum, bone marrow etc.). Extracted samples undergo quality control (QC) checks
using standard methods (spectrophotometric, fluorometric or gel electrophoretic).
RNA must be reverse transcribed into cDNA, although some library
preparation kits include this step.
The final libraries can undergo QC checks using qPCR, to confirm DNA quality and
quantity. This will also allow the correct concentration of sample to be prepared
for sequencing.
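Preparing the correct loading concentration usually means converting a measured mass concentration into molarity and then diluting. The sketch below assumes a concentration in ng/µL, a known average fragment length and the common approximation of 660 g/mol per base pair of double-stranded DNA. The function names and example values are hypothetical, and qPCR-based quantification kits may report molarity directly.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp, bp_mass=660.0):
    """Convert a dsDNA library concentration from ng/uL to nM.

    Assumes an average mass of `bp_mass` g/mol per base pair.
    """
    return conc_ng_per_ul / (bp_mass * mean_fragment_bp) * 1e6


def dilution_volumes(stock_nm, target_nm, final_volume_ul):
    """Return (stock_ul, diluent_ul) needed to reach `target_nm` (C1*V1 = C2*V2)."""
    if target_nm > stock_nm:
        raise ValueError("target concentration exceeds stock concentration")
    stock_ul = target_nm * final_volume_ul / stock_nm
    return stock_ul, final_volume_ul - stock_ul


# Hypothetical library: 2.0 ng/uL with a 400 bp average fragment size.
stock = library_molarity_nm(2.0, 400)
stock_ul, diluent_ul = dilution_volumes(stock, target_nm=4.0, final_volume_ul=20.0)
print(f"{stock:.2f} nM stock: mix {stock_ul:.2f} uL library with {diluent_ul:.2f} uL diluent")
```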
(3) Sequencing
The generated data files are analyzed depending on the workflow used. Analysis
methods are highly dependent on the aim of the study.36–38
Whilst they may reduce the number of samples that can be analyzed in a given run,
paired-end and mate pair sequencing offer advantages in downstream data
analysis, particularly for de novo assemblies. These techniques link together
sequencing reads that are read from both ends of a fragment (paired-end) or that are
separated by an intervening DNA region (mate pair).
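The relationship between a fragment and its paired-end reads can be sketched as follows. This is a simplified model: read 1 is taken from the start of the fragment and read 2 is the reverse complement of its far end, with no adapters or sequencing errors. The function names and the example fragment are invented for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (read1, read2, unsequenced_gap) for one fragment.

    read1 is read from the 5' end of the forward strand; read2 is read from
    the 5' end of the reverse strand, i.e. the opposite end of the fragment.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    # Bases in the middle of the fragment that neither read covers.
    gap = max(len(fragment) - 2 * read_length, 0)
    return read1, read2, gap


read1, read2, gap = paired_end_reads("ACGTTGCAAGGCTTAACCGGTA", read_length=6)
print(read1, read2, gap)
```

Because the approximate distance between the two reads is known, an assembler or aligner can use the pair to span regions that a single short read could not place unambiguously.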
There are clearly many options when it comes to selecting a sequencing strategy.
The following are some of the key considerations when deciding on the
appropriate library preparation and sequencing platform:
(d) DNA or RNA sequencing – do you need to look at the genome or transcriptome?
(l) Bioinformatic tools – experiment dependent. Depending on the sample and the
biological question, the entire process of sequence analysis can be adapted.
Short-read sequencing
Advantages
· Higher sequence fidelity
· Cheap
· Can sequence fragmented DNA
Limitations
· Not able to resolve structural variants, phasing alleles or distinguish highly homologous genomic regions
· Unable to provide coverage of some repetitive regions
Any kind of NGS technology generates a significant amount of output data. The
basics of sequence analysis follow a common workflow, which includes a raw
read QC step, pre-processing and mapping, followed by post-alignment processing,
variant calling, variant annotation and visualization.
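The raw read QC step can be illustrated with a minimal sketch that parses FASTQ records and discards reads whose mean base quality is too low. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the threshold of 20 are arbitrary choices for illustration, and dedicated QC tools do considerably more than this.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(handle, min_mean_quality=20):
    """Keep reads whose mean base quality is at least `min_mean_quality`."""
    return [
        (name, seq)
        for name, seq, qual in parse_fastq(handle)
        if mean_phred(qual) >= min_mean_quality
    ]


# Two toy reads: 'I' encodes Phred 40, '!' encodes Phred 0.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
print(filter_reads(io.StringIO(example)))
```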
After the quality of the reads has been checked and pre-processing performed, the
next step will depend on the existence of a reference genome. In the case of a de
novo genome assembly, the generated sequences are aligned into contigs using
their overlapping regions. This is often done with the assistance of processing
pipelines that can include scaffolding steps to help with contig ordering,
orientation and the removal of repetitive regions, thus increasing the assembly
continuity.40,41 If the generated sequences are mapped (aligned) to a reference
genome or transcriptome, variations compared to the reference sequence can be
identified. Today, there is a plethora of mapping tools (more than 60) that have
been adapted to handle the growing quantities of data generated by NGS, exploit
technological advancements and tackle protocol developments.42 One difficulty,
due to the increasing number of mappers, is being able to find the most suitable
one. Information is usually scattered through publications, source codes (when
available), manuals and other documentation. Some of the tools will also offer a
mapping quality check that is necessary as some biases will only show after the
mapping step. Similar to quality control prior to mapping, the correct processing of
mapped reads is a crucial step, during which duplicated mapped reads (including
but not limited to PCR artifacts) are removed. This is a standardized method, and
most tools share common features. Once the reads have been mapped and
processed, they need to be analyzed in an experiment-specific fashion, in what is
known as variant analysis. This step can identify single nucleotide polymorphisms
(SNPs), indels (an insertion or deletion of bases), inversions, haplotypes,
differential gene transcription in the case of RNA-seq and much more. Despite the
multitude of tools for genome assembly, alignment and analysis, there is a
constant need for new and improved versions to ensure that the sensitivity,
accuracy and resolution can match the rapidly advancing NGS techniques.
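The idea of joining reads into contigs through their overlapping regions, as described for de novo assembly above, can be sketched with a naive greedy merge. This toy version assumes error-free reads from a single strand, ignores reads contained within other reads and would not scale to real datasets. Production assemblers use overlap or de Bruijn graphs instead, and the function names here are illustrative only.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Repetitive regions are exactly where this greedy approach breaks down, because several reads share the same overlap and the merge order becomes ambiguous.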
The final step is visualization, for which data complexity can pose a significant
challenge. Depending on the experiment and the research questions posed, there
are a number of tools that can be used. If a reference genome is available, the
Integrative Genomics Viewer (IGV) is a popular choice,43 as is the Genome Browser. If
experiments include WGS or WES, the Variant Explorer is a particularly good tool as
it can be used to sieve through thousands of variants and allow users to focus on
their most important findings. Visualization tools like VISTA allow for comparison
between different genomic sequences. Programs suitable for de novo genome
assemblies44 are more limited. However, tools like Bandage and Icarus have been
used to explore and analyze the assembled genomes.
NGS has enabled us to discover and study genomes in ways that were never
possible before. However, the complexity of the sample processing for NGS has
exposed bottlenecks in managing, analyzing and storing the datasets. One of the
main challenges is the computational resources required for the assembly,
annotation, and analysis of sequencing data.45 The vast amount of data generated
by NGS analysis is another critical challenge. Data centers are reaching high
storage capacity levels and are constantly trying to cope with increasing demands,
running the risk of permanent data loss.46 New strategies are continuously being
proposed with the aim of increasing efficiency, reducing sequencing error, maximizing
reproducibility and ensuring correct data management.
Since the early 2000s NGS has become an invaluable tool in both research and
clinical/diagnostic settings for modern medicine and in drug discovery, with the
use of methods including WGS, WES, targeted sequencing, transcriptome,
epigenome and metagenome sequencing dramatically increasing. Figure 3
summarizes workflows and options for targeting different datasets.
Through WGS, researchers are able to study not only genes and their involvement
in disease in humans and animals, but also characteristics of microbial and
agricultural populations, providing important epidemiological and evolutionary
data.47–52 Thus far, there has been a plethora of studies where mutations,
rearrangements and fusion events were identified using WGS. Currently, WGS is
used for the surveillance of antimicrobial resistance, one of the major global health
challenges.53,54 As the costs are constantly decreasing, WGS is more frequently
used for resequencing the entire human genome in clinical samples and may soon
become routine in clinical practice.55 Ultimately, WGS will be needed to assign
functionality to the remaining majority of the genome and decipher its role in
diseases.
Their more focused nature makes WES and targeted sequencing attractive options
for population and clinical studies.56,57 Although, as the name suggests, WES covers
only the exome and is therefore more limited than WGS, it is an important clinical
tool in the personalized medicine field.
Genetic diagnoses for certain diseases, like cancer, as well as genetic
characterization for other disorders can be achieved with this method in a more
cost-effective way than WGS.
In addition to the many applications that NGS has in sequencing DNA, it can also
be used for RNA analysis. This enables, for example, the genomes of RNA viruses,
such as SARS and influenza, to be determined. Importantly, RNA-seq is frequently
used in quantitative studies, facilitating not only the identification of transcribed
genes in a DNA genome, but also the level at which they are transcribed
(transcription level) according to the relative abundance of RNA transcripts.
Potential rearrangements of the DNA sequences may also be identified through
the identification of novel transcripts.58,59
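One common way to express relative transcript abundance is transcripts per million (TPM), which corrects read counts for transcript length and then scales them so that each sample sums to one million. The article does not name a specific normalization method, so TPM is used here as an assumed example, and the gene names, counts and lengths are invented.

```python
def transcripts_per_million(read_counts, lengths_bp):
    """Compute TPM from raw read counts and transcript lengths in base pairs."""
    # Reads per kilobase: correct for longer transcripts attracting more reads.
    rates = {
        gene: read_counts[gene] / (lengths_bp[gene] / 1000.0)
        for gene in read_counts
    }
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in rates}
    # Scale so that the values across all transcripts sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical data: geneB has three times the reads but is three times longer.
counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 3000}
print(transcripts_per_million(counts, lengths))
```

In this example both genes end up with the same TPM, showing why raw read counts alone are not a reliable measure of transcription level.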
RNA-seq RNA-sequencing
3G Third-generation sequencing
4G Fourth-generation sequencing
7. Hood LE, Hunkapiller MW, Smith LM. Automated DNA sequencing and
analysis of the human genome. Genomics. 1987;1(3):201-212.
doi:10.1016/0888-7543(87)90046-2
10. Shendure J, Porreca GJ, Reppas NB, et al. Molecular biology: Accurate
multiplex polony sequencing of an evolved bacterial genome. Science.
2005;309(5741):1728-1732. doi:10.1126/science.1117389
14. Rizzo JM, Buck MJ. Key principles and clinical applications of “next-
generation” DNA sequencing. Cancer Prev Res. 2012;5(7):887-900.
doi:10.1158/1940-6207.CAPR-11-0432
25. Drmanac R, Sparks AB, Callow MJ, et al. Human genome sequencing
using unchained base reads on self-assembling DNA nanoarrays. Science.
2010;327(5961):78-81. doi:10.1126/science.1181498
27. Peters EJ, McLeod HL. Editorial: Ability of whole-genome SNP arrays to
capture “must have” pharmacogenomic variants. Pharmacogenomics.
2008;9(11):1573-1577. doi:10.2217/14622416.9.11.1573
33. Roberts RJ, Carneiro MO, Schatz MC. The advantages of SMRT
sequencing. Genome Biol. 2013;14(6):405. doi:10.1186/gb-2013-14-6-405
35. Head SR, Kiyomi Komori H, LaMere SA, et al. Library construction for
next-generation sequencing: Overviews and challenges. Biotechniques.
2014;56(2):61-77. doi:10.2144/000114133
36. Goodwin S, McPherson JD, McCombie WR. Coming of age: Ten years of
next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333-351.
doi:10.1038/nrg.2016.49
37. Hess JF, Kohl TA, Kotrová M, et al. Library preparation for next
generation sequencing: A review of automation strategies. Biotechnol Adv.
2020;41. doi:10.1016/j.biotechadv.2020.107537
38. Metzker ML. Sequencing technologies – the next generation. Nat Rev
Genet. 2010;11(1):31-46. doi:10.1038/nrg2626
42. Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-
throughput sequencing data. Bioinformatics. 2012;28(24):3169-3177.
doi:10.1093/bioinformatics/bts605
45. Scholz MB, Lo CC, Chain PSG. Next generation sequencing and
bioinformatic bottlenecks: The current state of metagenomic data
analysis. Curr Opin Biotechnol. 2012;23(1):9-15.
doi:10.1016/j.copbio.2011.11.013
49. Plassais J, Kim J, Davis BW, et al. Whole genome sequencing of canids
reveals genomic regions under selection and variants influencing
morphology. Nat Commun. 2019;10(1):1-14. doi:10.1038/s41467-019-09373-w
51. Salipante SJ, SenGupta DJ, Cummings LA, Land TA, Hoogestraat DR,
Cookson BT. Application of whole-genome sequencing for bacterial strain
typing in molecular epidemiology. J Clin Microbiol. 2015;53(4):1072-1079.
doi:10.1128/JCM.03385-14
52. Varshney RK, Nayak SN, May GD, Jackson SA. Next-generation
sequencing technologies and their implications for crop genetics and
breeding. Trends Biotechnol. 2009;27(9):522-530.
doi:10.1016/j.tibtech.2009.05.006
56. Suwinski P, Ong CK, Ling MHT, Poh YM, Khan AM, Ong HS. Advancing
personalized medicine through the application of whole exome sequencing
and big data analytics. Front Genet. 2019;10(FEB):49.
doi:10.3389/fgene.2019.00049
62. Chiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet.
2019;20(6):341-355. doi:10.1038/s41576-019-0113-7
©2024 Technology Networks, all rights reserved, Part of the LabX Media Group