DNA Sequencing - Wikipedia
DNA sequencing is the process of determining the nucleic acid sequence – the order
of nucleotides in DNA. It includes any method or technology that is used to determine
the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of
rapid DNA sequencing methods has greatly accelerated biological and medical
research and discovery.[1][2]
The rapid speed of sequencing attained with modern DNA sequencing technology has
been instrumental in the sequencing of complete DNA sequences, or genomes, of
numerous types and species of life, including the human genome and other complete
DNA sequences of many animal, plant, and microbial species.
An example of the results of
automated chain-termination DNA
sequencing.
The first DNA sequences were obtained in the early 1970s by academic researchers
using laborious methods based on two-dimensional chromatography. Following the
development of fluorescence-based sequencing methods with a DNA sequencer,[6]
DNA sequencing has become easier and orders of magnitude faster.[7][8]
Applications
DNA sequencing may be used to determine the sequence of individual genes, larger
genetic regions (i.e. clusters of genes or operons), full chromosomes, or entire
genomes of any organism. DNA sequencing is also the most efficient way to
indirectly sequence RNA or proteins (via their open reading frames). In fact, DNA
sequencing has become a key technology in many areas of biology and other
sciences such as medicine, forensics, and anthropology.
Molecular biology
Sequencing is used in molecular biology to study genomes and the proteins they
encode. Information obtained using sequencing allows researchers to identify
changes in genes and noncoding DNA (including regulatory sequences), find associations
with diseases and phenotypes, and identify potential drug targets.
Evolutionary biology
Since DNA is an informative macromolecule in terms of transmission from one
generation to another, DNA sequencing is used in evolutionary biology to study how
different organisms are related and how they evolved. In February 2021, scientists
reported, for the first time, the sequencing of DNA from animal remains, a mammoth
in this instance, over a million years old, the oldest DNA sequenced to date.[9][10]
Metagenomics
The field of metagenomics involves identification of organisms present in a body of
water, sewage, dirt, debris filtered from the air, or swab samples from organisms.
Knowing which organisms are present in a particular environment is critical to
research in ecology, epidemiology, microbiology, and other fields. Sequencing
enables researchers to determine which types of microbes may be present in a
microbiome, for example.
Virology
As most viruses are too small to be seen by a light microscope, sequencing is one of
the main tools in virology to identify and study the virus.[11] Viral genomes can be
based in DNA or RNA. RNA viruses are more time-sensitive for genome sequencing,
as they degrade faster in clinical samples.[12] Traditional Sanger sequencing and next-
generation sequencing are used to sequence viruses in basic and clinical research, as
well as for the diagnosis of emerging viral infections, molecular epidemiology of viral
pathogens, and drug-resistance testing. There are more than 2.3 million unique viral
sequences in GenBank.[11] Recently, NGS has surpassed traditional Sanger as the
most popular approach for generating viral genomes.[11]
During the 1997 avian influenza outbreak, viral sequencing determined that the
influenza sub-type originated through reassortment between quail and poultry. This
led to legislation in Hong Kong that prohibited selling live quail and poultry together at
market. Viral sequencing can also be used to estimate when a viral outbreak began by
using a molecular clock technique.[12]
Medicine
Medical technicians may sequence genes (or, theoretically, full genomes) from
patients to determine if there is risk of genetic diseases. This is a form of genetic
testing, though some genetic tests may not involve DNA sequencing.
As of 2013, DNA sequencing was increasingly used to diagnose and treat rare
diseases. As more and more genes are identified that cause rare genetic diseases,
molecular diagnoses for patients become more mainstream. DNA sequencing allows
clinicians to identify genetic diseases, improve disease management, provide
reproductive counseling, and select more effective therapies.[13] Gene sequencing panels
are used to identify multiple potential genetic causes of a suspected disorder.[14]
Also, DNA sequencing may be useful for identifying a specific bacterium, allowing
more precise antibiotic treatment and thereby reducing the risk of creating antimicrobial
resistance in bacterial populations.[15][16][17][18][19][20]
Forensic investigation
DNA sequencing may be used along with DNA profiling methods for forensic
identification[21] and paternity testing. DNA testing has evolved tremendously in the
last few decades, to the point where a DNA print can be linked to what is under
investigation. The DNA patterns in fingerprints, saliva, hair follicles, and similar
samples distinguish each living organism from another. DNA testing detects specific
sequences in a DNA sample to produce a unique and individualized pattern.
In almost all organisms, DNA is synthesized in vivo using only the four canonical bases;
modification that occurs after replication creates other bases such as 5-methylcytosine.
However, some bacteriophages can incorporate a non-standard base directly.[26]
History
Discovery of DNA structure and function
Deoxyribonucleic acid (DNA) was first discovered and isolated by Friedrich Miescher
in 1869, but it remained under-studied for many decades because proteins, rather
than DNA, were thought to hold the genetic blueprint to life. This situation changed
after 1944 as a result of some experiments by Oswald Avery, Colin MacLeod, and
Maclyn McCarty demonstrating that purified DNA could change one strain of bacteria
into another. This was the first time that DNA was shown capable of transforming the
properties of cells.
In 1953, James Watson and Francis Crick put forward their double-helix model of
DNA, based on crystallized X-ray structures being studied by Rosalind Franklin.
According to the model, DNA is composed of two strands of nucleotides coiled
around each other, linked together by hydrogen bonds and running in opposite
directions. Each strand is composed of four complementary nucleotides – adenine
(A), cytosine (C), guanine (G) and thymine (T) – with an A on one strand always paired
with T on the other, and C always paired with G. They proposed that such a structure
allowed each strand to be used to reconstruct the other, an idea central to the passing
on of hereditary information between generations.[28]
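The pairing rule described above (A with T, C with G, strands running in opposite directions) is enough to rebuild one strand from the other. A minimal Python sketch, in which the function name `reverse_complement` is purely illustrative:

```python
# Watson-Crick pairing: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read 5'->3', for a strand given 5'->3'."""
    # Complement every base, then reverse because the strands are antiparallel.
    return strand.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    original = "ATGCGT"
    partner = reverse_complement(original)
    print(partner)
    # Applying the rule twice recovers the original strand.
    print(reverse_complement(partner) == original)
```

Applying the function to its own output returns the input, which mirrors the idea that each strand carries the information needed to reconstruct the other.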
Frederick Sanger, a pioneer of
sequencing. Sanger is one of the few
scientists who was awarded two
Nobel prizes, one for the sequencing
of proteins, and the other for the
sequencing of DNA.
The foundation for sequencing proteins was first laid by the work of Frederick Sanger
who by 1955 had completed the sequence of all the amino acids in insulin, a small
protein secreted by the pancreas. This provided the first conclusive evidence that
proteins were chemical entities with a specific molecular pattern rather than a
random mixture of material suspended in fluid. Sanger's success in sequencing
insulin spurred on x-ray crystallographers, including Watson and Crick, who by now
were trying to understand how DNA directed the formation of proteins within a cell.
Soon after attending a series of lectures given by Frederick Sanger in October 1954,
Crick began developing a theory which argued that the arrangement of nucleotides in
DNA determined the sequence of amino acids in proteins, which in turn helped
determine the function of a protein. He published this theory in 1958.[29]
RNA sequencing
RNA sequencing was one of the earliest forms of nucleotide sequencing. The major
landmark of RNA sequencing is the sequence of the first complete gene and the
complete genome of Bacteriophage MS2, identified and published by Walter Fiers and
his coworkers at the University of Ghent (Ghent, Belgium), in 1972[30] and 1976.[31]
Traditional RNA sequencing methods require the creation of a cDNA molecule which
must be sequenced.[32]
The first full DNA genome to be sequenced was that of bacteriophage φX174 in
1977.[43] Medical Research Council scientists deciphered the complete DNA
sequence of the Epstein-Barr virus in 1984, finding it contained 172,282 nucleotides.
Completion of the sequence marked a significant turning point in DNA sequencing
because it was achieved with no prior genetic profile knowledge of the virus.[44][8]
By 2001, shotgun sequencing methods had been used to produce a draft sequence of
the human genome.[52][53]
High-throughput sequencing (HTS) methods
Several new methods for DNA sequencing were developed in the mid to late 1990s
and were implemented in commercial DNA sequencers by 2000. Together these were
called the "next-generation" or "second-generation" sequencing (NGS) methods, in
order to distinguish them from the earlier methods, including Sanger sequencing. In
contrast to the first generation of sequencing, NGS technology is typically
characterized by being highly scalable, allowing the entire genome to be sequenced at
once. Usually, this is accomplished by fragmenting the genome into small pieces,
randomly sampling for a fragment, and sequencing it using one of a variety of
technologies, such as those described below. Sequencing an entire genome is possible because
multiple fragments are sequenced at once (giving it the name "massively parallel"
sequencing) in an automated process.
NGS technology has tremendously empowered researchers to look for insights into
health and anthropologists to investigate human origins, and is catalyzing the
"Personalized Medicine" movement. However, it has also opened the door to more
room for error. There are many software tools to carry out the computational analysis
of NGS data, often compiled at online platforms such as CSI NGS Portal, each with its
own algorithm. Even the parameters within one software package can change the
outcome of the analysis. In addition, the large quantities of data produced by DNA
sequencing have also required development of new methods and programs for
sequence analysis. Several efforts to develop standards in the NGS field have been
attempted to address these challenges, most of which have been small-scale efforts
arising from individual labs. Most recently, a large, organized, FDA-funded effort has
culminated in the BioCompute standard.
On 26 October 1990, Roger Tsien, Pepi Ross, Margaret Fahnestock and Allan J
Johnston filed a patent describing stepwise ("base-by-base") sequencing with
removable 3' blockers on DNA arrays (blots and single DNA molecules).[55] In 1996,
Pål Nyrén and his student Mostafa Ronaghi at the Royal Institute of Technology in
Stockholm published their method of pyrosequencing.[56]
On 1 April 1997, Pascal Mayer and Laurent Farinelli submitted patents to the World
Intellectual Property Organization describing DNA colony sequencing.[57] The DNA
sample preparation and random surface-polymerase chain reaction (PCR) arraying
methods described in this patent, coupled to Roger Tsien et al.'s "base-by-base"
sequencing method, are now implemented in Illumina's HiSeq genome sequencers.
In 1998, Phil Green and Brent Ewing of the University of Washington described their
phred quality score for sequencer data analysis,[58] a landmark analysis technique
that gained widespread adoption, and which is still the most common metric for
assessing the accuracy of a sequencing platform.[59]
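A phred quality score Q is related to the estimated probability P that a base call is wrong by Q = -10 log10(P), so Q20 corresponds to a 1 in 100 error rate and Q30 to 1 in 1,000. The Python sketch below illustrates the conversion. The function names are illustrative, and the "+33" ASCII offset used in `decode_phred33` is the common FASTQ convention, which is an assumption not stated in this article:

```python
import math


def phred_from_error_probability(p_error: float) -> float:
    """Convert a base-call error probability into a phred quality score."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability_from_phred(q: float) -> float:
    """Convert a phred quality score back into an error probability."""
    return 10.0 ** (-q / 10.0)


def decode_phred33(quality_string: str) -> list:
    """Decode a FASTQ quality string, assuming the common +33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


if __name__ == "__main__":
    print(phred_from_error_probability(0.001))
    print(error_probability_from_phred(20))
    print(decode_phred33("I5!"))
```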
Maxam-Gilbert sequencing
Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977 based
on chemical modification of DNA and subsequent cleavage at specific bases.[40] Also
known as chemical sequencing, this method allowed purified samples of double-
stranded DNA to be used without further cloning. This method's use of radioactive
labeling and its technical complexity discouraged extensive use after refinements in
the Sanger methods had been made.
Maxam-Gilbert sequencing requires radioactive labeling at one 5' end of the DNA and
purification of the DNA fragment to be sequenced. Chemical treatment then
generates breaks at a small proportion of one or two of the four nucleotide bases in
each of four reactions (G, A+G, C, C+T). The concentration of the modifying chemicals
is controlled to introduce on average one modification per DNA molecule. Thus a
series of labeled fragments is generated, from the radiolabeled end to the first "cut"
site in each molecule. The fragments in the four reactions are electrophoresed side
by side in denaturing acrylamide gels for size separation. To visualize the fragments,
the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands
each corresponding to a radiolabeled DNA fragment, from which the sequence may
be inferred.[40]
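Reading a Maxam-Gilbert gel amounts to asking, for each position, which of the four lanes (G, A+G, C, C+T) show a band. The following Python sketch is a simplified illustration: it records each band by the position of the cleaved base rather than by fragment length, and the helper names `simulate_lanes` and `read_gel` are hypothetical.

```python
LANES = ("G", "A+G", "C", "C+T")

# Lanes in which a band appears when the cleaved base is the given one.
LANES_FOR_BASE = {
    "G": {"G", "A+G"},
    "A": {"A+G"},
    "C": {"C", "C+T"},
    "T": {"C+T"},
}


def simulate_lanes(sequence):
    """Return, per lane, the set of positions (1-based) that give a band."""
    bands = {lane: set() for lane in LANES}
    for position, base in enumerate(sequence, start=1):
        for lane in LANES_FOR_BASE[base]:
            bands[lane].add(position)
    return bands


def read_gel(bands):
    """Infer the sequence from the pattern of bands across the four lanes."""
    longest = max((max(p) for p in bands.values() if p), default=0)
    calls = []
    for position in range(1, longest + 1):
        lit = {lane for lane in LANES if position in bands[lane]}
        # A band pattern that matches no base is reported as N (unknown).
        base = next((b for b, lanes in LANES_FOR_BASE.items() if lanes == lit), "N")
        calls.append(base)
    return "".join(calls)


if __name__ == "__main__":
    print(read_gel(simulate_lanes("GATTACA")))
```

A band in both the G and A+G lanes is read as G, a band in A+G alone as A, and likewise for C and T in the pyrimidine lanes.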
Chain-termination methods
The chain-termination method developed by Frederick Sanger and coworkers in 1977
soon became the method of choice, owing to its relative ease and reliability.[39][62]
When invented, the chain-terminator method used fewer toxic chemicals and lower
amounts of radioactivity than the Maxam and Gilbert method. Because of its
comparative ease, the Sanger method was soon automated and was the method
used in the first generation of DNA sequencers.
Sanger sequencing is the method which prevailed from the 1980s until the mid-
2000s. Over that period, great advances were made in the technique, such as
fluorescent labelling, capillary electrophoresis, and general automation. These
developments allowed much more efficient sequencing, leading to lower costs. The
Sanger method, in mass production form, is the technology which produced the first
human genome in 2001, ushering in the age of genomics. However, later in the
decade, radically different approaches reached the market, bringing the cost per
genome down from $100 million in 2001 to $10,000 in 2011.[63]
Sequencing by synthesis
The objective of sequential sequencing by synthesis (SBS) is to determine the
sequence of a DNA sample by detecting the incorporation of nucleotides by a DNA
polymerase. An engineered polymerase is used to synthesize a copy of a single
strand of DNA and the incorporation of each nucleotide is monitored. The principle of
real-time sequencing by synthesis was first described in 1993[64] with improvements
published some years later.[65] The key steps are highly similar for all embodiments of
SBS and include (1) amplification of the DNA (to enhance the subsequent signal) and
attachment of the DNA to be sequenced to a solid support, (2) generation of single-stranded
DNA on the solid support, (3) incorporation of nucleotides using an engineered
polymerase and (4) real-time detection of the incorporation of each nucleotide. Steps 3 and
4 are repeated and the sequence is assembled from the signals obtained in step 4.
This principle of real-time sequencing-by-synthesis has been used for almost all
massively parallel sequencing instruments, including 454, PacBio, IonTorrent, Illumina
and MGI.
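The repeated incorporate-and-detect loop (steps 3 and 4) can be illustrated with a toy model in Python. This sketch assumes a flow-based scheme in which one nucleotide type is offered at a time and the signal is the number of bases incorporated. The flow order "TACG" and the names `flowgram` and `call_bases` are arbitrary choices for illustration, not a description of any specific instrument.

```python
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flowgram(template, flow_order="TACG", max_flows=400):
    """Simulate cyclic nucleotide flows over a template strand.

    The template is listed in the order the polymerase reads it. Each flow
    yields (nucleotide, number of bases incorporated during that flow).
    """
    position = 0
    signals = []
    for flow in range(max_flows):
        if position >= len(template):
            break
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # Step 3: the polymerase extends while the offered base pairs.
        while position < len(template) and PAIR[template[position]] == nucleotide:
            incorporated += 1
            position += 1
        # Step 4: the detected signal is recorded for this flow.
        signals.append((nucleotide, incorporated))
    return signals


def call_bases(signals):
    """Assemble the synthesized strand from the recorded signals."""
    return "".join(nucleotide * count for nucleotide, count in signals)


if __name__ == "__main__":
    signals = flowgram("AACGT")
    print(signals)
    print(call_bases(signals))
```

The assembled read is the complement of the template, and a run of identical template bases shows up as a single flow with a signal greater than one.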
Large-scale sequencing and de novo sequencing
Large-scale sequencing often aims at sequencing very long DNA pieces, such as
whole chromosomes, although large-scale sequencing can also be used to generate
very large numbers of short sequences, such as found in phage display. For longer
targets such as chromosomes, common approaches consist of cutting (with
restriction enzymes) or shearing (with mechanical forces) large DNA fragments into
shorter DNA fragments. The fragmented DNA may then be cloned into a DNA vector
and amplified in a bacterial host such as Escherichia coli. Short DNA fragments
purified from individual bacterial colonies are individually sequenced and assembled
electronically into one long, contiguous sequence. Studies have shown that adding a
size selection step to collect DNA fragments of uniform size can improve sequencing
efficiency and accuracy of the genome assembly. In these studies, automated sizing
has proven to be more reproducible and precise than manual gel sizing.[66][67][68]
The term "de novo sequencing" specifically refers to methods used to determine the
sequence of DNA with no previously known sequence. De novo translates from Latin
as "from the beginning". Gaps in the assembled sequence may be filled by primer
walking. The different strategies have different tradeoffs in speed and accuracy;
shotgun methods are often used for sequencing large genomes, but their assembly is
complex and difficult, particularly because sequence repeats often cause gaps in
genome assembly.
Most sequencing approaches use an in vitro cloning step to amplify individual DNA
molecules, because their molecular detection methods are not sensitive enough for
single molecule sequencing. Emulsion PCR[69] isolates individual DNA molecules
along with primer-coated beads in aqueous droplets within an oil phase. A
polymerase chain reaction (PCR) then coats each bead with clonal copies of the DNA
molecule followed by immobilization for later sequencing. Emulsion PCR is used in
the methods developed by Margulies et al. (commercialized by 454 Life Sciences),
Shendure and Porreca et al. (also known as "polony sequencing") and SOLiD
sequencing (developed by Agencourt, later Applied Biosystems, now Life
Technologies).[70][71][72] Emulsion PCR is also used in the GemCode and Chromium
platforms developed by 10x Genomics.[73]
Shotgun sequencing
Shotgun sequencing is a sequencing method designed for analysis of DNA
sequences longer than 1000 base pairs, up to and including entire chromosomes.
This method requires the target DNA to be broken into random fragments. After
sequencing individual fragments using the chain termination method, the sequences
can be reassembled on the basis of their overlapping regions.[74]
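Reassembly from overlapping regions can be sketched with a greedy merge: repeatedly join the two reads with the longest suffix-to-prefix overlap. The Python example below is a minimal illustration on error-free toy reads. It is not how production assemblers work, and the names `overlap` and `greedy_assemble` and the `min_length` parameter are hypothetical.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge reads greedily by largest overlap; return the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["GCAATCC", "ATGGCGT", "GCGTGCAA"]
    print(greedy_assemble(reads))
```

Greedy merging can go wrong when the target contains repeats longer than the overlaps, which is one reason real genome assembly is considerably harder than this sketch suggests.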
High-throughput methods
Multiple, fragmented sequence reads
must be assembled together on the
basis of their overlapping areas.
The high demand for low-cost sequencing has driven the development of high-
throughput sequencing technologies that parallelize the sequencing process,
producing thousands or millions of sequences concurrently.[76][77][78] High-throughput
sequencing technologies are intended to lower the cost of DNA sequencing beyond
what is possible with standard dye-terminator methods.[79] In ultra-high-throughput
sequencing as many as 500,000 sequencing-by-synthesis operations may be run in
parallel.[80][81][82] Such technologies led to the ability to sequence an entire human
genome in as little as one day.[83] As of 2019, corporate leaders in the development of
high-throughput sequencing products included Illumina, Qiagen and ThermoFisher
Scientific.[83]
Comparison of high-throughput sequencing methods[84][85]
Method | Read length | Accuracy (single read, not consensus)
Single-molecule real-time sequencing (Pacific Biosciences) | 30,000 bp (N50); maximum read length >100,000 bases[86][87][88] | 87% raw-read accuracy[89]
Ion semiconductor (Ion Torrent sequencing) | up to 600 bp[94] | 99.6%[95]
Illumina HiSeq X | 300 bp |
cPAS (BGI/MGI) | BGISEQ-500, MGISEQ-2000: 50-300 bp[97] |
Sequencing by ligation (SOLiD sequencing) | 50+35 or 50+50 bp | 99.9%
GenapSys Sequencing | Around 150 bp single-end | 99.9% (Phred30)
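The table quotes one read length as an N50 value. Under the usual definition, N50 is the length L such that reads of length L or longer together contain at least half of all sequenced bases. A short Python sketch of that calculation, with an illustrative function name:

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because it is weighted by bases rather than by read count, N50 is less sensitive to a large number of very short reads than the mean read length is.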
Early industrial research into nanopore sequencing was based on a technique called
'exonuclease sequencing', where the readout of electrical signals occurred as
nucleotides passed by alpha(α)-hemolysin pores covalently bound with
cyclodextrin.[103] However, the subsequent commercial method, 'strand sequencing',
sequenced DNA bases in an intact strand.
Two main areas of nanopore sequencing in development are solid-state nanopore
sequencing and protein-based nanopore sequencing. Protein nanopore sequencing
utilizes membrane protein complexes such as α-hemolysin, MspA (Mycobacterium
smegmatis Porin A) or CsgG, which show great promise given their ability to
distinguish between individual nucleotides and groups of nucleotides.[104] In contrast, solid-state
nanopore sequencing utilizes synthetic materials such as silicon nitride and
aluminum oxide, and it is preferred for its superior mechanical properties and thermal and
chemical stability.[105] The fabrication method is essential for this type of sequencing
given that the nanopore array can contain hundreds of pores with diameters smaller
than eight nanometers.[104]
The concept originated from the idea that single-stranded DNA or RNA molecules can
be electrophoretically driven in a strict linear sequence through a biological pore that
can be less than eight nanometers wide, and can be detected because the molecules
alter the ionic current while moving through the pore. The pore contains a detection
region capable of recognizing different bases, with each base generating a
time-specific signal; these signals, which correspond to the sequence of bases as they
cross the pore, are then evaluated.[105] Precise control over the DNA transport through the pore is
crucial for success. Various enzymes such as exonucleases and polymerases have
been used to moderate this process by positioning them near the pore's entrance.[106]
Polony sequencing
The polony sequencing method, developed in the laboratory of George M. Church at
Harvard, was among the first high-throughput sequencing systems and was used to
sequence a full E. coli genome in 2005.[107] It combined an in vitro paired-tag library
with emulsion PCR, an automated microscope, and ligation-based sequencing
chemistry to sequence an E. coli genome at an accuracy of >99.9999% and a cost
approximately 1/9 that of Sanger sequencing.[107] The technology was licensed to
Agencourt Biosciences, subsequently spun out into Agencourt Personal Genomics,
and eventually incorporated into the Applied Biosystems SOLiD platform. Applied
Biosystems was later acquired by Life Technologies, now part of Thermo Fisher
Scientific.
454 pyrosequencing
A parallelized version of pyrosequencing was developed by 454 Life Sciences, which
has since been acquired by Roche Diagnostics. The method amplifies DNA inside
water droplets in an oil solution (emulsion PCR), with each droplet containing a single
DNA template attached to a single primer-coated bead that then forms a clonal
colony. The sequencing machine contains many picoliter-volume wells each
containing a single bead and sequencing enzymes. Pyrosequencing uses luciferase
to generate light for detection of the individual nucleotides added to the nascent DNA,
and the combined data are used to generate sequence reads.[70] This technology
provides intermediate read length and price per base compared to Sanger sequencing
on one end and Solexa and SOLiD on the other.[79]
In Illumina (Solexa) sequencing, DNA molecules and primers are first attached on a slide or flow cell
and amplified with polymerase so that local clonal DNA colonies, later coined "DNA
clusters", are formed. To determine the sequence, four types of reversible terminator
bases (RT-bases) are added and non-incorporated nucleotides are washed away. A
camera takes images of the fluorescently labeled nucleotides. Then the dye, along
with the terminal 3' blocker, is chemically removed from the DNA, allowing for the next
cycle to begin. Unlike pyrosequencing, the DNA chains are extended one nucleotide at
a time and image acquisition can be performed at a delayed moment, allowing for
very large arrays of DNA colonies to be captured by sequential images taken from a
single camera.
Decoupling the enzymatic reaction and the image capture allows for optimal
throughput and theoretically unlimited sequencing capacity. With an optimal
configuration, the ultimately reachable instrument throughput is thus dictated solely
by the analog-to-digital conversion rate of the camera, multiplied by the number of
cameras and divided by the number of pixels per DNA colony required for visualizing
them optimally (approximately 10 pixels/colony). As of 2012, with cameras operating at
more than 10 MHz A/D conversion rates and available optics, fluidics and enzymatics,
throughput could reach multiples of 1 million nucleotides/second, corresponding roughly
to 1 human genome equivalent at 1x coverage per hour per instrument, and 1 human
genome re-sequenced (at approx. 30x) per day per instrument (equipped with a single
camera).[111]
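The throughput figures above follow from simple arithmetic, which the Python sketch below reproduces. The genome size of 3.1 billion base pairs is an approximation introduced here, and the function names are illustrative.

```python
HUMAN_GENOME_BP = 3.1e9  # assumption: approximate haploid human genome size


def nucleotides_per_second(adc_rate_hz, cameras=1, pixels_per_colony=10):
    """Camera pixel rate times number of cameras, divided by pixels per colony."""
    return adc_rate_hz * cameras / pixels_per_colony


def hours_for_coverage(coverage, rate_nt_per_s, genome_bp=HUMAN_GENOME_BP):
    """Hours needed to read the genome at the given fold coverage."""
    return coverage * genome_bp / rate_nt_per_s / 3600.0


if __name__ == "__main__":
    rate = nucleotides_per_second(10e6)   # 10 MHz camera, 10 pixels/colony
    print(rate)                           # nucleotides per second
    print(hours_for_coverage(1, rate))    # hours for 1x coverage
    print(hours_for_coverage(30, rate))   # hours for 30x coverage
```

With one 10 MHz camera and 10 pixels per colony the rate is 1 million nucleotides per second, which gives a little under one hour for 1x coverage and roughly 26 hours for 30x, consistent with the "per hour" and "per day" figures quoted.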
The two technologies that form the basis for this high-throughput sequencing
technology are DNA nanoballs (DNB) and patterned arrays for nanoball attachment to
a solid surface.[112] DNA nanoballs are simply formed by denaturing double-stranded,
adapter-ligated libraries and ligating the forward strand only to a splint
oligonucleotide to form a ssDNA circle. Faithful copies of the circles containing the
DNA insert are produced utilizing Rolling Circle Amplification that generates
approximately 300–500 copies. The long strand of ssDNA folds upon itself to
produce a three-dimensional nanoball structure that is approximately 220 nm in
diameter. Making DNBs replaces the need to generate PCR copies of the library on the
flow cell and as such can remove large proportions of duplicate reads, adapter-
adapter ligations and PCR induced errors.[114]
A BGI MGISEQ-2000RS sequencer
SOLiD sequencing
Library preparation for the SOLiD
platform
Microfluidic Systems
There are two main microfluidic systems that are used to sequence DNA:
droplet-based microfluidics and digital microfluidics. Microfluidic devices solve many of the
limitations of current sequencing arrays.
Abate et al. studied the use of droplet-based microfluidic devices for DNA
sequencing.[4] These devices have the ability to form and process picoliter-sized
droplets at the rate of thousands per second. The devices were created from
polydimethylsiloxane (PDMS) and used Förster resonance energy transfer (FRET)
assays to read the sequences of DNA encompassed in the droplets. Each position on
the array tested for a specific 15-base sequence.[4]
DNA sequencing research using microfluidics can also be applied to the
sequencing of RNA, using similar droplet microfluidic techniques such as the
inDrops method.[126] This suggests that many of these DNA sequencing techniques can
be applied further and used to understand more about genomes and
transcriptomes.
Methods in development
DNA sequencing methods currently under development include reading the sequence
as a DNA strand transits through nanopores (a method that is now commercial but
subsequent generations such as solid-state nanopores are still in
development),[127][128] and microscopy-based techniques, such as atomic force
microscopy or transmission electron microscopy that are used to identify the
positions of individual nucleotides within long DNA fragments (>5,000 bp) by
nucleotide labeling with heavier elements (e.g., halogens) for visual detection and
recording.[129][130] Third generation technologies aim to increase throughput and
decrease the time to result and cost by eliminating the need for excessive reagents
and harnessing the processivity of DNA polymerase.[131]
The use of tunnelling currents has the potential to sequence orders of magnitude
faster than ionic current methods and the sequencing of several DNA oligomers and
micro-RNA has already been achieved.[134]
Sequencing by hybridization
Sequencing by hybridization is a non-enzymatic method that uses a DNA microarray. A
single pool of DNA whose sequence is to be determined is fluorescently labeled and
hybridized to an array containing known sequences. Strong hybridization signals from
a given spot on the array identify its sequence in the DNA being sequenced.[135]
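Once the array has reported which short probes hybridize, the target can in principle be rebuilt by chaining probes that overlap by all but one base. The Python sketch below illustrates this under strong simplifying assumptions: every probe of length k present in the target is detected without error, and no (k-1)-base overlap is ambiguous. The names `spectrum` and `reconstruct` are hypothetical.

```python
def spectrum(sequence, k):
    """All length-k substrings of the sequence (the idealized array readout)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reconstruct(kmers):
    """Chain k-mers by (k-1)-base overlaps; fail if the chain is ambiguous."""
    kmers = set(kmers)
    suffixes = {kmer[1:] for kmer in kmers}
    # The first k-mer is the one whose prefix is not the suffix of any k-mer.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot determine a unique starting probe")
    current = starts[0]
    sequence = current
    remaining = kmers - {current}
    while remaining:
        successors = [kmer for kmer in remaining if kmer[:-1] == current[1:]]
        if len(successors) != 1:
            raise ValueError("ambiguous or broken overlap chain")
        current = successors[0]
        sequence += current[-1]
        remaining.remove(current)
    return sequence


if __name__ == "__main__":
    probes = spectrum("ATGCGTAC", 4)
    print(reconstruct(probes))
```

Repetitive targets violate the uniqueness assumption, in which case this sketch raises an error rather than guessing.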
Microscopy-based techniques
This approach directly visualizes the sequence of DNA molecules using electron
microscopy. The first identification of DNA base pairs within intact DNA molecules
has been demonstrated by enzymatically incorporating modified bases, which contain
atoms of increased atomic number: individually labeled bases were directly visualized
and identified within a synthetic 3,272 base-pair DNA molecule and a 7,249 base-pair
viral genome.[145]
RNAP sequencing
This method is based on use of RNA polymerase (RNAP), which is attached to a
polystyrene bead. One end of the DNA to be sequenced is attached to another bead, with
both beads being placed in optical traps. RNAP motion during transcription brings the
beads closer together, and the change in their relative distance can be recorded at
single-nucleotide resolution. The sequence is deduced from four readouts, each taken
with a lowered concentration of one of the four nucleotide types, similarly to the
Sanger method.[146] Sequence information is deduced by comparing the known sequence
regions to the unknown sequence regions.[147]
Market share
While there are many different ways to sequence DNA, only a few dominate the
market. In 2022, Illumina had about 80% of the market; the rest of the market is taken
by only a few players (PacBio, Oxford Nanopore, 454, MGI).[149]
Sample preparation
The success of any DNA sequencing protocol relies upon the DNA or RNA sample
extraction and preparation from the biological material of interest.
A successful DNA extraction will yield a DNA sample with long, non-degraded strands.
A successful RNA extraction will yield an RNA sample that should be converted to
complementary DNA (cDNA) using reverse transcriptase, a DNA polymerase that
synthesizes a complementary DNA based on existing strands of RNA in a PCR-like
manner.[150]
Complementary DNA can then be processed the same way as genomic DNA.
After DNA or RNA extraction, samples may require further preparation depending on
the sequencing method. For Sanger sequencing, either cloning procedures or PCR are
required prior to sequencing. In the case of next-generation sequencing methods,
library preparation is required before processing.[151] Assessing the quality and
quantity of nucleic acids both after extraction and after library preparation identifies
degraded, fragmented, and low-purity samples and helps yield high-quality sequencing
data.[152]
Development initiatives
Each year the National Human Genome Research Institute, or NHGRI, promotes
grants for new research and developments in genomics. 2010 grants and 2011
candidates include continuing work in microfluidic, polony and base-heavy
sequencing methodologies.[154]
Computational challenges
The sequencing technologies described here produce raw data that needs to be
assembled into longer sequences such as complete genomes (sequence assembly).
There are many computational challenges in achieving this, such as the evaluation of
the raw sequence data, which is done by programs and algorithms such as Phred and
Phrap. Other challenges involve repetitive sequences, which often prevent
complete genome assemblies because they occur in many places in the genome. As
a consequence, many sequences may not be assigned to particular chromosomes.
The production of raw sequence data is only the beginning of its detailed
bioinformatical analysis.[155] New methods for sequencing and for correcting
sequencing errors have also been developed.[156]
Read trimming
Sometimes, the raw reads produced by the sequencer are correct and precise only in
a fraction of their length. Using the entire read may introduce artifacts in the
downstream analyses like genome assembly, SNP calling, or gene expression
estimation. Two classes of trimming programs have been introduced, based on the
window-based or the running-sum classes of algorithms.[157] This is a partial list of
the trimming algorithms currently available, specifying the algorithm class they
belong to:
Read Trimming Algorithms
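The two algorithm classes named above can be sketched in Python as follows. Both functions take a list of per-base phred scores and return the index at which to cut the read (bases before the index are kept). These are simplified illustrations of each class, not the implementation of any particular trimming program; the function names, window size and threshold are hypothetical.

```python
def sliding_window_trim(quals, window=4, threshold=20):
    """Window-based: cut at the first window whose mean quality is too low."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            return start
    return len(quals)


def running_sum_trim(quals, threshold=20):
    """Running-sum: scan from the 3' end, summing (threshold - quality),
    and cut where that running sum reaches its maximum."""
    best_sum = 0
    running = 0
    cut = len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += threshold - quals[i]
        if running < 0:
            break  # the good-quality bases now outweigh the bad tail
        if running > best_sum:
            best_sum = running
            cut = i
    return cut


if __name__ == "__main__":
    quals = [38, 38, 37, 36, 35, 30, 22, 12, 8, 5, 2, 2]
    print(sliding_window_trim(quals))
    print(running_sum_trim(quals))
```

On the same read the two approaches can choose different cut points, since the window method reacts to a local drop in mean quality while the running-sum method weighs the whole low-quality tail.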
Ethical issues
Human genetics has been included within the field of bioethics since the early
1970s[164] and the growth in the use of DNA sequencing (particularly high-throughput
sequencing) has introduced a number of ethical issues. One key issue is the
ownership of an individual's DNA and the data produced when that DNA is
sequenced.[165] Regarding the DNA molecule itself, the leading legal case on this
topic, Moore v. Regents of the University of California (1990), ruled that individuals have
no property rights to discarded cells or any profits made using these cells (for
instance, as a patented cell line). However, individuals have a right to informed
consent regarding removal and use of cells. Regarding the data produced through
DNA sequencing, Moore gives the individual no rights to the information derived from
their DNA.[165]
As DNA sequencing becomes more widespread, the storage, security and sharing of
genomic data have also become more important.[165][166] For instance, one concern is
that insurers may use an individual's genomic data to modify their quote, depending
on the perceived future health of the individual based on their DNA.[166][167] In May
2008, the Genetic Information Nondiscrimination Act (GINA) was signed in the United
States, prohibiting discrimination on the basis of genetic information with respect to
health insurance and employment.[168][169] In 2012, the US Presidential Commission
for the Study of Bioethical Issues reported that existing privacy legislation for DNA
sequencing data such as GINA and the Health Insurance Portability and
Accountability Act were insufficient, noting that whole-genome sequencing data was
particularly sensitive, as it could be used to identify not only the individual from whom
the data was created, but also their relatives.[170][171]
In most of the United States, DNA that is "abandoned", such as that found on a licked
stamp or envelope, coffee cup, cigarette, chewing gum, household trash, or hair that
has fallen on a public sidewalk, may legally be collected and sequenced by anyone,
including the police, private investigators, political opponents, or people involved in
paternity disputes. As of 2013, eleven states had laws that could be interpreted to
prohibit "DNA theft".[172]
Ethical issues have also been raised by the increasing use of genetic variation
screening, both in newborns, and in adults by companies such as 23andMe.[173][174] It
has been asserted that screening for genetic variations can be harmful, increasing
anxiety in individuals who have been found to have an increased risk of disease.[175]
For example, in one case noted in Time, doctors screening an ill baby for genetic
variants chose not to inform the parents of an unrelated variant linked to dementia
due to the harm it would cause to the parents.[176] However, a 2011 study in The New
England Journal of Medicine found that individuals undergoing disease risk
profiling did not show increased levels of anxiety.[175] The development of
next-generation sequencing technologies such as nanopore-based sequencing has
raised further ethical concerns.[177]
See also
Bioinformatics – Computational analysis of large, complex sets of biological data
Cancer genome sequencing
Circular consensus sequencing
DNA computing – Computing using molecular biology hardware
DNA field-effect transistor – Transistor which uses the field-effect due to the partial charges of DNA
DNA sequencing theory – Biological theory
DNA sequencer – A scientific instrument used to automate the DNA sequencing process
Genographic Project – Citizen science project
Genome project – Scientific endeavours to determine the complete genome sequence of an organism
Genome sequencing of endangered species – DNA testing for endangerment assessment
Genome skimming – Method of genome sequencing
IsoBase – Functionally related proteins across PPI networks
Linked-read sequencing
Jumping library
Nucleic acid sequence – Succession of nucleotides in a nucleic acid
Multiplex ligation-dependent probe amplification
Personalized medicine – Medical model that tailors medical practices to the individual patient
Protein sequencing – Sequencing of amino acid arrangement in a protein
Sequence mining
Sequence profiling tool
Sequencing by hybridization – Method for determining the constituent nucleotides of a fixed size in a strand of DNA
Sequencing by ligation – DNA sequencing method that uses the enzyme DNA ligase to identify the nucleotide present at a given position in a DNA sequence
TIARA (database) – Database of personal genomics information
Transmission electron microscopy DNA sequencing – Single-molecule sequencing technology
Notes
1. "Next-generation" remains in broad use as of 2019. For instance, Straiton J, Free T,
Sawyer A, Martin J (February 2019). "From Sanger Sequencing to Genome Databases and
Beyond" (https://doi.org/10.2144%2Fbtn-2019-0011). BioTechniques. 66 (2): 60–63.
doi:10.2144/btn-2019-0011 (https://doi.org/10.2144%2Fbtn-2019-0011).
PMID 30744413 (https://pubmed.ncbi.nlm.nih.gov/30744413). "Next-generation
sequencing (NGS) technologies have revolutionized genomic research. (opening
sentence of the article)"
References
7. Pettersson E, Lundeberg J, Ahmadian A (February 2009). "Generations of sequencing
technologies" (https://doi.org/10.1016%2Fj.ygeno.2008.10.003). Genomics. 93 (2):
105–11. doi:10.1016/j.ygeno.2008.10.003
(https://doi.org/10.1016%2Fj.ygeno.2008.10.003).
PMID 18992322 (https://pubmed.ncbi.nlm.nih.gov/18992322).
8. Jay E, Bambara R, Padmanabhan R, Wu R (March 1974). "DNA sequence analysis: a
general, simple and rapid method for sequencing large oligodeoxyribonucleotide
fragments by mapping" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC344020).
Nucleic Acids Research. 1 (3): 331–53. doi:10.1093/nar/1.3.331
(https://doi.org/10.1093%2Fnar%2F1.3.331).
PMC 344020 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC344020).
PMID 10793670 (https://pubmed.ncbi.nlm.nih.gov/10793670).
Retrieved from "https://en.wikipedia.org/w/index.php?title=DNA_sequencing&oldid=1242799157"