ALGORITHMS FOR
NEXT-GENERATION SEQUENCING
Wing-Kin Sung
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
1 Introduction
  1.1 DNA, RNA, protein and cells
  1.2 Sequencing technologies
  1.3 First-generation sequencing
  1.4 Second-generation sequencing
    1.4.1 Template preparation
    1.4.2 Base calling
    1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides
    1.4.4 Polymerase-mediated methods based on unmodified nucleotides
    1.4.5 Ligase-mediated method
  1.5 Third-generation sequencing
    1.5.1 Single-molecule real-time sequencing
    1.5.2 Nanopore sequencing method
    1.5.3 Direct imaging of DNA using electron microscopy
  1.6 Comparison of the three generations of sequencing
  1.7 Applications of sequencing
  1.8 Summary and further reading
  1.9 Exercises
8 RNA-seq
  8.1 Introduction
  8.2 High-throughput methods to study the transcriptome
  8.3 Application of RNA-seq
  8.4 Computational Problems of RNA-seq
  8.5 RNA-seq read mapping
    8.5.1 Features used in RNA-seq read mapping
      8.5.1.1 Transcript model
      8.5.1.2 Splice junction signals
    8.5.2 Exon-first approach
    8.5.3 Seed-and-extend approach
  8.6 Construction of isoforms
  8.7 Estimating expression level of each transcript
    8.7.1 Estimating transcript abundances when every read maps to exactly one transcript
    8.7.2 Estimating transcript abundances when a read maps to multiple isoforms
    8.7.3 Estimating gene abundance
References
Index
Preface
for describing the alignments of the NGS reads on the reference genome. BED,
VCF and WIG formats are annotation formats.
To develop methods for processing NGS data, we need efficient algorithms
and data structures. Chapter 3 briefly describes these techniques.
Chapter 4 studies read mappers. Read mappers align the NGS reads on
the reference genome. The input is a set of raw reads in fasta or fastq files.
The read mapper will align each raw read on the reference genome, that is,
identify the region in the reference genome which is highly similar to the read.
Then, the read mapper will output all these alignments in a SAM or BAM
file. This is the basic step for many NGS applications. (It is the first step for
the methods in Chapters 6−9.)
Chapter 5 studies the de novo assembly problem. Given a set of raw reads
extracted from whole genome sequencing of some sample genome, de novo
assembly aims to stitch the raw reads together to reconstruct the genome.
It enables us to reconstruct novel genomes, such as those of plants and bacteria.
De novo assembly involves a few steps: error correction, contig assembly (de Bruijn
graph approach or base-by-base extension approach), scaffolding and gap filling.
This chapter describes techniques developed for these steps.
Chapter 6 discusses the problem of identifying single nucleotide variations
(SNVs) and small insertions/deletions (indels) in an individual genome. The
genome of every individual is highly similar to the reference human genome.
However, each genome is still different from the reference genome. On average,
there is 1 single nucleotide variation in every 1000 bases and 1 small indel in
every 3000 bases. To discover these variations, we can first perform whole
genome sequencing or exome sequencing of the individual genome to obtain
a set of raw reads. After aligning the raw reads on the reference genome, we
use SNV callers and indel callers to call SNVs and small indels. This chapter
is devoted to discussing techniques used in SNV callers and indel callers.
Apart from SNVs and small indels, copy number variations (CNVs) and
structural variations (SVs) are the other types of variations that appear in our
genome. CNVs and SVs are not as frequent as SNVs and indels. However, they
are more prone to change the phenotype. Hence, it is important to understand
them. Chapter 7 is devoted to studying techniques used in CNV callers and
SV callers.
All of the above technologies are related to genome sequencing. We can also
sequence RNA. This technology is known as RNA-seq. Chapter 8 studies methods
for analyzing RNA-seq. By applying computational methods on RNA-seq,
we can recover the transcriptome. More precisely, RNA-seq enables us to identify
exons and splice junctions. Then, we can predict the isoforms of the genes.
We can also determine the expression of each transcript and each gene.
By combining chromatin immunoprecipitation and next-generation sequencing,
we can sequence genome regions that are bound by some transcription
factors or with epigenetic marks. Such technology is known as ChIP-seq.
The computational methods that identify those binding sites are known
Wing-Kin Sung
Chapter 1
Introduction
1 The actual term “genomics” is thought to have been coined by Dr. Tom Roderick, a
geneticist at the Jackson Laboratory (Bar Harbor, ME) at a meeting held in Maryland on
the mapping of the human genome in 1986.
5' - A   C   G   T   A   G   C   T  - 3'
     ||  ||| ||| ||  ||  ||| ||| ||
3' - T   G   C   A   T   C   G   A  - 5'
FIGURE 1.1: The double-stranded DNA. The two strands show a comple-
mentary base pairing.
3. Separation by electrophoresis.
Step 1 amplifies the DNA template. The DNA template is inserted into
the plasmid vector; then the plasmid vector is inserted into the host cells for
cloning. By growing the host cells, we obtain many copies of the same DNA
template.
Step 2 generates all possible prefixes of the DNA template. Two techniques
have been proposed for this step: (1) the Maxam-Gilbert technique [194]
and (2) the chain termination methodology (Sanger method) [259, 260]. The
Maxam-Gilbert technique relies on the cleaving of nucleotides by chemicals.
Four different chemicals are used to generate all sequences ending with A, C, G
and T, respectively. This allows us to generate all possible prefixes of the template.
This technique is most efficient for short DNA sequences. However, it
is considered unsafe because of the extensive use of toxic chemicals.
The chain termination methodology (Sanger method) is a better alternative.
Given a single-stranded DNA template, the method performs DNA
polymerase-dependent synthesis in the presence of (1) natural deoxynucleotides
(dNTPs) and (2) dideoxynucleotides (ddNTPs). ddNTPs serve as
non-reversible synthesis terminators (see Figure 1.2(a,b)). The DNA synthesis
reaction is randomly terminated whenever a ddNTP is added to the growing
oligonucleotide chain, resulting in truncated products of varying lengths with
an appropriate ddNTP at their 3’ terminus.
FIGURE 1.2: (a) The chemical reaction for the incorporation of dATP into
the growing DNA strand. (b) The chemical reaction for the incorporation of
ddATP into the growing DNA strand. The vertical bar behind A indicates
that the extension of the DNA strand is terminated.
[Figure: the Sanger sequencing workflow. Panel labels: DNA template; insert into
vector; insert into host cell; cloning; cyclic sequencing; electrophoresis and readout.]
After we obtain all possible prefixes of the DNA template, the product is
a mixture of DNA fragments of different lengths. We can separate these DNA
fragments by their lengths using gel electrophoresis (Step 3). Gel electrophoresis
is based on the fact that DNA is negatively charged. When an electrical
field is applied to a mixture of DNA on a gel, the DNA fragments will move
from the negative pole to the positive pole. Due to friction, short DNA fragments
travel faster than long DNA fragments. Hence, the gel electrophoresis
separates the mixture into bands, each containing DNA molecules of the same
length.
Using the fluorescent tags attached to the terminal ddNTPs (we have
4 different colors for the 4 different ddNTPs), the DNA fragments ending
with different nucleotides will be labeled with different fluorescent dyes. By
detecting the light emitted from different bands, the DNA sequence of the
template will be revealed (Step 4).
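The prefix-generation, separation and readout steps can be mimicked in a few lines. The following is a toy sketch under strong assumptions: the fragments are error-free, there is exactly one fragment per length, and the helper names `terminated_fragments` and `read_gel` are hypothetical. It does not model the chemistry.

```python
import random


def terminated_fragments(strand):
    """Step 2 (idealized): every prefix of the synthesized strand.

    Each prefix stands for a product that ends in a labeled ddNTP.
    """
    return [strand[:k] for k in range(1, len(strand) + 1)]


def read_gel(fragments):
    """Steps 3 and 4 (idealized): order the fragments by length, as the gel
    does, then read the labeled terminal base of each band in turn."""
    bands = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in bands)


strand = "CGTAGCCGTATAC"
mixture = terminated_fragments(strand)
random.shuffle(mixture)  # the reaction product is an unordered mixture
recovered = read_gel(mixture)
```

Because each length occurs once, sorting by length and reading the last base of each fragment reproduces the strand regardless of how the mixture was shuffled.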
In summary, the Sanger method can generate sequences of length ∼800 bp.
The process can be fully automated and hence it was a popular DNA sequenc-
Given a set of DNA fragments, the template preparation step first gener-
ates a DNA template for each DNA fragment. The DNA template is created
by ligating adaptor sequences to the two ends of the target DNA fragment (see
Figure 1.4(a)). Then, the templates are amplified using PCR. There are two
common methods for amplifying the templates: (1) emulsion PCR (emPCR)
and (2) solid-phase amplification (Bridge PCR).
emPCR amplifies each DNA template on a bead. First, one DNA template
and one bead are enclosed within a water drop in oil. The surface
of every bead is coated with a primer corresponding to one type of adaptor.
The DNA template will hybridize with one primer on the surface of the bead.
Then, the template is PCR amplified within the water drop. Figure 1.4(b) illustrates
the emPCR. emPCR is used by 454, Ion Torrent and SOLiD.
For bridge PCR, the amplification is done on a flat surface (say, glass),
which is coated with two types of primers, corresponding to the adaptors.
Each DNA template is first hybridized to one primer on the flat surface.
Amplification proceeds in cycles, with one end of each bridge tethered to the
surface. Figure 1.4(c) illustrates the bridge PCR process. Bridge PCR is used
by Illumina.
Although PCR can amplify DNA templates, there is amplification bias.
Experiments revealed that templates that are AT-rich or GC-rich have a lower
amplification efficiency. This limitation creates uneven sequencing of the DNA
templates in the sample.
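Since the bias depends on base composition, a quick way to see which templates may be affected is to compute their GC content. The sketch below is illustrative only: the thresholds 0.3 and 0.7 and the function names are assumptions chosen for the example, not values from the text.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_extreme_templates(templates, low=0.3, high=0.7):
    """Return the templates whose GC content lies outside [low, high].

    The default thresholds are arbitrary illustration values.
    """
    return [t for t in templates if not (low <= gc_content(t) <= high)]


flagged = flag_extreme_templates(["AAAAAT", "ACGT", "GGGCCC"])
```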
FIGURE 1.4: (a) From the DNA fragments, DNA template is created by
attaching the two ends with adaptor sequences. (b) Amplifying the template
by emPCR. (c) Amplifying the template by bridge PCR.
[Figure residue: (a) a PCR clone of the template; (b) one cycle of sequencing with
reversible terminators: add reversible terminator dGTP, wash and scan, reverse the
termination after scanning, then repeat the steps to sequence the other bases.]
[Figure residue: panels (a), (b) and (c), including a plot of intensity (1 to 6)
against the repeated flow order ACGT.]
a high-density array of wells, and each well contains one template. In each
iteration, a single type of dNTP flows across the wells. If the dNTP is complementary
to the template, polymerase will extend by one base and release H+.
The release of H+ changes the pH of the solution in the well, and an ISFET
sensor at the bottom of the well measures the pH change and converts it
into electric signals [251]. The sensor avoids the use of optical measurements,
which require a complicated camera and laser. This is the main difference
between Ion Torrent sequencing and 454 sequencing. The unattached dNTP
molecules are washed out before the next iteration. By interpreting the flowgram
obtained from the ISFET sensor, we can recover the sequences of the
templates.
Since the method used by Ion Torrent is similar to that of Roche 454, it
also has the disadvantage that it cannot distinguish long homopolymers.
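The flow-by-flow process, and the reason homopolymers are hard, can be illustrated with an idealized simulation. This is a sketch under stated assumptions: `read` denotes the strand being synthesized, the flow order ACGT is assumed, and the signal is taken to be exactly the number of bases incorporated in a flow. Real signals are noisy, so a run of 7 and a run of 8 are difficult to tell apart.

```python
from itertools import cycle


def flowgram(read, flow_order="ACGT"):
    """Ideal, noise-free signal per flow: how many bases are incorporated
    when each dNTP type flows across the well."""
    if set(read) - set(flow_order):
        raise ValueError("read contains a base outside the flow order")
    signals = []
    pos = 0
    for base in cycle(flow_order):
        if pos >= len(read):
            break
        run = 0
        # A homopolymer run is incorporated entirely within a single flow.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def decode_flowgram(signals):
    """Rebuild the read from ideal (base, run length) signals."""
    return "".join(base * run for base, run in signals)


example = flowgram("AAGT")
```

In this ideal setting `decode_flowgram(flowgram(read))` returns the read; the difficulty in practice is that the measured signal for a long run is not a clean integer.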
     A  C  G  T
A    0  1  2  3
C    1  0  3  2
G    2  3  0  1
T    3  2  1  0
• Nanopore-sequencing technologies
of length up to 20,000 bp, with an average read length of about 10,000 bp.
Another advantage of PacBio RS is that it can sequence methylation status
simultaneously.
However, PacBio sequencing is more costly. It is about 3 to 4 times more
expensive than short read sequencing. Also, PacBio RS has a high error rate,
up to 17.9% errors [46]. The majority of the errors are indel errors [71]. Luckily,
the error rate is unbiased and almost constant throughout the entire read
length [146]. By repeatedly sequencing the same DNA template, we can reduce
the error rate.
flow through the pore continuously. As illustrated in Figure 1.9, DNA material
is placed in the top chamber. The positive charge draws a strand of DNA
from the top chamber to the bottom chamber through the
nanopore. By detecting the difference in the electrical conductivity in the
pore, the DNA sequence is decoded. (Note that IBM’s DNA transistor is a
prototype which uses a similar idea.)
The approach has difficulty in calling the individual base accurately. Instead,
Oxford nanopore technology will read the signal of k (say 5) bases
in each round. Then, using a hidden Markov model, the DNA sequence can be
decoded base by base.
Oxford nanopore technology has announced two sequencers: MinION and
GridION. MinION is a disposable USB-key sequencer. GridION is an expandable
sequencer. Oxford nanopore technology claimed that GridION can
sequence 30x coverage of a human genome in 6 days at US$2200 to $3600. It
has the potential to decode a DNA fragment of length 100,000 bp. Its cost is
about US$25 to $40 per gigabase. Although it is not expensive, the error rate is
about 17.8% (4.9% insertion error, 7.8% deletion error and 5.1% substitution
error) [115].
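As a rough consistency check of the quoted prices, the per-unit cost can be recomputed from the per-genome cost. This sketch assumes a human genome of roughly 3 gigabases (a figure not stated in the text) and reads the per-unit price as a cost per gigabase of sequence; the function name is hypothetical.

```python
GENOME_GIGABASES = 3.0  # assumed approximate size of the human genome


def cost_per_gigabase(total_cost, coverage, genome_gb=GENOME_GIGABASES):
    """Cost per gigabase sequenced, given the total cost of one run."""
    return total_cost / (coverage * genome_gb)


low = cost_per_gigabase(2200, 30)   # 2200 / 90
high = cost_per_gigabase(3600, 30)  # 3600 / 90
```

Under these assumptions, 30x coverage is about 90 gigabases, so the range of US$2200 to $3600 per genome corresponds to roughly US$24 to $40 per gigabase.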
Unlike Oxford nanopore technology, Genia suggested combining nanopore
and the DNA polymerase to sequence a single-strand DNA template. In Genia,
the DNA polymerase is tethered with a biological nanopore. When a DNA
template comes into contact with the DNA polymerase, DNA synthesis happens
with four engineered nucleotides for A, C, G and T, each attached with a
different short tag. When a nucleotide is incorporated into the DNA template,
the tag is cleaved and it will travel through the biological nanopore and an
electric signal is measured. Since different nucleotides have different tags, we
can reconstruct the DNA template by measuring the electric signals.
NABsys is another nanopore sequencer. It first chops the genome into DNA
fragments of length 100,000 bp. The DNA fragments are hybridized with a
particular probe so that specific short DNA sequences on the DNA fragments
are bound by the probes. Those DNA fragments with bound probes are
driven through a nanopore (see Figure 1.10(a)), creating a current-versus-time
tracing. The trace gives the positions of the probes on the fragment
(see Figure 1.10(b)). We can align the fragments based on their inter-probe
distances; then, we obtain a probe map for the genome (see Figure 1.10(c)).
We can obtain the probe maps for different probes. By aligning all of them,
we obtain the whole genome.
Unlike Genia, Oxford nanopore technology and the IBM DNA transistor,
NABsys does not require a very accurate current measurement from the
nanopore. The company claims that this method is cheap, fast and accurate,
and that its read length is long.
[Figure 1.11 plot: cost in US dollars on a logarithmic axis (from $0.01 to $10,000,000)
against time (Sep-01 to Sep-15), with two series: cost per Mb of DNA bases and cost
per genome.]
FIGURE 1.11: The sequencing cost over time. There are two curves. The
blue curve shows the sequencing cost per million sequenced bases while
the red curve shows the sequencing cost per human genome. (Data is obtained
from http://www.genome.gov/sequencingcosts.)
ing has been applied to many other research areas, including metagenomics,
3D modeling of the genome, etc.
1.9 Exercises
1. Consider the DNA sequence 5’-ACTCAGTTCG-3’. What is its reverse
complement? The SOLiD sequencer will output color-based sequences.
What is the expected color-based sequence for the above DNA sequence
and its reverse complement? Do you observe an interesting property?
2. Should we always use second- or third-generation sequencing instead of
first-generation sequencing? If not, when should we use Sanger sequencing?
Chapter 2
NGS file formats
2.1 Introduction
NGS technologies are widely used now. To facilitate NGS data analysis
and NGS data transfer, a few NGS file formats are defined. This chapter gives
an overview of these commonly used file formats.
First, we briefly describe the NGS data analysis process. After a sample is
sequenced using an NGS sequencer, some reads (i.e., raw DNA sequences) are
generated. These raw reads are stored in raw read files, which are in the fasta,
fastq, fasta.gz or fastq.gz format. Then, these raw reads are aligned on the
reference genome (such as the human genome). The alignments of the reads
are stored in alignment files, which are in the SAM or BAM format.
From these alignment files, downstream analysis can be performed to understand
the sample. For example, if the alignment files are used to call mutations
(like single nucleotide variants), we will obtain variant files, which are
in the VCF or BCF format. If the alignment files are used to obtain the read
density (like copy number of each genomic region), we will obtain density files,
which are in the Wiggle, BedGraph, BigWig or cWig format. If the alignment
files are used to define regions with read coverage (like regions with RNA
transcripts or regions with TF binding sites), we will obtain annotation files,
which are in the bed or bigBed format.
Figure 2.1 illustrates the relationships among these NGS file formats. In
the rest of the chapter, we detail these file formats.
[Figure 2.1 residue. Surviving labels: VCF or BCF; Wig, BedGraph, BigWig or cWig.]
ACTCAGCACCTTACGGCGTGCATCATCACATCAGGGACATACCAATACGGACAACCATCCCAAATATATTACGTTCTGAACGGCAGTACAAACT
(a)
ACTCAGCACCTTACGGCGTGCATCA
(b)
ACTCAGCACCTTACGGCGTGCATCA TACGTTCTGAACGGCAGTACAAACT
(c)
FIGURE 2.2: (a) An example DNA fragment. (b) A single-end read extracted
from the DNA fragment in (a). (c) A paired-end read extracted from
the DNA fragment in (a).
$Q = -10 \log_{10} P$.
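The formula and its inverse are easy to check numerically. The sketch below also decodes a quality string into Q values; it assumes the common encoding in which the character with ASCII code Q + 33 represents Q. That offset is an assumption that is not stated in this excerpt, and the function names are hypothetical.

```python
import math


def phred_from_error(p):
    """Q = -10 * log10(P), where P is the error probability of a base call."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Inverse of the formula above: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Convert a quality string to a list of Q values.

    The offset of 33 is an assumed encoding, not taken from the text.
    """
    return [ord(ch) - offset for ch in qual]


q30 = phred_from_error(0.001)
```

For instance, an error probability of 0.001 corresponds to Q = 30, and Q = 20 corresponds to an error probability of 0.01.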
(a)
>seq1
ACTCAGCACCTTACGGCGTGCATCA
>seq2
CCGTACCGTTGACAGATGGTTTACA
(b)
@seq1
ACTCAGCACCTTACGGCGTGCATCA
+seq1
!''**()%A54l;djgp0i345adn
@seq2
CCGTACCGTTGACAGATGGTTTACA
+seq2
#$SGl2j;askjpqo2i3nz!;lak
FIGURE 2.3: (a) is an example fasta file while (b) is an example fastq file.
Both (a) and (b) describe the same DNA sequences. Moreover, the fastq file
also stores the quality scores. Note that seq1 is the read obtained from the
DNA fragment in Figure 2.2(b).
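A fastq file in this simple layout can be read four lines at a time. The following is a minimal sketch, assuming that every record occupies exactly four lines (no wrapped sequences) and that blank lines can be ignored; `parse_fastq` is a hypothetical helper, not part of any tool named in this chapter.

```python
def parse_fastq(lines):
    """Return a list of (name, sequence, quality) tuples from FASTQ lines.

    Assumes the plain four-line record layout.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    records = []
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence and quality lengths differ for {header}")
        records.append((header[1:], seq, qual))
    return records


records = parse_fastq(["@seq1\n", "ACGT\n", "+seq1\n", "!!II\n"])
```

Reading by position rather than by leading character matters because a quality line may itself begin with `@`.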
[Figure 2.4 labels the parts of a read identifier: instrument name; flowcell lane;
tile number; x and y coordinate of the cluster within the tile; index number for a
multiplexed sample (#0 for no indexing).]
FIGURE 2.4: An example of the read identifier in the fasta or fastq file.
with + and (4) the sequence of quality scores for the read (which is of the
same length as the read).
The above description states the format for storing the single-end reads.
For paired-end reads, we use two fastq files (suffixed with 1.fastq and 2.fastq).
The two files have the same number of reads. The ith paired-end read is
formed by the ith read in the first file and the ith read in the second
file. Note that the ith read in the second file is the reverse complement.
For example, for the paired-end read in Figure 2.2(c), the sequence in the
first file is ACTCAGCACCTTACGGCGTGCATCA while the sequence in the second
file is AGTTTGTACTGCCGTTCAGAACGTA (which is the reverse complement of
TACGTTCTGAACGGCAGTACAAACT).
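The reverse complement in this example can be verified directly. A minimal sketch follows; the function name is hypothetical and only the four bases A, C, G and T are handled.

```python
# Translation table for complementing bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


second_read = reverse_complement("TACGTTCTGAACGGCAGTACAAACT")
```

Here `second_read` equals AGTTTGTACTGCCGTTCAGAACGTA, the sequence stored in the second file, and applying the function twice returns the original sequence.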
Given the raw read data, we usually need to perform a quality
check prior to further analysis. The quality check not only can tell
us the quality of the sequencing library, it also enables us to filter
reads that have low complexity and to trim some positions of
a read that have a high chance of sequencing error. A number of
tools can be used to check quality. They include SolexaQA [57], FastQC
(http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) and PRINSEQ [265].
1 2 3 4
1234567 8901234567890123456789012345678901234567
ref: ATCGAAC**TGACTGGACTAGAACCGTGAATCCACTGATCTAGACTCGA
+r1/1 TCG-ACGGTGACTG
+r2 GAAC-GTGACaactg
-r3 GCCTGGttcac
-r1/2 CCGTGAATC
+r3 GTGAAccaggc
+r4 ATCC……………………AGACTCGA
(a)
QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL TAG:TYPE:VAL
(b)
FIGURE 2.5: (a) An example of alignments of a few short reads. (b) The
SAM file for the corresponding alignments is shown in (a).
fields can be appended after these mandatory fields. For each field, if its value
is not available, we set it to *. We briefly describe the 11 mandatory fields
here.
• QNAME is the name of the aligned read. (In Figure 2.5, the read names
are r1, r2, r3, and r4. Note that r3 has two alignments. Hence, there
are two rows with r3. r1 is a paired-end read. Hence, there are two rows
for the two reads of r1.)
• CIGAR is a string that describes the alignment between the read and
the reference sequence (see Section 2.3.2 for detail).
• MRNM is the name of the reference sequence of its mate read (“=” if it
is the same as RNAME). For a single-end read, its value is “*” by default.
• MPOS is the leftmost position of its mate read. For a single-end read,
its value is zero by default.
• TLEN is the inferred signed observed template length (i.e., the inferred
insert size for a paired-end read). For a single-end read, its value is zero
by default.
Figure 2.5(b) gives the SAM file for the example alignments in Figure 2.5(a).
The following subsections give more information for FLAG and
CIGAR.
Note that SAM and BAM use two different coordinate systems. SAM uses
the 1-based coordinate system. The first base of a sequence is 1. A region is
specified by a closed interval. For example, the region between the 3rd base
and the 6th base inclusive is [3, 6]. BAM uses the 0-based coordinate system.
The first base of a sequence is 0. A region is specified by a half-closed, half-open
interval. For example, the region between the 3rd base and the 6th base
inclusive is [2, 6).
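Converting between the two conventions only changes the start coordinate: a 1-based closed interval [s, e] covers the same bases as the 0-based half-open interval [s - 1, e). A minimal sketch, with hypothetical function names:

```python
def closed_1based_to_halfopen_0based(start, end):
    """[start, end] with 1-based coordinates -> [start - 1, end) with 0-based."""
    return start - 1, end


def halfopen_0based_to_closed_1based(start, end):
    """[start, end) with 0-based coordinates -> [start + 1, end] with 1-based."""
    return start + 1, end


# The 3rd to 6th bases inclusive, in both conventions.
zero_based = closed_1based_to_halfopen_0based(3, 6)
```

A useful property of the half-open form is that the interval length is simply end minus start, which is 4 for the 3rd to 6th bases.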
To manipulate the information from SAM and BAM files, we use samtools [169]
and bamtools [16]. Samtools is a set of tools that allows us to
interactively view the alignments in both SAM and BAM files. It also allows
us to post-process and extract information from both SAM and BAM
files. Bamtools provides a C++ API for us to access and manage the
alignments in the BAM file.
2.3.1 FLAG
The bitwise FLAG is a 16-bit integer describing the characteristics of the
alignment. Its binary format is b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0. Only
bits b0, ..., b11 are used by SAM. The meaning of each bit bi is described in
Table 2.1.
For example, for the first read of r1 in Figure 2.5(b), its flag is 99 =
0000000001100011. b0, b1, b5 and b6 are ones. This means that it has multiple
segments (b0), the segment is properly aligned (b1), the next segment maps
on the reverse complement (b5), and this is the first read in the pair (b6).
TABLE 2.1: The meaning of the twelve bits in the FLAG field.
Bit   Description
b0    = 1 if the read is paired
b1    = 1 if the segment is a proper pair
b2    = 1 if the segment is unmapped
b3    = 1 if the mate is unmapped
b4    = 1 if the read maps on the reverse complement
b5    = 1 if the mate maps on the reverse complement
b6    = 1 if this is the first read in the pair
b7    = 1 if this is the second read in the pair
b8    = 1 if this is a secondary alignment (i.e., the alternative alignment when multiple alignments exist)
b9    = 1 if this alignment does not pass the quality control
b10   = 1 if it is a PCR duplicate
b11   = 1 if it is a supplementary alignment (part of a chimeric alignment)
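Table 2.1 translates directly into a lookup by bit position. The following sketch decodes a FLAG value into the descriptions of its set bits; the list wording paraphrases the table and the function name is hypothetical.

```python
# Index i of this list describes bit b_i of the FLAG field (Table 2.1).
FLAG_BITS = [
    "read is paired",
    "segment is a proper pair",
    "segment is unmapped",
    "mate is unmapped",
    "read maps on the reverse complement",
    "mate maps on the reverse complement",
    "first read in the pair",
    "second read in the pair",
    "secondary alignment",
    "fails quality control",
    "PCR duplicate",
    "supplementary alignment",
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in flag."""
    return [desc for i, desc in enumerate(FLAG_BITS) if flag & (1 << i)]


meaning_of_99 = decode_flag(99)  # 99 = 64 + 32 + 2 + 1
```

For the value 99 used in the example, the set bits are b0, b1, b5 and b6.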
Op   Description
M    alignment match (can be a sequence match or mismatch)
I    insertion to the reference
D    deletion from the reference
N    skipped region from the reference
S    soft clipping (clipped sequences present in SEQ)
H    hard clipping (clipped sequences NOT present in SEQ)
P    padding (silent deletion from padded reference)
=    sequence match
X    sequence mismatch
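A CIGAR string is a sequence of (length, operation) pairs built from the operations in the table. The sketch below parses one and computes how many reference bases and how many read bases it covers. The string `3M1D2M2I6M` is an illustrative input, and the grouping of operations that consume the reference (M, D, N, =, X) or the read (M, I, S, =, X) follows the SAM specification rather than this excerpt.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar):
    """Number of read bases (as stored in SEQ) described by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


ops = parse_cigar("3M1D2M2I6M")
```

For this input the alignment spans 12 reference bases (3 + 1 + 2 + 6) and describes 13 read bases (3 + 2 + 2 + 6).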
Bed file:
FIGURE 2.6: The gene VHL has two splicing variants. One of them has 3
exons while the other one has 2 exons (is missing the middle exon). The solid
bars are the exons. The thick solid bars are the coding regions of VHL. The
bed file corresponding to these two isoforms is shown at the bottom of the
figure.
(b)
##fileformat=VCFv4.2
##fileDate=20110705
##source=VCFtools
##reference=NCBI36
##ALT=<ID=DEL,Description="Deletion">
##FILTER=<ID=q10,Description="Quality below 10">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality (phred score)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
1 8 . G C . PASS . GT:DP 1/1:3 0/0:2
1 25 . T TAA,TT . q10 . GT:DP 1/1:1 1/2:3
1 40 . TGAC T . PASS . GT:GQ 1/1:50 0/0:70
1 55 . T <DEL> . PASS SVTYPE=DEL;END=69 GT 1/1 1/1
FIGURE 2.7: Consider two samples. Each sample has a set of length-12
reads. (a) shows the alignment of those reads on the reference genome.
There are 4 regions with genomic variations. At position 8, an SNV appears in
sample 1. At position 25, two different insertions appear in both samples. At
position 40, a deletion appears in sample 1. At position 55, a deletion appears
in both samples. (b) shows the VCF file that describes the genomic
variations in the 4 loci.
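To make the genotype columns concrete, the following sketch parses one data line of the VCF file above and lists the alleles a sample carries. It is a simplified reader under stated assumptions: fields are split on any whitespace (real VCF files are tab-delimited), the header lines are not parsed, and the function names are hypothetical.

```python
import re


def parse_vcf_record(line, sample_names):
    """Parse one VCF data line into a dictionary (simplified)."""
    fields = line.split()
    if len(fields) != 9 + len(sample_names):
        raise ValueError("unexpected number of columns")
    chrom, pos, _id, ref, alt, _qual, filt, info, fmt = fields[:9]
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "alt": alt.split(","), "filter": filt, "info": info,
            "samples": samples}


def alleles_of(record, sample):
    """Translate the GT indices of a sample into allele strings.

    Index 0 is REF; index k >= 1 is the k-th ALT allele.
    """
    alleles = [record["ref"]] + record["alt"]
    indices = re.split(r"[/|]", record["samples"][sample]["GT"])
    return [alleles[int(i)] for i in indices if i != "."]


record = parse_vcf_record("1 25 . T TAA,TT . q10 . GT:DP 1/1:1 1/2:3",
                          ["Sample1", "Sample2"])
```

For the record at position 25, the genotype 1/2 of Sample2 means that it carries both alternative alleles, TAA and TT, while the genotype 1/1 of Sample1 means two copies of TAA.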
[Figure: read-density plot. Vertical axis: density (10 to 40); horizontal axis: position (1 to 90).]
and bcftools [163]. (Note that VCF files use the 1-based coordinate system
while BCF files use the 0-based coordinate system.)
which spans s bases. The following box gives the wiggle file for Figure 2.8.
(Note that the wiggle file uses a 1-based coordinate system.)
chr1 9 14 10
chr1 14 19 20
chr1 19 24 30
chr1 24 29 20
chr1 29 34 30
chr1 34 39 20
chr1 39 44 10
chr1 49 54 20
chr1 59 64 30
chr1 64 69 40
chr1 69 74 20
chr1 79 84 40
However, wiggle and bedGraph are uncompressed text formats and they
are usually huge. Hence, when we want to represent a genome-wide density
profile, we usually use the compressed version, which is in the bigWig format [135]
or cWig format [112]. Both bigWig and cWig support efficient
queries over any selected interval. Precisely, they support 4 operations. Let
the density values of the N bases be $r_1, \ldots, r_N$. For any chromosome interval
p..q, the four operations are:
• mean(p, q): The mean of the density values in p..q, that is, $\frac{1}{q-p+1} \sum_{i=p}^{q} r_i$.
• minVal(p, q) and maxVal(p, q): The minimum and maximum of the
density values in p..q, that is, $\min_{i=p..q} r_i$ and $\max_{i=p..q} r_i$.
• stdev(p, q): The standard deviation of the density values in p..q, that
is, $\sqrt{\frac{1}{q-p+1} \sum_{i=p}^{q} (r_i - \mathrm{mean}(p, q))^2}$.
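These interval statistics can be written as a naive reference implementation over a per-base list of density values. The sketch assumes a 1-based closed interval p..q and a plain Python list in which `density[i - 1]` holds the value of base i. It scans the interval linearly, so it shows only what the operations compute, not how bigWig or cWig answer them efficiently.

```python
import math


def interval_stats(density, p, q):
    """Mean, minimum, maximum and standard deviation of the density
    values of bases p..q (1-based, inclusive)."""
    if not 1 <= p <= q <= len(density):
        raise ValueError("interval p..q is out of range")
    values = density[p - 1:q]
    n = len(values)  # equals q - p + 1
    mean = sum(values) / n
    variance = sum((r - mean) ** 2 for r in values) / n
    return {
        "mean": mean,
        "minVal": min(values),
        "maxVal": max(values),
        "stdev": math.sqrt(variance),
    }


stats = interval_stats([1, 2, 3, 4, 5], 2, 4)  # values 2, 3, 4
```

For the three values 2, 3 and 4, the mean is 3, the minimum is 2, the maximum is 4, and the standard deviation is the square root of 2/3.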
2.7 Exercises
1. Suppose chromosome 2 is the following sequence ACACGACTAA . . ..
(a) Convert the following set of intervals from the 0-based coordinate
format to the 1-based coordinate format: 3..100, 0..89 and
1000..2000.
(b) Convert the following set of intervals from the 1-based coordinate
format to the 0-based coordinate format: 3..100, 1..89 and
1000..2000.
3. Given a BAM file input.bam, we want to find all alignments with MAPQ >
0 using samtools. What should be the command?
4. Given two BED files input1.bed and input2.bed, we want to find all
genomic regions in input1.bed that overlap with some genomic regions
in input2.bed. What should be the command?
5. For the following wiggle file, can you compute coverage(3, 8),
mean(3, 8), minVal(3, 8), maxVal(3, 8) and stdev(3, 8)?
6. Can you propose a script to convert a BAM file into a bigWig file?
Chapter 3
Related algorithms and data structures
3.1 Introduction
This chapter discusses various algorithmic techniques and data structures
used in this book. They include:
• Recursion and dynamic programming (Section 3.2)
• Parameter estimation and the EM algorithm (Section 3.3)
• Hash data structures (Section 3.4)
• Full-text index (Section 3.5)
• Data compression techniques (Section 3.6)
The Project Gutenberg eBook of Mrs Dalloway
in Bond Street
Language: English
DIAL
VOLUME LXXV
BY VIRGINIA WOOLF
Mrs Dalloway said she would buy the gloves herself.
Big Ben was striking as she stepped out into the street. It was
eleven o'clock and the unused hour was fresh as if issued to children
on a beach. But there was something solemn in the deliberate swing
of the repeated strokes; something stirring in the murmur of wheels
and the shuffle of footsteps.
No doubt they were not all bound on errands of happiness. There is
much more to be said about us than that we walk the streets of
Westminster. Big Ben too is nothing but steel rods consumed by rust
were it not for the care of H. M.'s Office of Works. Only for Mrs
Dalloway the moment was complete; for Mrs Dalloway June was
fresh. A happy childhood—and it was not to his daughters only that
Justin Parry had seemed a fine fellow (weak of course on the
Bench); flowers at evening, smoke rising; the caw of rooks falling
from ever so high, down down through the October air—there is
nothing to take the place of childhood. A leaf of mint brings it back;
or a cup with a blue ring.
Poor little wretches, she sighed, and pressed forward. Oh, right
under the horses' noses, you little demon! and there she was left on
the kerb stretching her hand out, while Jimmy Dawes grinned on the
further side.
A charming woman, poised, eager, strangely white-haired for her
pink cheeks, so Scope Purvis, C. B., saw her as he hurried to his
office. She stiffened a little, waiting for Durtnall's van to pass. Big
Ben struck the tenth; struck the eleventh stroke. The leaden circles
dissolved in the air. Pride held her erect, inheriting, handing on,
acquainted with discipline and with suffering. How people suffered,
how they suffered, she thought, thinking of Mrs Foxcroft at the
Embassy last night decked with jewels, eating her heart out,
because that nice boy was dead, and now the old Manor House
(Durtnall's van passed) must go to a cousin.
"Good morning to you!" said Hugh Whitbread raising his hat rather
extravagantly by the china shop, for they had known each other as
children. "Where are you off to?"
"I love walking in London" said Mrs Dalloway. "Really it's better than
walking in the country!"
"We've just come up" said Hugh Whitbread. "Unfortunately to see
doctors."
"Milly?" said Mrs Dalloway, instantly compassionate.
"Out of sorts," said Hugh Whitbread. "That sort of thing. Dick all
right?"
"First rate!" said Clarissa.
Of course, she thought, walking on, Milly is about my age—fifty—
fifty-two. So it is probably that, Hugh's manner had said so, said it
perfectly—dear old Hugh, thought Mrs Dalloway, remembering with
amusement, with gratitude, with emotion, how shy, like a brother—
one would rather die than speak to one's brother—Hugh had always
been, when he was at Oxford, and came over, and perhaps one of
them (drat the thing!) couldn't ride. How then could women sit in
Parliament? How could they do things with men? For there is this
extraordinarily deep instinct, something inside one; you can't get
over it; it's no use trying; and men like Hugh respect it without our
saying it, which is what one loves, thought Clarissa, in dear old
Hugh.
She had passed through the Admiralty Arch and saw at the end of
the empty road with its thin trees Victoria's white mound, Victoria's
billowing motherliness, amplitude and homeliness, always ridiculous,
yet how sublime, thought Mrs Dalloway, remembering Kensington
Gardens and the old lady in horn spectacles and being told by Nanny
to stop dead still and bow to the Queen. The flag flew above the
Palace. The King and Queen were back then. Dick had met her at
lunch the other day—a thoroughly nice woman. It matters so much
to the poor, thought Clarissa, and to the soldiers. A man in bronze
stood heroically on a pedestal with a gun on her left hand side—the
South African war. It matters, thought Mrs Dalloway walking towards
Buckingham Palace. There it stood four-square, in the broad
sunshine, uncompromising, plain. But it was character she thought;
something inborn in the race; what Indians respected. The Queen
went to hospitals, opened bazaars—the Queen of England, thought
Clarissa, looking at the Palace. Already at this hour a motor car
passed out at the gates; soldiers saluted; the gates were shut. And
Clarissa, crossing the road, entered the Park, holding herself upright.
June had drawn out every leaf on the trees. The mothers of
Westminster with mottled breasts gave suck to their young. Quite
respectable girls lay stretched on the grass. An elderly man, stooping
very stiffly, picked up a crumpled paper, spread it out flat and flung it
away. How horrible! Last night at the Embassy Sir Dighton had said
"If I want a fellow to hold my horse, I have only to put up my hand."
But the religious question is far more serious than the economic, Sir
Dighton had said, which she thought extraordinarily interesting, from
a man like Sir Dighton. "Oh, the country will never know what it has
lost" he had said, talking, of his own accord, about dear Jack
Stewart.
She mounted the little hill lightly. The air stirred with energy.
Messages were passing from the Fleet to the Admiralty. Piccadilly
and Arlington Street and the Mall seemed to chafe the very air in the
Park and lift its leaves hotly, brilliantly, upon waves of that divine
vitality which Clarissa loved. To ride; to dance; she had adored all
that. Or going long walks in the country, talking, about books, what
to do with one's life, for young people were amazingly priggish—oh,
the things one had said! But one had conviction. Middle age is the
devil. People like Jack'll never know that, she thought; for he never
once thought of death, never, they said, knew he was dying. And
now can never mourn—how did it go?—a head grown grey. . . .
From the contagion of the world's slow stain . . . have drunk their
cup a round or two before. . . . From the contagion of the world's
slow stain! She held herself upright.
But how Jack would have shouted! Quoting Shelley, in Piccadilly!
"You want a pin," he would have said. He hated frumps. "My God
Clarissa! My God Clarissa!"—she could hear him now at the
Devonshire House party, about poor Sylvia Hunt in her amber
necklace and that dowdy old silk. Clarissa held herself upright for
she had spoken aloud and now she was in Piccadilly, passing the
house with the slender green columns, and the balconies; passing
club windows full of newspapers; passing old Lady Burdett Coutts'
house where the glazed white parrot used to hang; and Devonshire
House, without its gilt leopards; and Claridge's, where she must
remember Dick wanted her to leave a card on Mrs Jepson or she
would be gone. Rich Americans can be very charming. There was St
James palace; like a child's game with bricks; and now—she had
passed Bond Street—she was by Hatchard's book shop. The stream
was endless—endless—endless. Lords, Ascot, Hurlingham—what was
it? What a duck, she thought, looking at the frontispiece of some
book of memoirs spread wide in the bow window, Sir Joshua
perhaps or Romney; arch, bright, demure; the sort of girl—like her
own Elizabeth—the only real sort of girl. And there was that absurd
book, Soapy Sponge, which Jim used to quote by the yard; and
Shakespeare's Sonnets. She knew them by heart. Phil and she had
argued all day about the Dark Lady, and Dick had said straight out at
dinner that night that he had never heard of her. Really, she had
married him for that! He had never read Shakespeare! There must
be some little cheap book she could buy for Milly—Cranford of
course! Was there ever anything so enchanting as the cow in
petticoats? If only people had that sort of humour, that sort of self-
respect now, thought Clarissa, for she remembered the broad pages;
the sentences ending; the characters—how one talked about them
as if they were real. For all the great things one must go to the past,
she thought. From the contagion of the world's slow stain. . . . Fear
no more the heat o' the sun. . . . And now can never mourn, can
never mourn, she repeated, her eyes straying over the window; for it
ran in her head; the test of great poetry; the moderns had never
written anything one wanted to read about death, she thought; and
turned.
Omnibuses joined motor cars; motor cars vans; vans taxicabs;
taxicabs motor cars—here was an open motor car with a girl, alone.
Up till four, her feet tingling, I know, thought Clarissa, for the girl
looked washed out, half asleep, in the corner of the car after the
dance. And another car came; and another. No! No! No! Clarissa
smiled good-naturedly. The fat lady had taken every sort of trouble,
but diamonds! orchids! at this hour of the morning! No! No! No! The
excellent policeman would, when the time came, hold up his hand.
Another motor car passed. How utterly unattractive! Why should a
girl of that age paint black round her eyes? And a young man, with a
girl, at this hour, when the country—The admirable policeman raised
his hand and Clarissa acknowledging his sway, taking her time,
crossed, walked towards Bond Street; saw the narrow crooked
street, the yellow banners; the thick notched telegraph wires
stretched across the sky.
A hundred years ago her great-great-grandfather, Seymour Parry,
who ran away with Conway's daughter, had walked down Bond
Street. Down Bond Street the Parrys had walked for a hundred
years, and might have met the Dalloways (Leighs on the mother's
side) going up. Her father got his clothes from Hill's. There was a roll
of cloth in the window, and here just one jar on a black table,
incredibly expensive; like the thick pink salmon on the ice block at
the fishmonger's. The jewels were exquisite—pink and orange stars,
paste, Spanish, she thought, and chains of old gold; starry buckles,
little brooches which had been worn on sea green satin by ladies
with high head-dresses. But no good looking! One must economize.
She must go on past the picture dealer's where one of the odd
French pictures hung, as if people had thrown confetti—pink and
blue—for a joke. If you had lived with pictures (and it's the same
with books and music) thought Clarissa, passing the Aeolian Hall,
you can't be taken in by a joke.
The river of Bond Street was clogged. There, like a Queen at a
tournament, raised, regal, was Lady Bexborough. She sat in her
carriage, upright, alone, looking through her glasses. The white
glove was loose at her wrist. She was in black, quite shabby, yet,
thought Clarissa, how extraordinarily it tells, breeding, self-respect,
never saying a word too much or letting people gossip; an
astonishing friend; no one can pick a hole in her after all these
years, and now, there she is, thought Clarissa, passing the Countess
who waited powdered, perfectly still, and Clarissa would have given
anything to be like that, the mistress of Clarefield, talking politics,
like a man. But she never goes anywhere, thought Clarissa, and it's
quite useless to ask her, and the carriage went on and Lady
Bexborough was borne past like a Queen at a tournament, though
she had nothing to live for and the old man is failing and they say
she is sick of it all, thought Clarissa and the tears actually rose to her
eyes as she entered the shop.
"Good morning" said Clarissa in her charming voice. "Gloves" she
said with her exquisite friendliness and putting her bag on the
counter began, very slowly, to undo the buttons. "White gloves" she
said. "Above the elbow" and she looked straight into the
shopwoman's face—but this was not the girl she remembered? She
looked quite old. "These really don't fit" said Clarissa. The shop girl
looked at them. "Madame wears bracelets?" Clarissa spread out her
fingers. "Perhaps it's my rings." And the girl took the grey gloves
with her to the end of the counter.
Yes, thought Clarissa, if it's the girl I remember she's twenty years
older. . . . There was only one other customer, sitting sideways at the
counter, her elbow poised, her bare hand drooping, vacant; like a
figure on a Japanese fan, thought Clarissa, too vacant perhaps, yet
some men would adore her. The lady shook her head sadly. Again
the gloves were too large. She turned round the glass. "Above the
wrist" she reproached the grey-headed woman; who looked and
agreed.
They waited; a clock ticked; Bond Street hummed, dulled, distant;
the woman went away holding gloves. "Above the wrist" said the
lady, mournfully, raising her voice. And she would have to order
chairs, ices, flowers, and cloak-room tickets, thought Clarissa. The
people she didn't want would come; the others wouldn't. She would
stand by the door. They sold stockings—silk stockings. A lady is
known by her gloves and her shoes, old Uncle William used to say.
And through the hanging silk stockings quivering silver she looked at
the lady, sloping shouldered, her hand drooping, her bag slipping,
her eyes vacantly on the floor. It would be intolerable if dowdy
women came to her party! Would one have liked Keats if he had
worn red socks? Oh, at last—she drew into the counter and it
flashed into her mind:
"Do you remember before the war you had gloves with pearl
buttons?"
"French gloves, Madame?"
"Yes, they were French" said Clarissa. The other lady rose very sadly
and took her bag, and looked at the gloves on the counter. But they
were all too large—always too large at the wrist.
"With pearl buttons" said the shop-girl, who looked ever so much
older. She split the lengths of tissue paper apart on the counter. With
pearl buttons, thought Clarissa, perfectly simple—how French!
"Madame's hands are so slender" said the shop girl, drawing the
glove firmly, smoothly, down over her rings. And Clarissa looked at
her arm in the looking glass. The glove hardly came to the elbow.
Were there others half an inch longer? Still it seemed tiresome to
bother her—perhaps the one day in the month, thought Clarissa,
when it's an agony to stand. "Oh, don't bother" she said. But the
gloves were brought.
"Don't you get fearfully tired" she said in her charming voice,
"standing? When d'you get your holiday?"
"In September, Madame, when we're not so busy."
When we're in the country thought Clarissa. Or shooting. She has a
fortnight at Brighton. In some stuffy lodging. The landlady takes the
sugar. Nothing would be easier than to send her to Mrs Lumley's
right in the country (and it was on the tip of her tongue). But then
she remembered how on their honeymoon Dick had shown her the
folly of giving impulsively. It was much more important, he said, to
get trade with China. Of course he was right. And she could feel the
girl wouldn't like to be given things. There she was in her place. So
was Dick. Selling gloves was her job. She had her own sorrows quite
separate, "and now can never mourn, can never mourn" the words
ran in her head, "From the contagion of the world's slow stain"
thought Clarissa holding her arm stiff, for there are moments when it
seems utterly futile (the glove was drawn off leaving her arm flecked
with powder)—simply one doesn't believe, thought Clarissa, any
more in God.
The traffic suddenly roared; the silk stockings brightened. A
customer came in.
"White gloves," she said, with some ring in her voice that Clarissa
remembered.
It used, thought Clarissa, to be so simple. Down down through the
air came the caw of the rooks. When Sylvia died, hundreds of years
ago, the yew hedges looked so lovely with the diamond webs in the
mist before early church. But if Dick were to die to-morrow as for
believing in God—no, she would let the children choose, but for
herself, like Lady Bexborough, who opened the bazaar, they say, with
the telegram in her hand—Roden, her favourite, killed—she would
go on. But why, if one doesn't believe? For the sake of others, she
thought, taking the glove in her hand. This girl would be much more
unhappy if she didn't believe.
"Thirty shillings" said the shopwoman. "No, pardon me Madame,
thirty-five. The French gloves are more."
For one doesn't live for oneself, thought Clarissa.
And then the other customer took a glove, tugged it, and it split.
"There!" she exclaimed.
"A fault of the skin," said the grey-headed woman hurriedly.
"Sometimes a drop of acid in tanning. Try this pair, Madame."
"But it's an awful swindle to ask two pound ten!"
Clarissa looked at the lady; the lady looked at Clarissa.
"Gloves have never been quite so reliable since the war" said the
shop-girl, apologizing, to Clarissa.
But where had she seen the other lady?—elderly, with a frill under
her chin; wearing a black ribbon for gold eyeglasses; sensual, clever,
like a Sargent drawing. How one can tell from a voice when people
are in the habit, thought Clarissa, of making other people—"It's a
shade too tight" she said—obey. The shopwoman went off again.
Clarissa was left waiting. Fear no more she repeated, playing her
finger on the counter. Fear no more the heat o' the sun. Fear no
more she repeated. There were little brown spots on her arm. And
the girl crawled like a snail. Thou thy worldly task hast done.
Thousands of young men had died that things might go on. At last!
Half an inch above the elbow; pearl buttons; five and a quarter. My
dear slow coach, thought Clarissa, do you think I can sit here the
whole morning? Now you'll take twenty-five minutes to bring me my
change!
There was a violent explosion in the street outside. The shopwomen
cowered behind the counters. But Clarissa, sitting very up-right,
smiled at the other lady. "Miss Anstruther!" she exclaimed.