ALGORITHMS FOR
NEXT-GENERATION SEQUENCING

Wing-Kin Sung
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170421

International Standard Book Number-13: 978-1-4665-6550-0 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

1 Introduction
  1.1 DNA, RNA, protein and cells
  1.2 Sequencing technologies
  1.3 First-generation sequencing
  1.4 Second-generation sequencing
    1.4.1 Template preparation
    1.4.2 Base calling
    1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides
    1.4.4 Polymerase-mediated methods based on unmodified nucleotides
    1.4.5 Ligase-mediated method
  1.5 Third-generation sequencing
    1.5.1 Single-molecule real-time sequencing
    1.5.2 Nanopore sequencing method
    1.5.3 Direct imaging of DNA using electron microscopy
  1.6 Comparison of the three generations of sequencing
  1.7 Applications of sequencing
  1.8 Summary and further reading
  1.9 Exercises

2 NGS file formats
  2.1 Introduction
  2.2 Raw data files: fasta and fastq
  2.3 Alignment files: SAM and BAM
    2.3.1 FLAG
    2.3.2 CIGAR string
  2.4 Bed format
  2.5 Variant Call Format (VCF)
  2.6 Format for representing density data
  2.7 Exercises

3 Related algorithms and data structures
  3.1 Introduction
  3.2 Recursion and dynamic programming
    3.2.1 Key searching problem
    3.2.2 Edit-distance problem
  3.3 Parameter estimation
    3.3.1 Maximum likelihood
    3.3.2 Unobserved variable and EM algorithm
  3.4 Hash data structures
    3.4.1 Maintain an associative array by simple hashing
    3.4.2 Maintain a set using a Bloom filter
    3.4.3 Maintain a multiset using a counting Bloom filter
    3.4.4 Estimating the similarity of two sets using minHash
  3.5 Full-text index
    3.5.1 Suffix trie and suffix tree
    3.5.2 Suffix array
    3.5.3 FM-index
      3.5.3.1 Inverting the BWT B to the original text T
      3.5.3.2 Simulate a suffix array using the FM-index
      3.5.3.3 Pattern matching
    3.5.4 Simulate a suffix trie using the FM-index
    3.5.5 Bi-directional BWT
  3.6 Data compression techniques
    3.6.1 Data compression and entropy
    3.6.2 Unary, gamma, and delta coding
    3.6.3 Golomb code
    3.6.4 Huffman coding
    3.6.5 Arithmetic code
    3.6.6 Order-k Markov Chain
    3.6.7 Run-length encoding
  3.7 Exercises

4 NGS read mapping
  4.1 Introduction
  4.2 Overview of the read mapping problem
    4.2.1 Mapping reads with no quality score
    4.2.2 Mapping reads with a quality score
    4.2.3 Brute-force solution
    4.2.4 Mapping quality
    4.2.5 Challenges
  4.3 Align reads allowing a small number of mismatches
    4.3.1 Mismatch seed hashing approach
    4.3.2 Read hashing with a spaced seed
    4.3.3 Reference hashing approach
    4.3.4 Suffix trie-based approaches
      4.3.4.1 Estimating the lower bound of the number of mismatches
      4.3.4.2 Divide and conquer with the enhanced pigeonhole principle
      4.3.4.3 Aligning a set of reads together
      4.3.4.4 Speed up utilizing the quality score
  4.4 Aligning reads allowing a small number of mismatches and indels
    4.4.1 q-mer approach
    4.4.2 Computing alignment using a suffix trie
      4.4.2.1 Computing the edit distance using a suffix trie
      4.4.2.2 Local alignment using a suffix trie
  4.5 Aligning reads in general
    4.5.1 Seed-and-extension approach
      4.5.1.1 BWA-SW
      4.5.1.2 Bowtie 2
      4.5.1.3 BatAlign
      4.5.1.4 Cushaw2
      4.5.1.5 BWA-MEM
      4.5.1.6 LAST
    4.5.2 Filter-based approach
  4.6 Paired-end alignment
  4.7 Further reading
  4.8 Exercises

5 Genome assembly
  5.1 Introduction
  5.2 Whole genome shotgun sequencing
    5.2.1 Whole genome sequencing
    5.2.2 Mate-pair sequencing
  5.3 De novo genome assembly for short reads
    5.3.1 Read error correction
      5.3.1.1 Spectral alignment problem (SAP)
      5.3.1.2 k-mer counting
    5.3.2 Base-by-base extension approach
    5.3.3 De Bruijn graph approach
      5.3.3.1 De Bruijn assembler (no sequencing error)
      5.3.3.2 De Bruijn assembler (with sequencing errors)
      5.3.3.3 How to select k
      5.3.3.4 Additional issues of the de Bruijn graph approach
    5.3.4 Scaffolding
    5.3.5 Gap filling
  5.4 Genome assembly for long reads
    5.4.1 Assemble long reads assuming long reads have a low sequencing error rate
    5.4.2 Hybrid approach
      5.4.2.1 Use mate-pair reads and long reads to improve the assembly from short reads
      5.4.2.2 Use short reads to correct errors in long reads
    5.4.3 Long read approach
      5.4.3.1 MinHash for all-versus-all pairwise alignment
      5.4.3.2 Computing consensus using Falcon Sense
      5.4.3.3 Quiver consensus algorithm
  5.5 How to evaluate the goodness of an assembly
  5.6 Discussion and further reading
  5.7 Exercises

6 Single nucleotide variation (SNV) calling
  6.1 Introduction
    6.1.1 What are SNVs and small indels?
    6.1.2 Somatic and germline mutations
  6.2 Determine variations by resequencing
    6.2.1 Exome/targeted sequencing
    6.2.2 Detection of somatic and germline variations
  6.3 Single locus SNV calling
    6.3.1 Identifying SNVs by counting alleles
    6.3.2 Identify SNVs by binomial distribution
    6.3.3 Identify SNVs by Poisson-binomial distribution
    6.3.4 Identifying SNVs by the Bayesian approach
  6.4 Single locus somatic SNV calling
    6.4.1 Identify somatic SNVs by the Fisher exact test
    6.4.2 Identify somatic SNVs by verifying that the SNVs appear in the tumor only
      6.4.2.1 Identify SNVs in the tumor sample by posterior odds ratio
      6.4.2.2 Verify if an SNV is somatic by the posterior odds ratio
  6.5 General pipeline for calling SNVs
  6.6 Local realignment
  6.7 Duplicate read marking
  6.8 Base quality score recalibration
  6.9 Rule-based filtering
  6.10 Computational methods to identify small indels
    6.10.1 Split-read approach
    6.10.2 Span distribution-based clustering approach
    6.10.3 Local assembly approach
  6.11 Correctness of existing SNV and indel callers
  6.12 Further reading
  6.13 Exercises

7 Structural variation calling
  7.1 Introduction
  7.2 Formation of SVs
  7.3 Clinical effects of structural variations
  7.4 Methods for determining structural variations
  7.5 CNV calling
    7.5.1 Computing the raw read count
    7.5.2 Normalize the read counts
    7.5.3 Segmentation
  7.6 SV calling pipeline
    7.6.1 Insert size estimation
  7.7 Classifying the paired-end read alignments
  7.8 Identifying candidate SVs from paired-end reads
    7.8.1 Clustering approach
      7.8.1.1 Clique-finding approach
      7.8.1.2 Confidence interval overlapping approach
      7.8.1.3 Set cover approach
      7.8.1.4 Performance of the clustering approach
    7.8.2 Split-mapping approach
    7.8.3 Assembly approach
    7.8.4 Hybrid approach
  7.9 Verify the SVs
  7.10 Further reading
  7.11 Exercises

8 RNA-seq
  8.1 Introduction
  8.2 High-throughput methods to study the transcriptome
  8.3 Application of RNA-seq
  8.4 Computational Problems of RNA-seq
  8.5 RNA-seq read mapping
    8.5.1 Features used in RNA-seq read mapping
      8.5.1.1 Transcript model
      8.5.1.2 Splice junction signals
    8.5.2 Exon-first approach
    8.5.3 Seed-and-extend approach
  8.6 Construction of isoforms
  8.7 Estimating expression level of each transcript
    8.7.1 Estimating transcript abundances when every read maps to exactly one transcript
    8.7.2 Estimating transcript abundances when a read maps to multiple isoforms
    8.7.3 Estimating gene abundance
  8.8 Summary and further reading
  8.9 Exercises

9 Peak calling methods
  9.1 Introduction
  9.2 Techniques that generate density-based datasets
    9.2.1 Protein DNA interaction
    9.2.2 Epigenetics of our genome
    9.2.3 Open chromatin
  9.3 Peak calling methods
    9.3.1 Model fragment length
    9.3.2 Modeling noise using a control library
    9.3.3 Noise in the sample library
    9.3.4 Determination if a peak is significant
    9.3.5 Unannotated high copy number regions
    9.3.6 Constructing a signal profile by Kernel methods
  9.4 Sequencing depth of the ChIP-seq libraries
  9.5 Further reading
  9.6 Exercises

10 Data compression techniques used in NGS files
  10.1 Introduction
  10.2 Strategies for compressing fasta/fastq files
  10.3 Techniques to compress identifiers
  10.4 Techniques to compress DNA bases
    10.4.1 Statistical-based approach
    10.4.2 BWT-based approach
    10.4.3 Reference-based approach
    10.4.4 Assembly-based approach
  10.5 Quality score compression methods
    10.5.1 Lossless compression
    10.5.2 Lossy compression
  10.6 Compression of other NGS data
  10.7 Exercises

References

Index
Preface

Next-generation sequencing (NGS) is a recently developed technology enabling
us to generate hundreds of billions of DNA bases from the samples. We can
use NGS to reconstruct the genome, understand genomic variations, recover
transcriptomes, and identify the transcription factor binding sites or the
epigenetic marks.
The NGS technology radically changes how we collect genomic data from
the samples. Instead of studying a particular gene or a particular genomic
region, NGS technologies enable us to perform unbiased genome-wide studies.
Although more raw data can be obtained from sequencing machines, we face
computational challenges in analyzing such big datasets. Hence, it is
important to develop efficient and accurate computational methods to analyze or
process such datasets. This book is intended to give an in-depth introduction
to such algorithmic techniques.
The primary audience of this book includes advanced undergraduate
students and graduate students from mathematics or computer science
departments. We assume that readers have some training in college-level
biology, statistics, discrete mathematics and algorithms.
This book was developed partly from the teaching material for the course
on Combinatorial Methods in Bioinformatics, which I taught at the National
University of Singapore, Singapore. The chapters in this book are classified
based on the application domains of the NGS technologies. In each chapter, a
brief introduction to the technology is first given. Then, different methods or
algorithms for analyzing such NGS datasets are described. To illustrate each
algorithm, detailed examples are given. At the end of each chapter, exercises
are given to help readers understand the topics.
Chapter 1 introduces the next-generation sequencing technologies. We
cover the three generations of sequencing, starting from Sanger sequencing
(first generation). Then, we cover second-generation sequencing, which
includes Illumina Solexa sequencing. Finally, we describe the third-generation
sequencing technologies, which include PacBio sequencing and nanopore
sequencing.
Chapter 2 introduces a few NGS file formats, which facilitate downstream
analysis and data transfer. They include fasta, fastq, SAM, BAM, BED, VCF
and WIG formats. Fasta and fastq are file formats for describing the raw
sequencing reads generated by the sequencers. SAM and BAM are file formats
for describing the alignments of the NGS reads on the reference genome. BED,
VCF and WIG formats are annotation formats.
To develop methods for processing NGS data, we need efficient algorithms
and data structures. Chapter 3 is devoted to briefly describing these
techniques.
Chapter 4 studies read mappers. Read mappers align the NGS reads on
the reference genome. The input is a set of raw reads in fasta or fastq files.
The read mapper will align each raw read on the reference genome, that is,
identify the region in the reference genome which is highly similar to the read.
Then, the read mapper will output all these alignments in a SAM or BAM
file. This is the basic step for many NGS applications. (It is the first step for
the methods in Chapters 6−9.)
Chapter 5 studies the de novo assembly problem. Given a set of raw reads
extracted from whole genome sequencing of some sample genome, de novo
assembly aims to stitch the raw reads together to reconstruct the genome.
It enables us to reconstruct the genomes of novel organisms such as plants and
bacteria. De novo assembly involves a few steps: error correction, contig
assembly (de Bruijn graph approach or base-by-base extension approach),
scaffolding and gap filling. This chapter describes techniques developed for
these steps.
Chapter 6 discusses the problem of identifying single nucleotide variations
(SNVs) and small insertions/deletions (indels) in an individual genome. The
genome of every individual is highly similar to the reference human genome.
However, each genome is still different from the reference genome. On average,
there is 1 single nucleotide variation in every 1000 bases and 1 small indel in
every 3000 bases. To discover these variations, we can first perform whole
genome sequencing or exome sequencing of the individual genome to obtain
a set of raw reads. After aligning the raw reads on the reference genome, we
use SNV callers and indel callers to call SNVs and small indels. This chapter
is devoted to discussing techniques used in SNV callers and indel callers.
Apart from SNVs and small indels, copy number variations (CNVs) and
structural variations (SVs) are the other types of variations that appear in our
genome. CNVs and SVs are not as frequent as SNVs and indels. Moreover, they
are more prone to change the phenotype. Hence, it is important to understand
them. Chapter 7 is devoted to studying techniques used in CNV callers and
SV callers.
All of the above technologies are related to genome sequencing. We can also
sequence RNA. This technology is known as RNA-seq. Chapter 8 studies methods
for analyzing RNA-seq. By applying computational methods to RNA-seq data,
we can recover the transcriptome. More precisely, RNA-seq enables us to
identify exons and splice junctions. Then, we can predict the isoforms of the genes.
We can also determine the expression of each transcript and each gene.
By combining chromatin immunoprecipitation and next-generation
sequencing, we can sequence genome regions that are bound by some
transcription factors or that carry epigenetic marks. Such technology is known as
ChIP-seq. The computational methods that identify those binding sites are known
as ChIP-seq peak callers. Chapter 9 is devoted to discussing computational
methods for this purpose.
As stated earlier, NGS data is huge, and NGS data files are usually
big. It is difficult to store and transfer NGS files. One solution is to
compress the NGS data files. A number of compression methods have
been developed, and some compression formats, such as BAM, bigBed and bigWig,
are used frequently in the literature. Chapter 10 aims to describe
these compression techniques. We also describe techniques that enable us to
randomly access the compressed NGS data files.
Supplementary material can be found at
http://www.comp.nus.edu.sg/~ksung/algo_in_ngs/.
I would like to thank my PhD supervisors Tak-Wah Lam and Hing-Fung Ting
and my collaborators Francis Y. L. Chin, Kwok Pui Choi, Edwin Cheung,
Axel Hillmer, Wing Kai Hon, Jansson Jesper, Ming-Yang Kao,
Caroline Lee, Nikki Lee, Hon Wai Leong, Alexander Lezhava, John Luk,
See-Kiong Ng, Franco P. Preparata, Yijun Ruan, Kunihiko Sadakane, Chialin Wei,
Limsoon Wong, Siu-Ming Yiu, and Louxin Zhang. My knowledge of NGS and
bioinformatics was enriched through numerous discussions with them. I would
like to thank Ramesh Rajaby, Kunihiko Sadakane, Chandana Tennakoon,
Hugo Willy, and Han Xu for helping to proofread some of the chapters. I
would also like to thank my parents Kang Fai Sung and Siu King Wong, my
three brothers Wing Hong Sung, Wing Keung Sung, and Wing Fu Sung, my
wife Lily Or, and my three kids Kelly, Kathleen and Kayden for their support.
Finally, if you have any suggestions for improvement or if you identify any
errors in the book, please send an email to me at [email protected]. I
thank you in advance for your helpful comments in improving the book.

Wing-Kin Sung
Chapter 1
Introduction

DNA stands for deoxyribonucleic acid. It was first discovered in 1869 by
Friedrich Miescher [58]. However, it was not until 1944 that Avery, MacLeod
and McCarty [12] demonstrated that DNA, not protein, is the major carrier of
genetic information. In 1953, James Watson and Francis Crick discovered
the basic structure of DNA, which is a double helix [310]. After that, people
started to work on DNA intensively.
DNA sequencing sprang to life in 1972, when Frederick Sanger (at the
University of Cambridge, England) began to work on the genome sequence using
a variation of the recombinant DNA method. The full DNA sequence of a viral
genome (bacteriophage φX174) was completed by Sanger in 1977 [259, 260].
Based on the power of sequencing, Sanger established genomics,1 which is the
study of the entirety of an organism’s hereditary information, encoded in DNA
(or RNA for certain viruses). Note that it is different from molecular biology
or genetics, whose primary focus is to investigate the roles and functions of
single genes.
Over the last few decades, DNA sequencing has improved rapidly. We can now
sequence the whole human genome within a day and compare multiple individual
human genomes. This book is devoted to understanding the bioinformatics
issues related to DNA sequencing. In this introduction, we briefly review DNA,
RNA and protein. Then, we describe various sequencing technologies. Lastly,
we describe the applications of sequencing technologies.

1.1 DNA, RNA, protein and cells


Deoxyribonucleic acid (DNA) is used as the genetic material (with the
exception that certain viruses use RNA as the genetic material). The basic
building block of DNA is the DNA nucleotide. There are 4 types of DNA
nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). The DNA
nucleotides can be chained together to form a strand of DNA. Each strand of
DNA is asymmetric. It begins at the 5’ end and ends at the 3’ end.

1 The actual term “genomics” is thought to have been coined by Dr. Tom Roderick, a
geneticist at the Jackson Laboratory (Bar Harbor, ME) at a meeting held in Maryland on
the mapping of the human genome in 1986.

5’ - A   C   G   T   A   G   C   T  - 3’
     ||  ||| ||| ||  ||  ||| ||| ||
3’ - T   G   C   A   T   C   G   A  - 5’

FIGURE 1.1: The double-stranded DNA. The two strands show a complementary
base pairing.

When two opposing DNA strands satisfy the Watson-Crick rule, they can
be interwoven together by hydrogen bonds and form a double-stranded DNA.
The Watson-Crick rule (or complementary base pairing rule) requires that the
two nucleotides in opposing strands be a complementary base pair, that is,
they must be an (A, T) pair or a (C, G) pair. (Note that A = T and C ≡ G are
bound with the help of two and three hydrogen bonds, respectively.) Figure 1.1
gives an example double-stranded DNA. One strand is ACGTAGCT while the
other strand is its reverse complement, i.e., AGCTACGT.
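
As a small illustration of the Watson-Crick rule, the reverse complement of a strand can be computed mechanically. The following is a minimal Python sketch; the function name reverse_complement is chosen here for illustration and does not come from the text.

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposing strand, written in its own 5'-to-3' direction."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return strand.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTAGCT"))  # AGCTACGT, the pair shown in Figure 1.1
```

Applying the function twice returns the original strand, which is a quick sanity check for any implementation.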
The double-stranded DNAs are located in the nucleus (and mitochondria)
of every cell. A cell can contain multiple pieces of double-stranded DNA, each
of which is called a chromosome. As a whole, the collection of chromosomes is
called a genome; the human genome consists of 23 pairs of chromosomes, and its
total length is roughly 3 billion base pairs.
The genome provides the instructions for the cell to perform daily life
functions. Through the process of transcription, the machine RNA polymerase
transcribes genes (the basic functional units) in our genome into transcripts
(or RNA molecules). This process is known as gene expression. The complete
set of transcripts in a cell is denoted as its transcriptome.
Each transcript is a chain of 4 different ribonucleic acid (RNA) nucleotides:
adenine (A), guanine (G), cytosine (C) and uracil (U). The main difference be-
tween the DNA nucleotide and the RNA nucleotide is that the RNA nucleotide
has an extra OH group. This extra OH group enables the RNA nucleotide to
form more hydrogen bonds. Transcripts are usually single stranded instead of
double stranded.
There are two types of transcripts: non-coding RNA (ncRNA) and messenger
RNA (mRNA). ncRNAs are transcripts that are not translated into proteins.
They can be classified into transfer RNAs (tRNAs), ribosomal RNAs (rRNAs),
short ncRNAs (of length < 30 bp, including miRNA, siRNA and piRNA) and
long ncRNAs (of length > 200 bp; examples include Xist and HOTAIR).
mRNA is the intermediate between DNA and protein. Each mRNA consists
of three parts: a 5’ untranslated region (5’ UTR), a coding region and
a 3’ untranslated region (3’ UTR). The length of the coding region is a
multiple of 3. It is a sequence of triplets of nucleotides called codons. Each
codon corresponds to an amino acid.
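
The codon structure of a coding region can be sketched in a few lines of Python. The example below assumes the standard genetic code (the text does not spell out the codon table) and stops at the first stop codon; the names CODON_TABLE and translate are illustrative only.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# U, C, A, G order at each position (UUU, UUC, UUA, UUG, UCU, ...).
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_region: str) -> str:
    """Translate an mRNA coding region into a string of one-letter amino acids."""
    if len(coding_region) % 3 != 0:
        raise ValueError("the coding region length must be a multiple of 3")
    protein = []
    for i in range(0, len(coding_region), 3):
        amino_acid = CODON_TABLE[coding_region[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUGGUAA"))  # MAW
```

Here AUG, GCC and UGG encode methionine (M), alanine (A) and tryptophan (W), and UAA is a stop codon.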
Through translation, the machine ribosome translates each mRNA into a
Introduction 3

protein, which is the sequence of amino acids corresponding to the sequence of
codons in the mRNA. Proteins fold into complex 3D structures. Each protein is
a biological nanomachine that performs a specialized function. For example,
enzymes are proteins that work as catalysts to promote chemical reactions
for generating energy or digesting food. Other proteins, called transcription
factors, interact with the genome to turn transcription on or off. Through
the interaction among DNA, RNA and protein, our genome dictates which
cells should grow, when cells should die, how cells should be structured, and
how various body parts are formed.
All cells in our body are developed from a single cell through cell division.
When a cell divides, the double-helix genome is separated into single-stranded
DNA molecules. An enzyme called DNA polymerase uses each single-stranded
DNA molecule as the template to replicate the genome into two identical
double helices. By this replication process, all cells within the same individual
will have the same genome. However, due to errors in copying, some variations
(called mutations) might happen in some cells. Those variations or mutations
may cause diseases such as cancer.
Different individuals have similar genomes, but they also have genome
variations that contribute to different phenotypes. For example, the colors of
our hair and eyes are determined by differences in our genomes. By
studying and comparing genomes of different individuals, researchers develop
an understanding of the factors that cause different phenotypes and diseases.
Such knowledge ultimately helps to gain insights into the mystery of life and
contributes to improving human health.

1.2 Sequencing technologies


DNA sequencing is a process that determines the order of the nucleotide
bases. It translates the DNA of a specific organism into a format that is deci-
pherable by researchers and scientists. DNA sequencing has allowed scientists
to better understand genes and their roles within our body. Such knowledge
has become indispensable for understanding biological processes, as well as in
application fields such as diagnostic or forensic research. The advent of DNA
sequencing has significantly accelerated biological research and discovery.
To facilitate genomic studies, we need to sequence the genomes of different
species or different individuals. A number of sequencing technologies have
been developed during the last decades. Roughly speaking, the development
of sequencing technologies consists of three phases:

• First-generation sequencing: Sequencing based on chemical degradation
  and gel electrophoresis.

• Second-generation sequencing: Sequencing many DNA fragments in parallel.
  It has higher yield and lower cost, but shorter reads.

• Third-generation sequencing: Sequencing a single DNA molecule without
  the need to halt between read steps.

In this section, we will discuss the three phases in detail.

1.3 First-generation sequencing


Sanger and Coulson proposed the first-generation sequencing in 1975 [259,
260]. It enables us to sequence a DNA template of length 500–1000 bp within a
few hours. The detailed steps are as follows (see Figure 1.3).

1. Amplify the DNA template by cloning.

2. Generate all possible prefixes of the DNA template.

3. Separation by electrophoresis.

4. Readout with fluorescent tags.

Step 1 amplifies the DNA template. The DNA template is inserted into
the plasmid vector; then the plasmid vector is inserted into the host cells for
cloning. By growing the host cells, we obtain many copies of the same DNA
template.
Step 2 generates all possible prefixes of the DNA template. Two tech-
niques have been proposed for this step: (1) the Maxam-Gilbert technique [194]
and (2) the chain termination methodology (Sanger method) [259, 260]. The
Maxam-Gilbert technique relies on the cleaving of nucleotides by chemicals.
Four different chemicals are used to generate all sequences ending with A, C, G
and T, respectively. This allows us to generate all possible prefixes of the
template. This technique is most efficient for short DNA sequences. However, it
is considered unsafe because of the extensive use of toxic chemicals.
The chain termination methodology (Sanger method) is a better alter-
native. Given a single-stranded DNA template, the method performs DNA
polymerase-dependent synthesis in the presence of (1) natural deoxynu-
cleotides (dNTPs) and (2) dideoxynucleotides (ddNTPs). ddNTPs serve as
non-reversible synthesis terminators (see Figure 1.2(a,b)). The DNA synthesis
reaction is randomly terminated whenever a ddNTP is added to the growing
oligonucleotide chain, resulting in truncated products of varying lengths with
an appropriate ddNTP at their 3’ terminus.
After we obtain all possible prefixes of the DNA template, the product is
a mixture of DNA fragments of different lengths. We can separate these DNA

[Figure 1.2, panels (a) and (b): the growing strand CGTA paired with the template,
before and after incorporation of (a) dATP and (b) ddATP; each reaction releases
H+ and PPi.]

FIGURE 1.2: (a) The chemical reaction for the incorporation of dATP into
the growing DNA strand. (b) The chemical reaction for the incorporation of
ddATP into the growing DNA strand. The vertical bar behind A indicates
that the extension of the DNA strand is terminated.

[Figure 1.3: DNA template → insert into vector → insert into host cell → cloning →
cyclic sequencing (template 3’-GCATCGGCATATG...-5’ with primer 5’-CGTA, giving the
terminated fragments CGTAG, CGTAGC, CGTAGCC, ..., CGTAGCCGTATAC) →
electrophoresis & readout (GCCGTATAC).]

FIGURE 1.3: The steps of Sanger sequencing.

fragments by their lengths using gel electrophoresis (Step 3). Gel electrophore-
sis is based on the fact that DNA is negatively charged. When an electrical
field is applied to a mixture of DNA on a gel, the DNA fragments will move
from the negative pole to the positive pole. Due to friction, short DNA frag-
ments travel faster than long DNA fragments. Hence, the gel electrophoresis
separates the mixture into bands, each containing DNA molecules of the same
length.
Using the fluorescent tags attached to the terminal ddNTPs (we have
4 different colors for the 4 different ddNTPs), the DNA fragments ending
with different nucleotides will be labeled with different fluorescent dyes. By
detecting the light emitted from different bands, the DNA sequence of the
template will be revealed (Step 4).
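
Steps 2 to 4 can be mimicked in software. The following Python sketch is only a toy model under simplifying assumptions: it works directly on the synthesized strand from Figure 1.3, assumes every prefix is produced exactly once, and ignores all chemistry. The function name sanger_read and its parameters are hypothetical.

```python
import random


def sanger_read(synthesized: str, primer: str, seed: int = 0) -> str:
    """Recover the bases that follow the primer from chain-terminated fragments."""
    if not synthesized.startswith(primer):
        raise ValueError("the synthesized strand must start with the primer")
    # Step 2: every prefix that extends the primer by at least one base.
    # Each such fragment ends with a dye-labeled ddNTP.
    fragments = [
        synthesized[:n] for n in range(len(primer) + 1, len(synthesized) + 1)
    ]
    # The reaction product is an unordered mixture of fragments.
    random.Random(seed).shuffle(fragments)
    # Step 3: electrophoresis separates the fragments by length.
    fragments.sort(key=len)
    # Step 4: the dye on the last base of each band gives one base of the read.
    return "".join(fragment[-1] for fragment in fragments)


print(sanger_read("CGTAGCCGTATAC", "CGTA"))  # GCCGTATAC, the readout in Figure 1.3
```

Sorting by length is what makes the terminal bases spell out the sequence; the shuffle merely stresses that the fragments carry no order before separation.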
In summary, the Sanger method can generate sequences of length ∼800 bp.
The process can be fully automated and hence it was a popular DNA sequencing
method from the late 1970s to 2000. However, it is expensive and the throughput
is low. It can only process a limited number of DNA fragments per unit of time.

1.4 Second-generation sequencing


Second-generation sequencing can generate hundreds of millions of short
reads per instrument run. When compared with first-generation sequencing,
it has the following advantages: (1) it uses clone-free amplification, and (2) it
can sequence many reads in parallel. Some commercially available technologies
include Roche/454, Illumina, ABI SOLiD, Ion Torrent, Helicos BioSciences
and Complete Genomics.
In general, second-generation sequencing involves the following two main
steps: (1) template preparation and (2) base calling in parallel.
Section 1.4.1 describes Step 1, while Section 1.4.2 describes Step 2.

1.4.1 Template preparation

Given a set of DNA fragments, the template preparation step first gener-
ates a DNA template for each DNA fragment. The DNA template is created
by ligating adaptor sequences to the two ends of the target DNA fragment (see
Figure 1.4(a)). Then, the templates are amplified using PCR. There are two
common methods for amplifying the templates: (1) emulsion PCR (emPCR)
and (2) solid-phase amplification (Bridge PCR).
emPCR amplifies each DNA template on a bead. First, one DNA template
and one bead are enclosed within a water drop in oil. The surface
of every bead is coated with a primer corresponding to one type of adaptor.
The DNA template hybridizes with one primer on the surface of the bead.
Then, it is PCR amplified within the water drop. Figure 1.4(b) illustrates
emPCR. emPCR is used by 454, Ion Torrent and SOLiD.
For bridge PCR, the amplification is done on a flat surface (say, glass),
which is coated with two types of primers, corresponding to the adaptors.
Each DNA template is first hybridized to one primer on the flat surface.
Amplification proceeds in cycles, with one end of each bridge tethered to the
surface. Figure 1.4(c) illustrates the bridge PCR process. Bridge PCR is used
by Illumina.
Although PCR can amplify DNA templates, there is amplification bias.
Experiments revealed that templates that are AT-rich or GC-rich have a lower
amplification efficiency. This limitation creates uneven sequencing of the DNA
templates in the sample.
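
Whether a template is AT-rich or GC-rich can be checked by computing its GC content. The Python sketch below is illustrative only: the thresholds 0.3 and 0.7 are hypothetical values chosen for the example, not figures from the text, and the function names are likewise assumptions.

```python
def gc_content(template: str) -> float:
    """Fraction of bases in the template that are G or C."""
    if not template:
        raise ValueError("the template must not be empty")
    return sum(base in "GC" for base in template) / len(template)


def may_amplify_poorly(template: str, low: float = 0.3, high: float = 0.7) -> bool:
    """Flag AT-rich or GC-rich templates (the thresholds are hypothetical)."""
    gc = gc_content(template)
    return gc < low or gc > high


print(gc_content("ACGTAGCT"))          # 0.5
print(may_amplify_poorly("ACGTAGCT"))  # False
print(may_amplify_poorly("AATTAATT"))  # True
```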

[Figure 1.4: (a) templates; (b) templates and beads in a water drop in oil: the
template binds to the bead, followed by PCR for a few rounds; (c) bridge PCR.]

FIGURE 1.4: (a) From the DNA fragments, DNA templates are created by
attaching adaptor sequences to the two ends. (b) Amplifying the template
by emPCR. (c) Amplifying the template by bridge PCR.

1.4.2 Base calling


Now we have many PCR clones of amplified templates (see Figure 1.5(a)).
This step aims to read the DNA sequences from the amplified templates in
parallel. This method is called the cyclic-array method. There are two ap-
proaches: the polymerase-mediated method (also called sequencing by syn-
thesis) and the ligase-mediated method (also called sequencing by ligation).
The polymerase-mediated method is further divided into methods based on re-
versible terminator nucleotides and methods based on unmodified nucleotides.
Below, we will discuss these approaches.

1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides

A reversible terminator nucleotide is a modified nucleotide. Similar to
ddNTPs, during the DNA polymerase-dependent synthesis, if a reversible ter-
minator nucleotide is incorporated onto the DNA template, the DNA synthesis
is terminated. Moreover, we can reverse the termination and restart the DNA
synthesis.
Figure 1.5(b) demonstrates how we use reversible terminator nucleotides
for sequencing. First, we hybridize the primer on the adaptor of the template.
Then, by DNA polymerase, a reversible terminator nucleotide is incorporated
onto the template. After that, we scan the signal of the dye attached to the

[Figure 1.5: (a) PCR clones on a flat surface; (b) add reversible terminator
dGTP → wash & scan → after scanning, reverse the termination → repeat the steps
to sequence other bases.]

FIGURE 1.5: Polymerase-mediated sequencing methods based on reversible
terminator nucleotides. (a) PCR clones of the DNA templates are evenly
distributed on a flat surface. Each PCR clone contains many DNA templates of
the same type. (b) The steps of polymerase-mediated methods based on
reversible terminator nucleotides.

reversible terminator nucleotide by imaging. After imaging, the termination
is reversed by cleaving the dye-nucleotide linker. By repeating the steps, we
can sequence the complete DNA template.
Two commercial sequencers use this approach. They are Illumina and He-
licos BioSciences.
The Illumina sequencer amplifies the DNA templates by bridge PCR.
Then, all PCR clones are distributed on the glass plate. By using the four-
color cyclic reversible termination (CRT) cycle (see Figure 1.6(b)), we can
sequence all the DNA templates in parallel.
The major error of Illumina sequencing is substitution error, with a higher
proportion of errors occurring when the previously incorporated nucleotide is a
base G.
Another major limitation of Illumina sequencing is that the accuracy decreases
as the number of nucleotide addition steps increases. The errors accumulate due
to the failure in cleaving off the fluorescent tags or due to errors in controlling the

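[Figure 1.6: (a) a flat surface with many PCR clones, with the templates of two
clones shown; (b) four-color CRT cycles; (c) one-color CRT cycles with the base
order A C G T A C G T.]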

FIGURE 1.6: Polymerase-mediated sequencing methods based on reversible
terminator nucleotides. (a) A flat surface with many PCR clones. In particular,
we show the DNA templates for two clones. (b) Four-color cyclic reversible
termination (CRT) cycle. Within each cycle, we extend the template of each
PCR clone by one base. The color indicates the extended base. Precisely, the
four colors, dark gray, black, white and light gray, correspond to the four
nucleotides A, C, G and T, respectively. (c) One-color cyclic reversible termination
(CRT) cycle. Each cycle tries to extend the template of each PCR clone by
one particular base. If the extension is successful, the white color is lit
up.

reversible terminator nucleotides. As a result, bases may fail to be incorporated
into the template strand, or extra bases may be incorporated [190].
Helicos BioSciences does not perform PCR amplification. It is a
single-molecule sequencing method. It first immobilizes the DNA templates on the
flat surface. Then, all DNA templates on the surface are sequenced in parallel
by using a one-color cyclic reversible termination (CRT) cycle (see
Figure 1.6(c)). Note that this technology can also be used to sequence RNA
directly by using reverse transcriptase instead of DNA polymerase. However,
the reads generated by Helicos BioSciences are very short (∼25 bp). It is also
slow and expensive.

1.4.4 Polymerase-mediated methods based on unmodified nucleotides

The previous methods require the use of modified nucleotides. Actually,
we can sequence the DNA templates using unmodified nucleotides. The basic
observation is that the incorporation of a deoxyribonucleotide triphosphate
(dNTP) into a growing DNA strand involves the formation of a covalent bond
and the release of pyrophosphate and a positively charged hydrogen ion (see
Figure 1.2). Hence, it is possible to sequence the DNA template by detecting
the concentration change of pyrophosphate or hydrogen ion. Roche 454 and
Ion Torrent are two sequencers which take advantage of this principle.
The Roche 454 sequencer performs sequencing by detecting pyrophosphates.
This is called pyrosequencing. First, the 454 sequencer uses emPCR to
amplify the templates. Then, amplified beads are loaded into an array of wells.
(Each well contains one amplified bead which corresponds to one DNA template.)
In each iteration, a single type of dNTP flows across the wells. If the
dNTP is complementary to the template in a well, polymerase will extend by
one base and release pyrophosphate. With the help of the enzymes sulfurylase and
luciferase, the pyrophosphate is converted into visible light. The CCD camera
detects the light signal from all wells in parallel. For each well, the light
intensity generated is recorded as a flowgram. For example, if the DNA template
in a well is TCGGTAAAAAACAGTTTCCT, Figure 1.7 is the corresponding
flowgram. Precisely, the light signal can be detected only when the dNTP that
flows across the well is complementary to the template. If the template has
a homopolymer of length k, the light intensity detected is k-fold higher. By
interpreting the flowgram, we can recover the DNA sequence.
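
The relation between a sequence and its flowgram can be sketched in Python. This is an idealized, noise-free model under stated assumptions: the input is taken to be the strand that the flowgram spells out (as in Figure 1.7), the flow order is A, C, G, T repeated, and the signal is exactly the homopolymer length. The function names flowgram and call_bases are hypothetical.

```python
def flowgram(sequence: str, flow_order: str = "ACGT") -> list[int]:
    """Ideal signal per flow: the homopolymer length extended in that flow."""
    if set(sequence) - set(flow_order):
        raise ValueError("the sequence contains a base that is never flowed")
    signals = []
    position = 0
    flow = 0
    while position < len(sequence):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Every consecutive copy of the flowed base is incorporated at once.
        while position < len(sequence) and sequence[position] == base:
            run += 1
            position += 1
        signals.append(run)
        flow += 1
    return signals


def call_bases(signals: list[int], flow_order: str = "ACGT") -> str:
    """Interpret an ideal flowgram: a signal of k means k copies of the base."""
    return "".join(
        flow_order[i % len(flow_order)] * k for i, k in enumerate(signals)
    )


print(flowgram("AACG"))       # [2, 1, 1]
print(call_bases([2, 1, 1]))  # AACG

template = "TCGGTAAAAAACAGTTTCCT"
assert call_bases(flowgram(template)) == template
```

In a real instrument the intensity is a noisy real number, so rounding it to an integer is where long homopolymers become hard to call.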
However, when the homopolymer is long (say longer than 6), the detector
is not sensitive enough to report the correct length of the homopolymer.
Therefore, the Roche 454 sequencer has a higher rate of indel errors.
Ion Torrent was created by the same person as Roche 454. It is the first
semiconductor sequencing chip available on the commercial market. Instead of
detecting pyrophosphate, it performs sequencing by detecting hydrogen ions.
The basic method of Ion Torrent is the same as that of Roche 454. It also uses
emPCR to amplify the templates and the amplified beads are also loaded into

[Figure 1.7: bar chart of light intensity (1 to 6) for each flow in the order
ACGTACGTACGTACGTACGT.]

FIGURE 1.7: The flowgram for the DNA sequence TCGGTAAAAAACAGTTTCCT.

a high-density array of wells, and each well contains one template. In each
iteration, a single type of dNTP flows across the wells. If the dNTP is
complementary to the template, polymerase will extend by one base and release H+.
The release of H+ changes the pH of the solution in the well, and an ISFET
sensor at the bottom of the well measures the pH change and converts it
into electric signals [251]. The sensor avoids the use of optical measurements,
which require a complicated camera and laser. This is the main difference
between Ion Torrent sequencing and 454 sequencing. The unattached dNTP
molecules are washed out before the next iteration. By interpreting the
flowgram obtained from the ISFET sensor, we can recover the sequences of the
templates.
Since the method used by Ion Torrent is similar to that of Roche 454, it
also has the disadvantage that it cannot distinguish long homopolymers.

1.4.5 Ligase-mediated method


Instead of extending the template base by base using polymerase,
ligase-mediated methods use probes to check the bases on the template. ABI SOLiD
is the commercial sequencer that uses this approach. In SOLiD, the templates
are first amplified by emPCR. After that, millions of templates are placed on
a plate. SOLiD then tries to probe the bases of all templates in parallel. In
every iteration, SOLiD probes two adjacent bases of each template, i.e., it uses
two-base color encoding. The color coding scheme is shown in the following
table. For example, the DNA template ATGGA is coded as A3102.

         Second base
         A   C   G   T
First A  0   1   2   3
base  C  1   0   3   2
      G  2   3   0   1
      T  3   2   1   0

The primary advantage of the two-base color encoding is that it improves
single nucleotide variation (SNV) calling. Since every base is covered by
two color bases, it reduces the error rate for calling SNVs. However, conversion
from color bases to nucleotide bases is not simple. Errors may be generated
during the conversion process.
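
The two-base color encoding and its conversion back to nucleotides can be sketched in Python. The sketch relies on an observation about the color table rather than on anything stated in the text: if A, C, G and T are numbered 0, 1, 2 and 3, each color equals the bitwise XOR of the two adjacent base numbers. The function names encode and decode are illustrative.

```python
BASE_TO_NUMBER = {"A": 0, "C": 1, "G": 2, "T": 3}
NUMBER_TO_BASE = "ACGT"


def encode(sequence: str) -> str:
    """Color-space form: the first base followed by one color per adjacent pair."""
    colors = "".join(
        str(BASE_TO_NUMBER[a] ^ BASE_TO_NUMBER[b])
        for a, b in zip(sequence, sequence[1:])
    )
    return sequence[0] + colors


def decode(color_read: str) -> str:
    """Convert back to nucleotides, one base at a time from the known first base."""
    bases = [color_read[0]]
    for color in color_read[1:]:
        # Each base depends on the previously decoded base and the current color.
        bases.append(NUMBER_TO_BASE[BASE_TO_NUMBER[bases[-1]] ^ int(color)])
    return "".join(bases)


print(encode("ATGGA"))  # A3102
print(decode("A3102"))  # ATGGA
print(decode("A3002"))  # ATTTC
```

The last line shows why the conversion is delicate: a single wrong color (1 misread as 0) changes every base downstream of it, whereas a true SNV changes two adjacent colors and leaves the rest of the read intact.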
In summary, second-generation sequencing enables us to generate hundreds
of billions of bases per run. However, each run takes days to finish due to the
large number of scanning and washing cycles. The addition of a base in each
cycle is not 100% accurate, which causes sequencing errors. Furthermore, base
extensions of some strands may lag behind or run ahead. Hence, errors accumulate
as the reads get longer. This is the reason why second-generation sequencing
cannot produce very long reads. Finally, due to the PCR amplification bias,
this approach may miss some templates with high or low GC content.

1.5 Third-generation sequencing


Although many of us are still using second-generation sequencing, third-
generation sequencing is coming. There is no fixed definition for third-
generation sequencing yet. Here, we define it as a single molecule sequencing
(SMS) technology without the need to halt between read steps (whether enzy-
matic or otherwise). A number of third-generation sequencing methods have
been proposed. They include:

• Single-molecule real-time sequencing

• Nanopore-sequencing technologies

• Direct imaging of individual DNA molecules using advanced microscopy
  techniques

1.5.1 Single-molecule real-time sequencing


Pacific BioSciences released their PacBio RS sequencing platform [71].
Their approach is called single-molecule real-time (SMRT) sequencing. It
mimics what happens in our body as cells divide and copy their DNA with the
DNA polymerase machine. Precisely, PacBio RS immobilizes DNA polymerase
molecules on an array slide. When the DNA template gets in touch with the
DNA polymerase, DNA synthesis happens with four fluorescently labeled
nucleotides. By detecting the light emitted, PacBio RS reconstructs the DNA
sequences. Figure 1.8 illustrates the SMRT sequencing approach.
PacBio RS sequencing requires no prior amplification of the DNA template.
Hence, it has no PCR bias. It can achieve more uniform coverage and lower GC
bias when compared with Illumina sequencing [79]. It can read long sequences

FIGURE 1.8: The illustration of PacBio sequencing. On an array slide,
there are a number of immobilized DNA polymerase molecules. When a DNA
template gets in touch with the DNA polymerase (see the polymerase at the
bottom right), DNA synthesis happens with the fluorescently labeled
nucleotides. By detecting the emitted light signal, we can reconstruct the
DNA sequence.

of length up to 20,000 bp, with an average read length of about 10,000 bp.
Another advantage of PacBio RS is that it can detect methylation status
simultaneously.
However, PacBio sequencing is more costly. It is about 3–4 times more
expensive than short read sequencing. Also, PacBio RS has a high error rate
of up to 17.9% [46]. The majority of the errors are indel errors [71]. Luckily,
the error rate is unbiased and almost constant throughout the entire read
length [146]. By repeatedly sequencing the same DNA template, we can reduce
the error rate.

1.5.2 Nanopore sequencing method


A nanopore is a pore of nano size on a thin membrane. When a voltage
is applied across the membrane, charged molecules that are small enough can
move from the negative well to the positive well. Moreover, molecules with
different structures will have different efficiencies in passing through the pore
and affect the electrical conductivity. By studying the electrical conductivity,
we can determine the molecules that pass through the pore.
This idea has been used in a number of methods for sequencing DNA.
These methods are called nanopore sequencing methods. Since nanopore
methods use unmodified DNA, they require an extremely small amount of input
material. They also have the potential to sequence long DNA reads efficiently
at low cost. There are a number of companies working on nanopore
sequencing methods. They include (1) Oxford Nanopore, (2) IBM
Transistor-mediated DNA sequencing, (3) Genia and (4) NABsys.
Oxford nanopore technology detects nucleotides by measuring the ionic
current flowing through the pore. It allows the single-strand DNA sequence to

FIGURE 1.9: An illustration of the sequencing technique of Oxford
nanopore.

flow through the pore continuously. As illustrated in Figure 1.9, DNA material
is placed in the top chamber. The positive charge draws a strand of DNA
from the top chamber to the bottom chamber through the
nanopore. By detecting the changes in the electrical conductivity in the
pore, the DNA sequence is decoded. (Note that IBM’s DNA transistor is a
prototype which uses a similar idea.)
This approach has difficulty in calling individual bases accurately.
Instead, Oxford nanopore technology reads the signal of k (say 5) bases
in each round. Then, using a hidden Markov model, the DNA sequence can be
decoded base by base.
Oxford nanopore technology has announced two sequencers: MinION and
GridION. MinION is a disposable USB-key sequencer. GridION is an
expandable sequencer. Oxford nanopore technology claimed that GridION can
sequence a human genome at 30x coverage in 6 days at US$2200–$3600. It
has the potential to decode a DNA fragment of length 100,000 bp. Its cost is
about US$25–$40 per gigabase. Although it is not expensive, the error rate is
about 17.8% (4.9% insertion error, 7.8% deletion error and 5.1% substitution
error) [115].
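
The quoted price range can be cross-checked with simple arithmetic. The Python sketch below assumes a human genome of roughly 3 billion base pairs (the figure given in Section 1.1), so 30x coverage corresponds to about 90 billion sequenced bases; the function name and constants are chosen for illustration.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome length (Section 1.1)
COVERAGE = 30                   # each base is sequenced about 30 times


def cost_per_gigabase(run_cost_usd: float,
                      coverage: int = COVERAGE,
                      genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Cost of one billion sequenced bases, given the cost of a whole run."""
    sequenced_gigabases = coverage * genome_size_bp / 1e9  # 90 for the defaults
    return run_cost_usd / sequenced_gigabases


for run_cost in (2200, 3600):
    print(f"US${run_cost}: about US${cost_per_gigabase(run_cost):.1f} per billion bases")
# US$2200: about US$24.4 per billion bases
# US$3600: about US$40.0 per billion bases
```

The result, roughly US$25 to US$40 per billion bases, agrees with the per-unit cost quoted above.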
Unlike Oxford nanopore technology, Genia suggested combining a nanopore
and the DNA polymerase to sequence a single-strand DNA template. In Genia,
the DNA polymerase is tethered to a biological nanopore. When a DNA
template gets in touch with the DNA polymerase, DNA synthesis happens
with four engineered nucleotides for A, C, G and T, each attached to a
different short tag. When a nucleotide is incorporated into the DNA template,
the tag is cleaved and travels through the biological nanopore, and an
electric signal is measured. Since different nucleotides have different tags, we
can reconstruct the DNA template by measuring the electric signals.
NABsys is another nanopore sequencer. It first chops the genome into DNA
fragments of length 100,000 bp. The DNA fragments are hybridized with a
particular probe so that specific short DNA sequences on the DNA fragments


FIGURE 1.10: Consider a DNA fragment hybridized with a particular
probe. After it passes through the nanopore (see (a)), an electrical signal
profile is obtained (see (b)). By aligning the electrical signal profiles
generated from a set of DNA fragments, we obtain the probe map for a genome
(see (c)).

are bound by the probes. Those DNA fragments with bound probes are
driven through a nanopore (see Figure 1.10(a)), creating a current-versus-time
trace. The trace gives the positions of the probes on the fragment
(see Figure 1.10(b)). We can align the fragments based on their inter-probe
distances; then, we obtain a probe map for the genome (see Figure 1.10(c)).
We can obtain the probe maps for different probes. By aligning all of them,
we obtain the whole genome.

Unlike Genia, Oxford nanopore technology and the IBM DNA transistor,
NABsys does not require a very accurate current measurement from the
nanopore. The company claims that this method is cheap, fast and accurate,
and that it produces long reads.

1.5.3 Direct imaging of DNA using electron microscopy

Another choice is to use direct imaging. ZS Genetics is developing methods
based on transmission electron microscopy (TEM). Reveo is developing
a technology based on scanning tunneling microscope (STM) tips. DNA is
placed on a conductive surface for detecting bases electronically using STM
tips and tunneling current measurements. Both approaches have the potential
to sequence very long reads (millions of bases) at low cost. However, they are
still in the development phase. No sequencing machine is available yet.

TABLE 1.1: Comparison of the three generations of sequencing

                   First generation      Second generation         Third generation
Amplification      In-vivo cloning and   In-vitro PCR              Single molecule
                   amplification
Sequencing         Electrophoresis       Cyclic array sequencing   Nanopore, electronic microscopy
                                                                   or real-time monitoring of PCR
Starting material  More                  Less (< 1 µg)             Even less
Cost               Expensive             Cheap                     Very cheap
Time               Very slow             Fast                      Very fast
Read length        About 800 bp          Short                     Very long
Accuracy           < 1% error            < 1% error (mismatch or   High error rate
                                         homopolymer error)

1.6 Comparison of the three generations of sequencing

We have discussed the technologies of the three generations of sequencing.
Table 1.1 summarizes their key features. Currently, we are in the late phase
of second-generation sequencing and at the early phase of third-generation
sequencing. We can already see a dramatic drop in sequencing cost. Figure 1.11
shows the sequencing cost over time. Cost per genome is calculated based on
6-fold coverage for Sanger sequencing, 10-fold coverage for 454 sequencing
and 30-fold coverage for Illumina (or SOLiD) sequencing. Note that
the sequencing cost does not include the data management cost and
the bioinformatics analysis cost. There was a sudden reduction in
sequencing cost in January 2008, which is due to the introduction of
second-generation sequencing. In the future, the sequencing cost is expected to drop
further.
[Figure 1.11: plot with a logarithmic cost axis (US$0.01 to US$100,000,000) and a
time axis (Sep-01 to Sep-15); legend: Cost per Mb of DNA bases, Cost per Genome.]

FIGURE 1.11: The sequencing cost over time. There are two curves. The
blue curve shows the sequencing cost per million sequenced bases while
the red curve shows the sequencing cost per human genome. (Data is obtained
from http://www.genome.gov/sequencingcosts.)

1.7 Applications of sequencing


The previous section describes three generations of sequencing. This sec-
tion describes their applications.
Genome assembly: Genome assembly aims to reconstruct the genome
of some species. Since our genome is long, we still cannot read the whole
chromosome in one step. The current solution is to sequence the fragments of
the genome one by one using a sequencer. Then, by overlapping the fragments
computationally, the complete genome is reconstructed.
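
The idea of reconstructing a sequence by overlapping fragments can be illustrated with a toy greedy merger in Python. This is a deliberately simplified sketch, not the method used by real assemblers: it assumes error-free fragments from a single strand, ignores repeats, and the names overlap and greedy_assemble as well as the minimum overlap of 3 are assumptions made for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0  # no overlap of at least min_len bases


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]
    return contigs


print(greedy_assemble(["ACGTAGC", "TAGCTTGA", "TTGACCA"]))  # ['ACGTAGCTTGACCA']
```

Real genomes contain repeats and reads contain errors, so practical assemblers need considerably more machinery than this greedy loop.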
Many genome assembly projects have been finished. We have obtained
the reference genomes for many species, including human, mouse, rice, etc.
The human genome project is probably the most important genome assembly
project. This project started in 1984 and was declared complete in 2003. The
project cost more than US$3 billion. Although it was expensive, the project
enables us to have a better understanding of the human genome. Given the
human reference genome, researchers can examine the list of genes in humans.
We know that the number of protein coding genes in humans is about
20,000, which covers only 3% of the whole genome. Subsequently, we can also
understand the differences among individuals and understand the differences
between cancerous and healthy human genomes.
The project also improved the genome assembly process. It led to the
whole genome shotgun approach, which is the most common assembly approach
nowadays. By coupling the whole genome shotgun approach and
next-generation sequencing, we obtain the reference genomes of many species. (See
Chapter 5 for methods to reconstruct a genome.)
Genome variations finding: The genome of each individual is different
from that of the reference human genome. Roughly speaking, there are four
types of genome variations: single nucleotide variations (SNVs), short indels,
copy number variations (CNVs) and structural variations (SVs). Figure 6.1
illustrates these four types of variations. Genome variations can cause cancer.
For example, in chronic myelogenous leukemia (CML), a translocation
exists between chromosome 9 and chromosome 22, which fuses the ABL1 and
BCR genes together to form a fusion gene, BCR-ABL1. Such a translocation
is known to be present in 95 percent of patients with CML. Another example
occurs with a deletion in chromosome 21 that fuses the ERG and TMPRSS2
genes. The TMPRSS2-ERG fusion is seen in approximately 50 percent of
prostate cancers, and researchers have found that this fusion enhances the
invasiveness of prostate cancer. Genome sequencing of cancers enables us to
identify the variation of each individual. Apart from genome variations in can-
cers, many novel disease-causing variations have been discovered for childhood
diseases and neurological diseases. In the future, we expect everyone will per-
form genome sequencing. Depending on the variations, different personalized
therapeutics can be applied to different patients. This is known as
personalized medicine or stratified medicine. (See Chapters 6 and 7 for
methods to call genome variations.)
Reconstructing the transcriptome: Although every human cell has
the same human genome, human cells in different tissues express different
sets of genes at different times. The set of genes expressed in a particular cell
type is called its transcriptome. In the past, the transcriptome was extracted
using technologies like microarrays. However, microarrays can only report the
expression of known genes. They fail to discover novel splice variants and
novel genes. Due to the advances in sequencing technologies, we can use
RNA-seq to solve these problems. We can not only measure gene expression more
accurately, but can also discover novel genes and novel splice variants. (See
Chapter 8 for methods to analyze RNA-seq data.)
Decoding the transcriptional regulation: Some proteins called
transcription factors (TFs) bind to the genome and regulate the expression of
genes. If a TF fails to bind to the genome, the corresponding target gene will
fail to be expressed and the cell cannot function properly. For example, one
type of breast cancer is ER+ cancer. In ER+ cancer, ER, GATA3 and FoxA1
form a functional enhanceosome that regulates a set of genes and drives the
core ERα function. It is important to understand how they work together.
To know the binding sites of each TF, we can apply ChIP-seq. ChIP-seq is
a sequencing protocol that enables us to identify the binding sites of each TF
on a genome-wide scale. By studying the ChIP-seq data, we can understand
how TFs work together, the relationship between TFs and transcriptomes,
etc. (See Chapter 9 for methods to analyze ChIP-seq data.)
Many other applications: Apart from the above applications, sequencing
has been applied to many other research areas, including metagenomics,
3D modeling of the genome, etc.

1.8 Summary and further reading


This chapter summarizes the three generations of sequencing. It also briefly
describes their applications. There are a number of good surveys of
second-generation sequencing. Please refer to [200]. For more detail on
third-generation sequencing, please refer to [263].

1.9 Exercises
1. Consider the DNA sequence 5’-ACTCAGTTCG-3’. What is its reverse
complement? The SOLiD sequencer will output color-based sequences.
What is the expected color-based sequence for the above DNA sequence
and its reverse complement? Do you observe an interesting property?
2. Should we always use second- or third-generation sequencing instead of
first-generation sequencing? If not, when should we use Sanger
sequencing?
Chapter 2
NGS file formats

2.1 Introduction
NGS technologies are widely used now. To facilitate NGS data analysis
and NGS data transfer, a few NGS file formats are defined. This chapter gives
an overview of these commonly used file formats.
First, we briefly describe the NGS data analysis process. After a sample is
sequenced using an NGS sequencer, some reads (i.e., raw DNA sequences) are
generated. These raw reads are stored in raw read files, which are in the fasta,
fastq, fasta.gz or fastq.gz format. Then, these raw reads are aligned on the
reference genome (such as the human genome). The alignments of the reads
are stored in alignment files, which are in the SAM or BAM format.
From these alignment files, downstream analysis can be performed to un-
derstand the sample. For example, if the alignment files are used to call mu-
tations (like single nucleotide variants), we will obtain variant files, which are
in the VCF or BCF format. If the alignment files are used to obtain the read
density (like copy number of each genomic region), we will obtain density files,
which are in the Wiggle, BedGraph, BigWig or cWig format. If the alignment
files are used to define regions with read coverage (like regions with RNA
transcripts or regions with TF binding sites), we will obtain annotation files,
which are in the bed or bigBed format.
Figure 2.1 illustrates the relationships among these NGS file formats. In
the rest of the chapter, we detail these file formats.

[Figure: Reads (in fasta or fastq; raw data) → read mapper → alignment (in SAM or BAM) → annotator or variant caller → annotation (VCF or BCF; Bed or bigBed; Wig, BedGraph, BigWig or cWig).]

FIGURE 2.1: The relationships among different file formats.


ACTCAGCACCTTACGGCGTGCATCATCACATCAGGGACATACCAATACGGACAACCATCCCAAATATATTACGTTCTGAACGGCAGTACAAACT
(a)

ACTCAGCACCTTACGGCGTGCATCA
(b)

ACTCAGCACCTTACGGCGTGCATCA TACGTTCTGAACGGCAGTACAAACT
(c)

FIGURE 2.2: (a) An example DNA fragment. (b) A single-end read ex-
tracted from the DNA fragment in (a). (c) A paired-end read extracted from
the DNA fragment in (a).

2.2 Raw data files: fasta and fastq


NGS technologies extract reads from DNA fragments. Figure 2.2
demonstrates the process. Given a DNA fragment in Figure 2.2(a),
NGS technologies enable us to sequence either one end or both ends of the
fragment. Then, we obtain a single-end read or a paired-end read (see
Figure 2.2(b,c)).
To store the raw sequencing reads generated by a sequencing machine, two
standard file formats, fasta and fastq, are used.
The fasta file has the simplest format. Each read is described by (1) a read
identifier (a line describing the identifier of a read) and (2) the read itself (one
or more lines). Figure 2.3(a) gives an example consisting of two reads. The
line describing the read identifier must start with >. There are two standard
formats for the read identifier: (a) the Sanger standard and (b) the Illumina
standard. The Sanger standard uses free text to describe the read identifier.
The Illumina standard uses a fixed read identifier format (see Figure 2.4 for
an example).
Sequencing machines are not 100% accurate. A number of sequencers (like
the Illumina sequencer) can estimate the base calling error probability P
of each called base. People usually convert the error probability P into the
PHRED quality score Q (proposed by Ewing and Green [73]) which is com-
puted by the following equation:

$Q = -10 \log_{10} P.$

Q is truncated and limited to only integers in the range 0..93. Intuitively, if
Q is big, we have high confidence that the base called is correct. To store Q,
the value Q is offset by 33 so that it falls in the range from 33 to 126, which
are the ASCII codes of printable characters. This number is called the Q-score.
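The conversion between the error probability P, the PHRED score Q and the stored character can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the tiny constant added before truncation is an assumption used to guard against floating-point round-off.

```python
import math

def error_prob_to_phred(p: float) -> int:
    """Return Q = -10 * log10(P), truncated and limited to 0..93."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    # The small epsilon guards against values such as 29.999999999999996.
    q = int(-10.0 * math.log10(p) + 1e-9)
    return max(0, min(q, 93))

def phred_to_error_prob(q: int) -> float:
    """Invert the formula: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def phred_to_char(q: int) -> str:
    """Offset Q by 33 to obtain a printable ASCII character."""
    return chr(q + 33)

def char_to_phred(c: str) -> int:
    """Recover Q from the character stored in the file."""
    return ord(c) - 33

q = error_prob_to_phred(0.001)   # one expected error in 1000 base calls
print(q, phred_to_char(q))
```

Under this encoding, the lowest score 0 is written as "!" and the highest score 93 is written as "~".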
The file format fastq stores both the DNA bases and the corresponding
quality scores. Figure 2.3(b) gives an example. In a fastq file, each read is
described by 4 parts: (1) a line that starts with @ which contains the read
identifier, (2) the DNA read itself (one or more lines), (3) a line that starts
with + and (4) the sequence of quality scores for the read (which is of the
same length as the read).

(a)
>seq1
ACTCAGCACCTTACGGCGTGCATCA
>seq2
CCGTACCGTTGACAGATGGTTTACA

(b)
@seq1
ACTCAGCACCTTACGGCGTGCATCA
+seq1
!''**()%A54l;djgp0i345adn
@seq2
CCGTACCGTTGACAGATGGTTTACA
+seq2
#$SGl2j;askjpqo2i3nz!;lak

FIGURE 2.3: (a) is an example fasta file while (b) is an example fastq file.
Both (a) and (b) describe the same DNA sequences. Moreover, the fastq file
also stores the quality scores. Note that seq1 is the read obtained from the
DNA fragment in Figure 2.2(b).

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R   Instrument name
6               Flowcell lane
73              Tile number
941:1973        x and y coordinates of the cluster within the tile
#0              Index number for a multiplexed sample (#0 for no indexing)
/1              The member of a pair, /1 or /2 (paired-end or mate-pair reads only)

FIGURE 2.4: An example of the read identifier in the fasta or fastq file.
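The four-part record layout can be read with a short Python generator. The sketch below is not a full fastq parser: it assumes that every read occupies exactly four lines (a read that is wrapped over several lines is not handled), and the function name read_fastq is chosen only for this illustration.

```python
from typing import Iterable, Iterator, Tuple

def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed fastq record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual

# The two records of Figure 2.3(b).
example = [
    "@seq1", "ACTCAGCACCTTACGGCGTGCATCA", "+seq1", "!''**()%A54l;djgp0i345adn",
    "@seq2", "CCGTACCGTTGACAGATGGTTTACA", "+seq2", "#$SGl2j;askjpqo2i3nz!;lak",
]
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, len(seq), [ord(c) - 33 for c in qual][:5])
```

The last column printed is the list of the first five quality scores of each read, obtained by subtracting the offset 33 from the ASCII code of each quality character.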
The above description states the format for storing the single-end reads.
For paired-end reads, we use two fastq files (suffixed with _1.fastq and
_2.fastq). The two files have the same number of reads. The ith paired-end
read is formed by the ith read in the first file and the ith read in the second
file. Note that the ith read in the second file is the reverse complement of
the corresponding end of the DNA fragment.
For example, for the paired-end read in Figure 2.2(c), the sequence in the
first file is ACTCAGCACCTTACGGCGTGCATCA while the sequence in the second
file is AGTTTGTACTGCCGTTCAGAACGTA (which is the reverse complement of
TACGTTCTGAACGGCAGTACAAACT).
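The reverse complement used for the second read of a pair can be computed as follows. This is a minimal sketch; the function name is an assumption of this example, and only the letters A, C, G, T and N (in either case) are handled.

```python
# Translation table that maps each base to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

# Right end of the DNA fragment in Figure 2.2(c).
right_end = "TACGTTCTGAACGGCAGTACAAACT"
second_read = reverse_complement(right_end)
print(second_read)  # AGTTTGTACTGCCGTTCAGAACGTA
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.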
Given the raw read data, we usually need to perform a quality
check prior to further analysis. The quality check not only tells
us the quality of the sequencing library, but also enables us to
filter reads that have low complexity and to trim some positions of
a read that have a high chance of sequencing error. A number of
tools can be used to check quality. They include SolexaQA [57], FastQC
(http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) and PRINSEQ [265].

2.3 Alignment files: SAM and BAM


The raw reads (in fasta or fastq format) can be aligned on some reference
genome. For example, Figure 2.5(a) shows the alignments of a set of four reads
r1, r2, r3 and r4. r1 is a paired-end read while r2, r3 and r4 are single-end
reads. The single-end read r3 has two alignments. Chapter 4 discusses different
mapping algorithms for aligning reads.
The read alignments are usually stored using SAM or BAM format [169].
The SAM (Sequence Alignment/Map) format is a generic format for storing
alignments of NGS reads. The BAM (Binary Alignment/Map) format is the
binary equivalent of SAM. It compresses the SAM file using the BGZF library
(a generic library developed by Li et al. [169]). BGZF divides the SAM file
into blocks and compresses each block using the standard gzip. It provides
good compression while allowing fast random access.
The detailed description of SAM and BAM formats can be found in
https://samtools.github.io/hts-specs/SAMv1.pdf. Briefly, SAM is a
tab-delimited text file. It consists of a header section (optional) and an align-
ment section.
                  1         2         3         4
       1234567  8901234567890123456789012345678901234567
ref:   ATCGAAC**TGACTGGACTAGAACCGTGAATCCACTGATCTAGACTCGA
+r1/1   TCG-ACGGTGACTG
+r2       GAAC-GTGACaactg
-r3              GCCTGGttcac
-r1/2                         CCGTGAATC
+r3                             GTGAAccaggc
+r4                                 ATCC........AGACTCGA

(a)

@HD   VN:1.3  SO:coordinate
@SQ   SN:ref  LN:47
r1    99    ref   2    30   3M1D2M2I6M  =    22   29   TCGACGGTGACTG   *
r2    0     ref   4    30   4M1P1I4M5S  *    0    0    GAACGTGACAACTG  *
r3    16    ref   9    30   6M5S        *    0    0    GCCTGGTTCAC     *    NM:i:1
r1    147   ref   22   30   9M          =    2    -29  CCGTGAATC       *
r3    2048  ref   24   30   5M6H        *    2    0    GTGAA           *    NM:i:0
r4    0     ref   28   30   4M8N8M      *    0    0    ATCCAGACTCGA    *

QNAME FLAG  RNAME POS  MAPQ CIGAR       MRNM MPOS TLEN SEQ             QUAL TAG:TYPE:VAL

(b)

FIGURE 2.5: (a) An example of alignments of a few short reads. (b) The
SAM file for the alignments shown in (a).

The alignment section stores the alignments of the reads. Each row
corresponds to one alignment of a read. A read with c alignments will occupy
c rows. For a paired-end read, the alignments of its two reads will appear in
different rows. Each row has 11 mandatory fields: QNAME, FLAG, RNAME,
POS, MAPQ, CIGAR, MRNM, MPOS, TLEN, SEQ and QUAL. Optional
fields can be appended after these mandatory fields. For each field, if its value
is not available, we set it to *. We briefly describe the 11 mandatory fields
here.

• QNAME is the name of the aligned read. (In Figure 2.5, the read names
are r1, r2, r3, and r4. Note that r3 has two alignments. Hence, there
are two rows with r3. r1 is a paired-end read. Hence, there are two rows
for the two reads of r1.)

• FLAG is a 16-bit integer describing the characteristics of the alignment
of the read (see Section 2.3.1 for detail).

• RNAME is the name of the reference sequence to which the read is
aligned. (In Figure 2.5, the reference is ref.)

• POS is the leftmost position of the alignment.

• MAPQ is the mapping quality of the alignment (Phred-scaled). (In
Figure 2.5, the MAPQ scores of all alignments are 30.)

• CIGAR is a string that describes the alignment between the read and
the reference sequence (see Section 2.3.2 for detail).

• MRNM is the name of the reference sequence of its mate read (“=” if it
is the same as RNAME). For a single-end read, its value is “*” by default.

• MPOS is the leftmost position of its mate read. For a single-end read,
its value is zero by default.

• TLEN is the inferred signed observed template length (i.e., the inferred
insert size for paired-end read). For a single-end read, its value is zero
by default.

• SEQ is the query sequence on the same strand as the reference.

• QUAL is the quality score of the query sequence.

Figure 2.5(b) gives the SAM file for the example alignments in Fig-
ure 2.5(a). The following subsections give more information for FLAG and
CIGAR.
Note that SAM and BAM use two different coordinate systems. SAM uses
the 1-based coordinate system. The first base of a sequence is 1. A region is
specified by a closed interval. For example, the region between the 3rd base
and the 6th base inclusive is [3, 6]. BAM uses the 0-based coordinate system.
The first base of a sequence is 0. A region is specified by a half-closed,
half-open interval. For example, the same region (the 3rd base to the 6th base
inclusive) is [2, 6).
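Converting between the two conventions only changes the start coordinate. The helper functions below are a small sketch with names invented for this illustration. They assume that both intervals are meant to describe the same physical bases.

```python
from typing import Tuple

def to_zero_based_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based closed interval [start, end] to 0-based [start-1, end)."""
    if start < 1 or end < start:
        raise ValueError("invalid 1-based closed interval")
    return start - 1, end

def to_one_based_closed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval [start, end) to 1-based [start+1, end]."""
    if start < 0 or end <= start:
        raise ValueError("invalid 0-based half-open interval")
    return start + 1, end

# Bases 10..14 in 1-based counting occupy [9, 14) in 0-based counting.
print(to_zero_based_half_open(10, 14))
print(to_one_based_closed(9, 14))
```

A convenient property of the half-open form is that the length of the region is simply end minus start, whereas the closed form needs end minus start plus one.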
To manipulate the information from SAM and BAM files, we use
samtools [169] and bamtools [16]. Samtools is a set of tools that allows us to
interactively view the alignments in both SAM and BAM files. It also
allows us to post-process and extract information from both SAM and BAM
files. Bamtools provides a C++ API for us to access and manage the
alignments in the BAM file.

2.3.1 FLAG
The bitwise FLAG is a 16-bit integer describing the characteristics of the
alignment. Its binary format is b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0.
Only bits b0, ..., b11 are used by SAM. The meaning of each bit bi is described
in Table 2.1.
For example, for the first read of r1 in Figure 2.5(b), its flag is 99 =
0000000001100011. b0, b1, b5 and b6 are ones. This means that it has multiple
segments (b0), the segment is properly aligned (b1), the next segment maps
on the reverse complement (b5), and this is the first read in the pair (b6).
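Decoding a FLAG value amounts to testing each of the twelve bits. The following Python sketch lists which bits are set and attaches the descriptions of Table 2.1; the function names are assumptions made for this example.

```python
from typing import List

# Bit descriptions following Table 2.1 (index i corresponds to bit b_i).
FLAG_BITS = [
    "read is paired",
    "segment is a proper pair",
    "segment is unmapped",
    "mate is unmapped",
    "read maps on the reverse complement",
    "mate maps on the reverse complement",
    "first read in the pair",
    "second read in the pair",
    "secondary alignment",
    "alignment does not pass the quality control",
    "PCR duplicate",
    "supplementary alignment",
]

def set_bits(flag: int) -> List[int]:
    """Return the indices i for which bit b_i of the flag equals 1."""
    return [i for i in range(len(FLAG_BITS)) if flag & (1 << i)]

def decode_flag(flag: int) -> List[str]:
    """Return the descriptions of all bits that are set."""
    return [FLAG_BITS[i] for i in set_bits(flag)]

print(set_bits(99))       # bits of the first read of r1
for line in decode_flag(147):
    print(line)           # properties of the second read of r1
```

The flag 147 of the second read of r1 has bits b0, b1, b4 and b7 set, which mirrors the flag 99 of its mate.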

2.3.2 CIGAR string


The CIGAR string is used to describe the alignment between the read and
the reference genome. The CIGAR operations are given in Table 2.2.
For example, the CIGAR string for the first read of r1 is 3M1D2M2I6M
(see Figure 2.5), which means that the alignment between the read and the
reference genome has 3 alignment matches, 1 deletion, 2 alignment matches,
2 insertions and 6 alignment matches. (Note that the alignment match can be
a base match/mismatch.)

TABLE 2.1: The meaning of the twelve bits in the FLAG field.

Bit   Description
b0    = 1 if the read is paired
b1    = 1 if the segment is a proper pair
b2    = 1 if the segment is unmapped
b3    = 1 if the mate is unmapped
b4    = 1 if the read maps on the reverse complement
b5    = 1 if the mate maps on the reverse complement
b6    = 1 if this is the first read in the pair
b7    = 1 if this is the second read in the pair
b8    = 1 if this is a secondary alignment (i.e., the alternative alignment when multiple alignments exist)
b9    = 1 if this alignment does not pass the quality control
b10   = 1 if it is a PCR duplicate
b11   = 1 if it is a supplementary alignment (part of a chimeric alignment)
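A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below also derives two useful quantities: the number of reference bases spanned by the alignment and the length of the read stored in SEQ. The function names are illustrative assumptions. The grouping of operations follows the usual convention that M, D, N, = and X consume reference bases while M, I, S, = and X consume read bases.

```python
import re
from typing import List, Tuple

_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as 3M1D2M2I6M into (length, op) pairs."""
    if cigar == "*":
        return []  # CIGAR not available
    ops = [(int(n), op) for n, op in _CIGAR_RE.findall(cigar)]
    # Rebuild the string to make sure no character was silently skipped.
    if "".join(str(n) + op for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR string: " + cigar)
    return ops

def reference_span(ops: List[Tuple[int, str]]) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in ops if op in "MDN=X")

def query_length(ops: List[Tuple[int, str]]) -> int:
    """Number of read bases present in SEQ."""
    return sum(n for n, op in ops if op in "MIS=X")

ops = parse_cigar("3M1D2M2I6M")
print(ops)
print(reference_span(ops), query_length(ops))
```

For the first read of r1, the alignment starts at position 2 and spans 12 reference bases, so it ends at position 13, while the read TCGACGGTGACTG itself has 13 bases.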

2.4 Bed format


The bed format is a flexible way to represent and annotate a set of genomic
regions. It can be used to annotate repeat regions in the genome, open regions
or transcription factor binding sites in the genome. We can also use it to
annotate genes with different isoforms. For example, Figure 2.6 shows the bed
file for representing the two isoforms of the gene VHL.
The bed format is a row-based format, where each row represents an anno-
tated item. Each annotated item consists of some fields separated by tabs. The
first three fields of each row are compulsory. They are chromosome, chromo-
some start and chromosome end. These three fields are used to indicate the ge-
nomic regions. (Similar to BAM, bed also uses the 0-based coordinate system.
Any genomic region is specified by a half-closed, half-open interval.) There
are 9 additional optional fields to annotate each genomic region: name, score,
strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, and
blockStarts. The bed format also allows users to include more fields for each
annotated item.
TABLE 2.2: Definitions of Different CIGAR Operations.

Op   Description
M    alignment match (can be a sequence match or mismatch)
I    insertion to the reference
D    deletion from the reference
N    skipped region from the reference
S    soft clipping (clipped sequences present in SEQ)
H    hard clipping (clipped sequences NOT present in SEQ)
P    padding (silent deletion from padded reference)
=    sequence match
X    sequence mismatch

[Figure: the two VHL isoforms drawn along chr3 (axis ticks from 10,184,000 to 10,194,000), followed by the corresponding bed file.]

FIGURE 2.6: The gene VHL has two splicing variants. One of them has 3
exons while the other one has 2 exons (it is missing the middle exon). The
solid bars are the exons. The thick solid bars are the coding regions of VHL.
The bed file corresponding to these two isoforms is shown at the bottom of the
figure.

As an illustration, the two isoforms of the gene VHL are represented using
two rows in the bed format. The bottom of Figure 2.6 shows the bed file for
the two isoforms. Since both isoforms are in chr3:10183318−10195354,
chromosome, chromosome start, and chromosome end are set to be chr3,
10183318 and 10195354, respectively. Since the coding regions are in
chr3:10183531−10191649, thickStart and thickEnd are set to be 10183531 and
10191649, respectively. Since the two isoforms are shown in black, itemRgb is
set to be 0,0,0. In these two isoforms, three exons A, B and C are involved. The
sizes of the three exons are 490, 123 and 3884, respectively. Relative to the
start of the gene VHL, the starting positions of the three exons are 0, 4879
and 8152, respectively. The first isoform consists of all three exons A, B and
C while the second isoform consists of exons A and C. So, for the first
isoform, the blockCount is 3, blockSizes are 490, 123 and 3884 and blockStarts
are 0, 4879 and 8152. For the second isoform, the blockCount is 2, blockSizes
are 490 and 3884 and blockStarts are 0 and 8152.
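The block fields can be turned back into exon coordinates by adding each blockStart to the chromosome start. The Python sketch below does this for one 12-field bed row. The coordinates, block sizes and block starts come from the text above, whereas the name, score and strand values (VHL_isoform1, 0 and +) are hypothetical placeholders, as is the function name.

```python
from typing import List, Tuple

def bed12_exons(line: str) -> List[Tuple[str, int, int]]:
    """Return the exons of a 12-field bed row as 0-based half-open intervals."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated fields")
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # blockSizes and blockStarts are comma-separated (a trailing comma is allowed).
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return [(chrom, chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]

# First isoform; name, score and strand are hypothetical placeholders.
row = "\t".join([
    "chr3", "10183318", "10195354", "VHL_isoform1", "0", "+",
    "10183531", "10191649", "0,0,0", "3", "490,123,3884", "0,4879,8152",
])
exons = bed12_exons(row)
for exon in exons:
    print(exon)
```

The end of the last exon equals the chromosome end of the row (10195354), which is a consistency condition that every well-formed row of this kind should satisfy.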
Bed files are uncompressed files. They can be huge. To reduce the file size,
we can use the compressed version of the bed format, which is the bigBed
format. BigBed partitions the data into blocks of size 512 bp. Each block is
compressed using a gzip-like algorithm. Efficient querying for bigBed is im-
portant since users want to extract information without scanning the whole
file. BigBed includes an R-tree-like data structure that enables us to list, for
any interval, all annotated items in the bed file. BigBed can also report sum-
mary statistics, like mean coverage, and count the number of annotated items
within any genomic interval (with the help of the command bigBedSummary).
To manipulate the information in bed files, we use bedtools [237]. To
convert between bed format and bigBed format, we can use bedToBigBed and
bigBedToBed. To obtain a bam file from a bed file, we can use bedToBam.

2.5 Variant Call Format (VCF)


VCF [60] is the format specially designed to store the genomic variations,
which are generated by variant callers (see Chapter 6). It is the standardized
format used in the 1000 Genomes Project. Information stored in a VCF file
includes genomic coordinates, reference allele and alternate allele, etc. Since
VCF files may be big, Binary Variant Call Format (BCF) was also developed.
BCF is a binary version of VCF, which is compressed and indexed for efficient
analysis of variant calling results.
Figure 2.7 illustrates the VCF file. It consists of a header section and a data
section. The header section contains an arbitrary number of meta-information
lines (beginning with the characters ##) followed by a TAB-delimited field
definition line (beginning with the character #). The meta-information lines
describe the tags and annotations used in the data section. The field definition
line defines the columns in the data section. There are 8 mandatory fields:
CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. CHROM and POS define the
position of each genomic variation. REF gives the reference allele while ALT
gives a list of alternative non-reference alleles (separated by commas). ID is
the unique identifier of the variant, QUAL is the PHRED-scaled quality score,
FILTER stores the site filtering information and INFO stores the annotation
of the variant.

(a)
Chr1              1111111111222222  22223333333333444444444455555555556666666666777777
         1234567890123456789012345  67890123456789012345678901234567890123456789012345
REF:     ACGTACAGACAGACTTAGGACAGAT--CGTCACACTCGGACTGACCGTCACAACGGTCATCACCGGACTTACAATCG

Sample1:   GTACACACAGAC      CAGATAACGTCAC   CGGACTGACCGTCA AACGGT--------------CAATCG
             ACACACAGACTT
              CACACAGACTTA

Sample2: ACGTACAGACAG      GACAGATAACGTC    TCGGACT---CG  ACAACGGT--------------CAAT
          CGTACAGACAGA    GGACAGATT-CGT                    CAACGGT--------------CAATC
                         AGGACAGATT-CGT

(b)
##fileformat=VCFv4.2
##fileDate=20110705
##source=VCFtools
##reference=NCBI36
##ALT=<ID=DEL,Description="Deletion">
##FILTER=<ID=q10,Description="Quality below 10">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality (phred score)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF  ALT    QUAL FILTER INFO              FORMAT Sample1 Sample2
1      8   .  G    C      .    PASS   .                 GT:DP  1/1:3   0/0:2
1      25  .  T    TAA,TT .    q10    .                 GT:DP  1/1:1   1/2:3
1      40  .  TGAC T      .    PASS   .                 GT:GQ  1/1:50  0/0:70
1      55  .  T    <DEL>  .    PASS   SVTYPE=DEL;END=69 GT     1/1     1/1

FIGURE 2.7: Consider two samples. Each sample has a set of length-12
reads. Part (a) shows the alignment of those reads on the reference genome.
There are 4 regions with genomic variations. At position 8, an SNV appears in
sample 1. At position 25, two different insertions appear in both samples. At
position 40, a deletion appears in sample 2. At position 55, a deletion appears
in both samples. Part (b) shows the VCF file that describes the genomic
variations in the 4 loci.
If n samples are included in the VCF file, we have an additional field FOR-
MAT and n other fields, each representing a sample. The FORMAT field is
a colon-separated list of tags, and each tag defines some attribute of a geno-
type. For example, GT:GQ:DP corresponds to a format with three attributes:
genotype, genotype quality, and read depth of each variant. Read depth (DP)
is the number of reads covering the locus. Genotype quality (GQ) is the
phred-scaled quality score of the genotype in the sample. Genotype (GT) is
encoded as allele values separated by either / or |. The allele values can be
0 for the reference allele (in REF field), 1 for the first alternative allele (in
ALT field), 2 for the second alternative allele, and so on. For example, 0/1 is a
diploid call that consists of the reference allele and the first alternative allele.
The separators / and | mean unphased and phased genotypes, respectively.
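The FORMAT and sample columns can be decoded with a few string operations. The sketch below uses the record at position 25 of Figure 2.7 (REF T, ALT TAA,TT, FORMAT GT:DP, sample value 1/2:3); the function names are assumptions made for this example, and only the genotype syntax described above is handled.

```python
from typing import Dict, List, Optional, Tuple

def parse_sample(format_field: str, sample_field: str) -> Dict[str, str]:
    """Pair the colon-separated FORMAT tags with the values of one sample."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def decode_genotype(gt: str, ref: str, alt: str) -> Tuple[List[Optional[str]], bool]:
    """Translate allele indices into alleles; also report whether GT is phased."""
    alleles = [ref] + alt.split(",")   # index 0 is REF, 1.. are the ALT alleles
    phased = "|" in gt
    indices = gt.split("|" if phased else "/")
    # A missing allele is written as "." in the genotype.
    called = [None if i == "." else alleles[int(i)] for i in indices]
    return called, phased

sample = parse_sample("GT:DP", "1/2:3")
genotype = decode_genotype(sample["GT"], "T", "TAA,TT")
print(sample)
print(genotype)
```

Here the sample carries both alternative alleles TAA and TT, the call is unphased, and three reads cover the locus.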
To manipulate the information in VCF and BCF files, we use vcftools [60]
and bcftools [163]. (Note that VCF files use the 1-based coordinate system
while BCF files use the 0-based coordinate system.)

[Figure: density curve; x-axis: positions 1 to 90 in steps of 5, y-axis: density values 10 to 40.]

FIGURE 2.8: An example of a density curve in chr1:1−90.

2.6 Format for representing density data


Read density measures the number of reads covering each base. It is formed
by aligning the NGS reads on a reference genome; then, the number of reads
covering each base can be counted.
Density data can describe the transcript expression in RNA-seq (see
Chapter 8), the peak intensity in ChIP-seq (see Chapter 9), the open chromatin
profile (see Chapter 9), the copy number variation in whole genome sequencing
(see Chapter 7), etc.
Density data can be represented as a list of ri , where ri is the density
value for position i. Figure 2.8 shows an example list of density values. In the
example, r5 = 0, r10 = 10, r17 = 20, r20 = 30. To represent the list of ri , we
can use the bedGraph or wiggle (wig) format.
The wiggle format is a row-based format. It consists of one track definition
line (defining the format of the lines) and a few lines of data points. There
are two options: variableStep and fixedStep. Variable step data describe the
densities of data points at irregular intervals. It has an optional parameter
span s, which is the number of bases spanned with the same density for each
data point. To describe variable step data, the first line is the track definition
line, which contains information about the chromosome and the span s. Then,
it is followed by some lines, where each line contains two numbers: a position
and a density value for that position.
Fixed step data describe the densities of data points at regular intervals. To
describe fixed step data, the first line is the track definition line, which contains
the chromosome, the start b, the step δ and the span s. Then, it is followed by
some lines, where the ith line (i = 0, 1, 2, ...) contains the density value for
position b + δi, which spans s bases. The following box gives the wiggle file for
Figure 2.8. (Note that the wiggle file uses a 1-based coordinate system.)

fixedStep chrom=chr1 start=10 step=5 span=5
10
20
30
20
30
20
10
variableStep chrom=chr1 span=5
50 20
60 30
65 40
70 20
80 40

BedGraph is in a bed-like format where each line represents a genomic
region with the same density value. Each line consists of 4 fields: chromosome,
chromosome start, chromosome end, and density. (Note that bedGraph uses the
0-based coordinate format. The coordinate is described in a half-closed,
half-open interval.) The following box gives the bedGraph file for Figure 2.8.

chr1 9 14 10
chr1 14 19 20
chr1 19 24 30
chr1 24 29 20
chr1 29 34 30
chr1 34 39 20
chr1 39 44 10
chr1 49 54 20
chr1 59 64 30
chr1 64 69 40
chr1 69 74 20
chr1 79 84 40
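The relation between the two boxes can be made explicit with a small converter. The Python sketch below handles only the fixedStep and variableStep lines described in this section; the function name is an assumption of this example and the code is not a replacement for existing conversion tools. Each data point at 1-based position x with span s becomes the 0-based half-open interval [x − 1, x − 1 + s).

```python
from typing import Iterable, List, Tuple

def wiggle_to_bedgraph(lines: Iterable[str]) -> List[Tuple[str, int, int, float]]:
    """Convert wiggle lines into (chrom, start, end, density) bedGraph records."""
    records = []
    mode, chrom, span, next_pos, step = None, "", 1, 0, 1
    for raw in lines:
        tokens = raw.split()
        if not tokens or tokens[0] == "track" or tokens[0].startswith("#"):
            continue  # skip blank, track and comment lines
        if tokens[0] in ("fixedStep", "variableStep"):
            mode = tokens[0]
            params = dict(tok.split("=", 1) for tok in tokens[1:])
            chrom = params["chrom"]
            span = int(params.get("span", "1"))
            if mode == "fixedStep":
                next_pos = int(params["start"])
                step = int(params.get("step", "1"))
            continue
        if mode == "fixedStep":
            start, value = next_pos, float(tokens[0])
            next_pos += step          # the next data point is one step further
        elif mode == "variableStep":
            start, value = int(tokens[0]), float(tokens[1])
        else:
            raise ValueError("data line before a fixedStep/variableStep line")
        records.append((chrom, start - 1, start - 1 + span, value))
    return records

wiggle = """fixedStep chrom=chr1 start=10 step=5 span=5
10
20
30
20
30
20
10
variableStep chrom=chr1 span=5
50 20
60 30
65 40
70 20
80 40""".splitlines()

bedgraph = wiggle_to_bedgraph(wiggle)
for chrom, start, end, value in bedgraph:
    print(chrom, start, end, int(value))
```

Running the sketch on the wiggle box above prints the twelve lines of the bedGraph box.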

However, wiggle and bedGraph are uncompressed text formats and they
are usually huge. Hence, when we want to represent a genome-wide density
profile, we usually use the compressed version, which is in the bigWig
format [135] or cWig format [112]. Both bigWig and cWig support efficient
queries over any selected interval. Precisely, they support 4 operations. For
any chromosome interval p..q, let $r_1, \ldots, r_N$ be the density values of the
N bases in p..q that have non-zero density values. The four operations are:

• coverage(p, q): Proportion of bases that contain non-zero density values
in p..q, that is, $\frac{N}{q-p+1}$.

• mean(p, q): The mean of the density values in p..q, that is,
$\frac{1}{N} \sum_{i=1}^{N} r_i$.

• minVal(p, q) and maxVal(p, q): The minimum and maximum of the
density values in p..q, that is, $\min_{i=1..N} r_i$ and $\max_{i=1..N} r_i$.

• stdev(p, q): The standard deviation of the density values in p..q, that
is, $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (r_i - mean(p, q))^2}$.

To manipulate the information in bigWig and cWig files, we use
bwtool [234] and cWig tools [112].
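The summary operations can be illustrated with a direct, unindexed computation. The Python sketch below assumes that the density is given as a dictionary from 1-based positions to values, that positions with value zero (or no entry) count as uncovered, and that N is the number of covered bases in p..q. The function name and the toy data are made up for this example; bigWig and cWig answer the same queries with precomputed indexes rather than by scanning.

```python
import math
from typing import Dict, Optional

def interval_summary(density: Dict[int, float], p: int, q: int) -> Dict[str, Optional[float]]:
    """Compute coverage, mean, minVal, maxVal and stdev over positions p..q."""
    values = [density.get(i, 0) for i in range(p, q + 1)]
    covered = [v for v in values if v != 0]   # the N bases with non-zero density
    n = len(covered)
    if n == 0:
        return {"coverage": 0.0, "mean": None, "minVal": None,
                "maxVal": None, "stdev": None}
    mean = sum(covered) / n
    variance = sum((v - mean) ** 2 for v in covered) / n
    return {
        "coverage": n / (q - p + 1),
        "mean": mean,
        "minVal": min(covered),
        "maxVal": max(covered),
        "stdev": math.sqrt(variance),
    }

# Toy data: position 4 has no density value.
toy = {2: 2, 3: 4, 5: 6}
summary = interval_summary(toy, 2, 5)
print(summary)
```

In the toy example three of the four positions are covered, so the coverage is 0.75, and the mean of the covered values 2, 4 and 6 is 4.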

2.7 Exercises
1. Suppose chromosome 2 is the following sequence ACACGACTAA....

• For the genomic region in chromosome 2 containing the DNA
sequence ACGAC, if we describe it using bed format, what are the
chromosome, start, and end?
• In SAM, what is the alignment position of the DNA sequence
ACGAC?
• In BAM, what is the alignment position of the DNA sequence
ACGAC?

2. Please perform the following conversions.

(a) Convert the following set of intervals from the 0-based coordi-
nate format to the 1-based coordinate format: 3..100, 0..89 and
1000..2000.
(b) Convert the following set of intervals from the 1-based coordi-
nate format to the 0-based coordinate format: 3..100, 1..89 and
1000..2000.

3. Given a BAM file input.bam, we want to find all alignments with MAPQ >
0 using samtools. What should be the command?
4. Given two BED files input1.bed and input2.bed, we want to find all
genomic regions in input1.bed that overlap with some genomic regions
in input2.bed. What should be the command?
5. For the following wiggle file, can you compute coverage(3, 8),
mean(3, 8), minV al(3, 8), maxV al(3, 8) and stdev(3, 8)?

fixedStep chrom=chr1 start=1 step=1 span=1
20
10
15
30
20
25
30
20
10
30

6. Can you propose a script to convert a BAM file into a bigWig file?
Chapter 3
Related algorithms and data
structures

3.1 Introduction
This chapter discusses various algorithmic techniques and data structures
used in this book. They include:
• Recursion and dynamic programming (Section 3.2)
• Parameter estimation and the EM algorithm (Section 3.3)
• Hash data structures (Section 3.4)
• Full-text index (Section 3.5)
• Data compression techniques (Section 3.6)

3.2 Recursion and dynamic programming


Recursion (also called divide-and-conquer) and dynamic programming are
computational techniques that solve a computational problem by partitioning
it into subproblems. The development of recursion and dynamic programming
algorithms involves three stages.
1. Characterize the structure of the computational problem: Iden-
tify the subproblems of the computational problem.
2. Formulate the recursive equation: Define an equation that describes
the relationship between the problem and its subproblems.
3. Develop an algorithm: Based on the recursive equation, develop ei-
ther a recursive algorithm or a dynamic programming algorithm.
Below, we illustrate these three steps using two computational problems:
(1) a key searching problem and (2) an edit-distance problem.

35
Another Random Scribd Document
with Unrelated Content
The Project Gutenberg eBook of Mrs Dalloway
in Bond Street
Title: Mrs Dalloway in Bond Street

Author: Virginia Woolf

Release date: September 3, 2020 [eBook #63107]


Most recently updated: October 18, 2024

Language: English

Credits: Produced by Laura Natal Rodrigues at Free Literature (Images
generously made available by Hathi Trust.)

MRS DALLOWAY IN BOND STREET
BY VIRGINIA WOOLF
THE DIAL

VOLUME LXXV

July to December, 1923

THE DIAL PUBLISHING COMPANY

Mrs Dalloway said she would buy the gloves herself.
Big Ben was striking as she stepped out into the street. It was
eleven o'clock and the unused hour was fresh as if issued to children
on a beach. But there was something solemn in the deliberate swing
of the repeated strokes; something stirring in the murmur of wheels
and the shuffle of footsteps.
No doubt they were not all bound on errands of happiness. There is
much more to be said about us than that we walk the streets of
Westminster. Big Ben too is nothing but steel rods consumed by rust
were it not for the care of H. M.'s Office of Works. Only for Mrs
Dalloway the moment was complete; for Mrs Dalloway June was
fresh. A happy childhood—and it was not to his daughters only that
Justin Parry had seemed a fine fellow (weak of course on the
Bench); flowers at evening, smoke rising; the caw of rooks falling
from ever so high, down down through the October air—there is
nothing to take the place of childhood. A leaf of mint brings it back;
or a cup with a blue ring.
Poor little wretches, she sighed, and pressed forward. Oh, right
under the horses' noses, you little demon! and there she was left on
the kerb stretching her hand out, while Jimmy Dawes grinned on the
further side.
A charming woman, poised, eager, strangely white-haired for her
pink cheeks, so Scope Purvis, C. B., saw her as he hurried to his
office. She stiffened a little, waiting for Durtnall's van to pass. Big
Ben struck the tenth; struck the eleventh stroke. The leaden circles
dissolved in the air. Pride held her erect, inheriting, handing on,
acquainted with discipline and with suffering. How people suffered,
how they suffered, she thought, thinking of Mrs Foxcroft at the
Embassy last night decked with jewels, eating her heart out,
because that nice boy was dead, and now the old Manor House
(Durtnall's van passed) must go to a cousin.
"Good morning to you!" said Hugh Whitbread raising his hat rather
extravagantly by the china shop, for they had known each other as
children. "Where are you off to?"
"I love walking in London" said Mrs Dalloway. "Really it's better than
walking in the country!"
"We've just come up" said Hugh Whitbread. "Unfortunately to see
doctors."
"Milly?" said Mrs Dalloway, instantly compassionate.
"Out of sorts," said Hugh Whitbread. "That sort of thing. Dick all
right?"
"First rate!" said Clarissa.
Of course, she thought, walking on, Milly is about my age—fifty—
fifty-two. So it is probably that, Hugh's manner had said so, said it
perfectly—dear old Hugh, thought Mrs Dalloway, remembering with
amusement, with gratitude, with emotion, how shy, like a brother—
one would rather die than speak to one's brother—Hugh had always
been, when he was at Oxford, and came over, and perhaps one of
them (drat the thing!) couldn't ride. How then could women sit in
Parliament? How could they do things with men? For there is this
extraordinarily deep instinct, something inside one; you can't get
over it; it's no use trying; and men like Hugh respect it without our
saying it, which is what one loves, thought Clarissa, in dear old
Hugh.
She had passed through the Admiralty Arch and saw at the end of
the empty road with its thin trees Victoria's white mound, Victoria's
billowing motherliness, amplitude and homeliness, always ridiculous,
yet how sublime, thought Mrs Dalloway, remembering Kensington
Gardens and the old lady in horn spectacles and being told by Nanny
to stop dead still and bow to the Queen. The flag flew above the
Palace. The King and Queen were back then. Dick had met her at
lunch the other day—a thoroughly nice woman. It matters so much
to the poor, thought Clarissa, and to the soldiers. A man in bronze
stood heroically on a pedestal with a gun on her left hand side—the
South African war. It matters, thought Mrs Dalloway walking towards
Buckingham Palace. There it stood four-square, in the broad
sunshine, uncompromising, plain. But it was character she thought;
something inborn in the race; what Indians respected. The Queen
went to hospitals, opened bazaars—the Queen of England, thought
Clarissa, looking at the Palace. Already at this hour a motor car
passed out at the gates; soldiers saluted; the gates were shut. And
Clarissa, crossing the road, entered the Park, holding herself upright.
June had drawn out every leaf on the trees. The mothers of
Westminster with mottled breasts gave suck to their young. Quite
respectable girls lay stretched on the grass. An elderly man, stooping
very stiffly, picked up a crumpled paper, spread it out flat and flung it
away. How horrible! Last night at the Embassy Sir Dighton had said
"If I want a fellow to hold my horse, I have only to put up my hand."
But the religious question is far more serious than the economic, Sir
Dighton had said, which she thought extraordinarily interesting, from
a man like Sir Dighton. "Oh, the country will never know what it has
lost" he had said, talking, of his own accord, about dear Jack
Stewart.
She mounted the little hill lightly. The air stirred with energy.
Messages were passing from the Fleet to the Admiralty. Piccadilly
and Arlington Street and the Mall seemed to chafe the very air in the
Park and lift its leaves hotly, brilliantly, upon waves of that divine
vitality which Clarissa loved. To ride; to dance; she had adored all
that. Or going long walks in the country, talking, about books, what
to do with one's life, for young people were amazingly priggish—oh,
the things one had said! But one had conviction. Middle age is the
devil. People like Jack'll never know that, she thought; for he never
once thought of death, never, they said, knew he was dying. And
now can never mourn—how did it go?—a head grown grey. . . .
From the contagion of the world's slow stain . . . have drunk their
cup a round or two before. . . . From the contagion of the world's
slow stain! She held herself upright.
But how Jack would have shouted! Quoting Shelley, in Piccadilly!
"You want a pin," he would have said. He hated frumps. "My God
Clarissa! My God Clarissa!"—she could hear him now at the
Devonshire House party, about poor Sylvia Hunt in her amber
necklace and that dowdy old silk. Clarissa held herself upright for
she had spoken aloud and now she was in Piccadilly, passing the
house with the slender green columns, and the balconies; passing
club windows full of newspapers; passing old Lady Burdett Coutts'
house where the glazed white parrot used to hang; and Devonshire
House, without its gilt leopards; and Claridge's, where she must
remember Dick wanted her to leave a card on Mrs Jepson or she
would be gone. Rich Americans can be very charming. There was St
James palace; like a child's game with bricks; and now—she had
passed Bond Street—she was by Hatchard's book shop. The stream
was endless—endless—endless. Lords, Ascot, Hurlingham—what was
it? What a duck, she thought, looking at the frontispiece of some
book of memoirs spread wide in the bow window, Sir Joshua
perhaps or Romney; arch, bright, demure; the sort of girl—like her
own Elizabeth—the only real sort of girl. And there was that absurd
book, Soapy Sponge, which Jim used to quote by the yard; and
Shakespeare's Sonnets. She knew them by heart. Phil and she had
argued all day about the Dark Lady, and Dick had said straight out at
dinner that night that he had never heard of her. Really, she had
married him for that! He had never read Shakespeare! There must
be some little cheap book she could buy for Milly—Cranford of
course! Was there ever anything so enchanting as the cow in
petticoats? If only people had that sort of humour, that sort of self-
respect now, thought Clarissa, for she remembered the broad pages;
the sentences ending; the characters—how one talked about them
as if they were real. For all the great things one must go to the past,
she thought. From the contagion of the world's slow stain. . . . Fear
no more the heat o' the sun. . . . And now can never mourn, can
never mourn, she repeated, her eyes straying over the window; for it
ran in her head; the test of great poetry; the moderns had never
written anything one wanted to read about death, she thought; and
turned.
Omnibuses joined motor cars; motor cars vans; vans taxicabs;
taxicabs motor cars—here was an open motor car with a girl, alone.
Up till four, her feet tingling, I know, thought Clarissa, for the girl
looked washed out, half asleep, in the corner of the car after the
dance. And another car came; and another. No! No! No! Clarissa
smiled good-naturedly. The fat lady had taken every sort of trouble,
but diamonds! orchids! at this hour of the morning! No! No! No! The
excellent policeman would, when the time came, hold up his hand.
Another motor car passed. How utterly unattractive! Why should a
girl of that age paint black round her eyes? And a young man, with a
girl, at this hour, when the country—The admirable policeman raised
his hand and Clarissa acknowledging his sway, taking her time,
crossed, walked towards Bond Street; saw the narrow crooked
street, the yellow banners; the thick notched telegraph wires
stretched across the sky.
A hundred years ago her great-great-grandfather, Seymour Parry,
who ran away with Conway's daughter, had walked down Bond
Street. Down Bond Street the Parrys had walked for a hundred
years, and might have met the Dalloways (Leighs on the mother's
side) going up. Her father got his clothes from Hill's. There was a roll
of cloth in the window, and here just one jar on a black table,
incredibly expensive; like the thick pink salmon on the ice block at
the fishmonger's. The jewels were exquisite—pink and orange stars,
paste, Spanish, she thought, and chains of old gold; starry buckles,
little brooches which had been worn on sea green satin by ladies
with high head-dresses. But no good looking! One must economize.
She must go on past the picture dealer's where one of the odd
French pictures hung, as if people had thrown confetti—pink and
blue—for a joke. If you had lived with pictures (and it's the same
with books and music) thought Clarissa, passing the Aeolian Hall,
you can't be taken in by a joke.
The river of Bond Street was clogged. There, like a Queen at a
tournament, raised, regal, was Lady Bexborough. She sat in her
carriage, upright, alone, looking through her glasses. The white
glove was loose at her wrist. She was in black, quite shabby, yet,
thought Clarissa, how extraordinarily it tells, breeding, self-respect,
never saying a word too much or letting people gossip; an
astonishing friend; no one can pick a hole in her after all these
years, and now, there she is, thought Clarissa, passing the Countess
who waited powdered, perfectly still, and Clarissa would have given
anything to be like that, the mistress of Clarefield, talking politics,
like a man. But she never goes anywhere, thought Clarissa, and it's
quite useless to ask her, and the carriage went on and Lady
Bexborough was borne past like a Queen at a tournament, though
she had nothing to live for and the old man is failing and they say
she is sick of it all, thought Clarissa and the tears actually rose to her
eyes as she entered the shop.
"Good morning" said Clarissa in her charming voice. "Gloves" she
said with her exquisite friendliness and putting her bag on the
counter began, very slowly, to undo the buttons. "White gloves" she
said. "Above the elbow" and she looked straight into the
shopwoman's face—but this was not the girl she remembered? She
looked quite old. "These really don't fit" said Clarissa. The shop girl
looked at them. "Madame wears bracelets?" Clarissa spread out her
fingers. "Perhaps it's my rings." And the girl took the grey gloves
with her to the end of the counter.
Yes, thought Clarissa, if it's the girl I remember she's twenty years
older. . . . There was only one other customer, sitting sideways at the
counter, her elbow poised, her bare hand drooping, vacant; like a
figure on a Japanese fan, thought Clarissa, too vacant perhaps, yet
some men would adore her. The lady shook her head sadly. Again
the gloves were too large. She turned round the glass. "Above the
wrist" she reproached the grey-headed woman; who looked and
agreed.
They waited; a clock ticked; Bond Street hummed, dulled, distant;
the woman went away holding gloves. "Above the wrist" said the
lady, mournfully, raising her voice. And she would have to order
chairs, ices, flowers, and cloak-room tickets, thought Clarissa. The
people she didn't want would come; the others wouldn't. She would
stand by the door. They sold stockings—silk stockings. A lady is
known by her gloves and her shoes, old Uncle William used to say.
And through the hanging silk stockings quivering silver she looked at
the lady, sloping shouldered, her hand drooping, her bag slipping,
her eyes vacantly on the floor. It would be intolerable if dowdy
women came to her party! Would one have liked Keats if he had
worn red socks? Oh, at last—she drew into the counter and it
flashed into her mind:
"Do you remember before the war you had gloves with pearl
buttons?"
"French gloves, Madame?"
"Yes, they were French" said Clarissa. The other lady rose very sadly
and took her bag, and looked at the gloves on the counter. But they
were all too large—always too large at the wrist.
"With pearl buttons" said the shop-girl, who looked ever so much
older. She split the lengths of tissue paper apart on the counter. With
pearl buttons, thought Clarissa, perfectly simple—how French!
"Madame's hands are so slender" said the shop girl, drawing the
glove firmly, smoothly, down over her rings. And Clarissa looked at
her arm in the looking glass. The glove hardly came to the elbow.
Were there others half an inch longer? Still it seemed tiresome to
bother her—perhaps the one day in the month, thought Clarissa,
when it's an agony to stand. "Oh, don't bother" she said. But the
gloves were brought.
"Don't you get fearfully tired" she said in her charming voice,
"standing? When d'you get your holiday?"
"In September, Madame, when we're not so busy."
When we're in the country thought Clarissa. Or shooting. She has a
fortnight at Brighton. In some stuffy lodging. The landlady takes the
sugar. Nothing would be easier than to send her to Mrs Lumley's
right in the country (and it was on the tip of her tongue). But then
she remembered how on their honeymoon Dick had shown her the
folly of giving impulsively. It was much more important, he said, to
get trade with China. Of course he was right. And she could feel the
girl wouldn't like to be given things. There she was in her place. So
was Dick. Selling gloves was her job. She had her own sorrows quite
separate, "and now can never mourn, can never mourn" the words
ran in her head, "From the contagion of the world's slow stain"
thought Clarissa holding her arm stiff, for there are moments when it
seems utterly futile (the glove was drawn off leaving her arm flecked
with powder)—simply one doesn't believe, thought Clarissa, any
more in God.
The traffic suddenly roared; the silk stockings brightened. A
customer came in.
"White gloves," she said, with some ring in her voice that Clarissa
remembered.
It used, thought Clarissa, to be so simple. Down down through the
air came the caw of the rooks. When Sylvia died, hundreds of years
ago, the yew hedges looked so lovely with the diamond webs in the
mist before early church. But if Dick were to die to-morrow as for
believing in God—no, she would let the children choose, but for
herself, like Lady Bexborough, who opened the bazaar, they say, with
the telegram in her hand—Roden, her favourite, killed—she would
go on. But why, if one doesn't believe? For the sake of others, she
thought, taking the glove in her hand. This girl would be much more
unhappy if she didn't believe.
"Thirty shillings" said the shopwoman. "No, pardon me Madame,
thirty-five. The French gloves are more."
For one doesn't live for oneself, thought Clarissa.
And then the other customer took a glove, tugged it, and it split.
"There!" she exclaimed.
"A fault of the skin," said the grey-headed woman hurriedly.
"Sometimes a drop of acid in tanning. Try this pair, Madame."
"But it's an awful swindle to ask two pound ten!"
Clarissa looked at the lady; the lady looked at Clarissa.
"Gloves have never been quite so reliable since the war" said the
shop-girl, apologizing, to Clarissa.
But where had she seen the other lady?—elderly, with a frill under
her chin; wearing a black ribbon for gold eyeglasses; sensual, clever,
like a Sargent drawing. How one can tell from a voice when people
are in the habit, thought Clarissa, of making other people—"It's a
shade too tight" she said—obey. The shopwoman went off again.
Clarissa was left waiting. Fear no more she repeated, playing her
finger on the counter. Fear no more the heat o' the sun. Fear no
more she repeated. There were little brown spots on her arm. And
the girl crawled like a snail. Thou thy worldly task hast done.
Thousands of young men had died that things might go on. At last!
Half an inch above the elbow; pearl buttons; five and a quarter. My
dear slow coach, thought Clarissa, do you think I can sit here the
whole morning? Now you'll take twenty-five minutes to bring me my
change!
There was a violent explosion in the street outside. The shopwomen
cowered behind the counters. But Clarissa, sitting very upright,
smiled at the other lady. "Miss Anstruther!" she exclaimed.