Algorithms For Next-Generation Sequencing 1st Edition Wing-Kin Sung

ALGORITHMS FOR
NEXT-GENERATION SEQUENCING

Wing-Kin Sung
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170421

International Standard Book Number-13: 978-1-4665-6550-0 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

1 Introduction
1.1 DNA, RNA, protein and cells
1.2 Sequencing technologies
1.3 First-generation sequencing
1.4 Second-generation sequencing
1.4.1 Template preparation
1.4.2 Base calling
1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides
1.4.4 Polymerase-mediated methods based on unmodified nucleotides
1.4.5 Ligase-mediated method
1.5 Third-generation sequencing
1.5.1 Single-molecule real-time sequencing
1.5.2 Nanopore sequencing method
1.5.3 Direct imaging of DNA using electron microscopy
1.6 Comparison of the three generations of sequencing
1.7 Applications of sequencing
1.8 Summary and further reading
1.9 Exercises

2 NGS file formats
2.1 Introduction
2.2 Raw data files: fasta and fastq
2.3 Alignment files: SAM and BAM
2.3.1 FLAG
2.3.2 CIGAR string
2.4 Bed format
2.5 Variant Call Format (VCF)
2.6 Format for representing density data
2.7 Exercises

3 Related algorithms and data structures
3.1 Introduction
3.2 Recursion and dynamic programming
3.2.1 Key searching problem
3.2.2 Edit-distance problem
3.3 Parameter estimation
3.3.1 Maximum likelihood
3.3.2 Unobserved variable and EM algorithm
3.4 Hash data structures
3.4.1 Maintain an associative array by simple hashing
3.4.2 Maintain a set using a Bloom filter
3.4.3 Maintain a multiset using a counting Bloom filter
3.4.4 Estimating the similarity of two sets using minHash
3.5 Full-text index
3.5.1 Suffix trie and suffix tree
3.5.2 Suffix array
3.5.3 FM-index
3.5.3.1 Inverting the BWT B to the original text T
3.5.3.2 Simulate a suffix array using the FM-index
3.5.3.3 Pattern matching
3.5.4 Simulate a suffix trie using the FM-index
3.5.5 Bi-directional BWT
3.6 Data compression techniques
3.6.1 Data compression and entropy
3.6.2 Unary, gamma, and delta coding
3.6.3 Golomb code
3.6.4 Huffman coding
3.6.5 Arithmetic code
3.6.6 Order-k Markov Chain
3.6.7 Run-length encoding
3.7 Exercises

4 NGS read mapping
4.1 Introduction
4.2 Overview of the read mapping problem
4.2.1 Mapping reads with no quality score
4.2.2 Mapping reads with a quality score
4.2.3 Brute-force solution
4.2.4 Mapping quality
4.2.5 Challenges
4.3 Align reads allowing a small number of mismatches
4.3.1 Mismatch seed hashing approach
4.3.2 Read hashing with a spaced seed
4.3.3 Reference hashing approach
4.3.4 Suffix trie-based approaches
4.3.4.1 Estimating the lower bound of the number of mismatches
4.3.4.2 Divide and conquer with the enhanced pigeonhole principle
4.3.4.3 Aligning a set of reads together
4.3.4.4 Speed up utilizing the quality score
4.4 Aligning reads allowing a small number of mismatches and indels
4.4.1 q-mer approach
4.4.2 Computing alignment using a suffix trie
4.4.2.1 Computing the edit distance using a suffix trie
4.4.2.2 Local alignment using a suffix trie
4.5 Aligning reads in general
4.5.1 Seed-and-extension approach
4.5.1.1 BWA-SW
4.5.1.2 Bowtie 2
4.5.1.3 BatAlign
4.5.1.4 Cushaw2
4.5.1.5 BWA-MEM
4.5.1.6 LAST
4.5.2 Filter-based approach
4.6 Paired-end alignment
4.7 Further reading
4.8 Exercises

5 Genome assembly
5.1 Introduction
5.2 Whole genome shotgun sequencing
5.2.1 Whole genome sequencing
5.2.2 Mate-pair sequencing
5.3 De novo genome assembly for short reads
5.3.1 Read error correction
5.3.1.1 Spectral alignment problem (SAP)
5.3.1.2 k-mer counting
5.3.2 Base-by-base extension approach
5.3.3 De Bruijn graph approach
5.3.3.1 De Bruijn assembler (no sequencing error)
5.3.3.2 De Bruijn assembler (with sequencing errors)
5.3.3.3 How to select k
5.3.3.4 Additional issues of the de Bruijn graph approach
5.3.4 Scaffolding
5.3.5 Gap filling
5.4 Genome assembly for long reads
5.4.1 Assemble long reads assuming long reads have a low sequencing error rate
5.4.2 Hybrid approach
5.4.2.1 Use mate-pair reads and long reads to improve the assembly from short reads
5.4.2.2 Use short reads to correct errors in long reads
5.4.3 Long read approach
5.4.3.1 MinHash for all-versus-all pairwise alignment
5.4.3.2 Computing consensus using Falcon Sense
5.4.3.3 Quiver consensus algorithm
5.5 How to evaluate the goodness of an assembly
5.6 Discussion and further reading
5.7 Exercises

6 Single nucleotide variation (SNV) calling
6.1 Introduction
6.1.1 What are SNVs and small indels?
6.1.2 Somatic and germline mutations
6.2 Determine variations by resequencing
6.2.1 Exome/targeted sequencing
6.2.2 Detection of somatic and germline variations
6.3 Single locus SNV calling
6.3.1 Identifying SNVs by counting alleles
6.3.2 Identify SNVs by binomial distribution
6.3.3 Identify SNVs by Poisson-binomial distribution
6.3.4 Identifying SNVs by the Bayesian approach
6.4 Single locus somatic SNV calling
6.4.1 Identify somatic SNVs by the Fisher exact test
6.4.2 Identify somatic SNVs by verifying that the SNVs appear in the tumor only
6.4.2.1 Identify SNVs in the tumor sample by posterior odds ratio
6.4.2.2 Verify if an SNV is somatic by the posterior odds ratio
6.5 General pipeline for calling SNVs
6.6 Local realignment
6.7 Duplicate read marking
6.8 Base quality score recalibration
6.9 Rule-based filtering
6.10 Computational methods to identify small indels
6.10.1 Split-read approach
6.10.2 Span distribution-based clustering approach
6.10.3 Local assembly approach
6.11 Correctness of existing SNV and indel callers
6.12 Further reading
6.13 Exercises

7 Structural variation calling
7.1 Introduction
7.2 Formation of SVs
7.3 Clinical effects of structural variations
7.4 Methods for determining structural variations
7.5 CNV calling
7.5.1 Computing the raw read count
7.5.2 Normalize the read counts
7.5.3 Segmentation
7.6 SV calling pipeline
7.6.1 Insert size estimation
7.7 Classifying the paired-end read alignments
7.8 Identifying candidate SVs from paired-end reads
7.8.1 Clustering approach
7.8.1.1 Clique-finding approach
7.8.1.2 Confidence interval overlapping approach
7.8.1.3 Set cover approach
7.8.1.4 Performance of the clustering approach
7.8.2 Split-mapping approach
7.8.3 Assembly approach
7.8.4 Hybrid approach
7.9 Verify the SVs
7.10 Further reading
7.11 Exercises

8 RNA-seq
8.1 Introduction
8.2 High-throughput methods to study the transcriptome
8.3 Application of RNA-seq
8.4 Computational Problems of RNA-seq
8.5 RNA-seq read mapping
8.5.1 Features used in RNA-seq read mapping
8.5.1.1 Transcript model
8.5.1.2 Splice junction signals
8.5.2 Exon-first approach
8.5.3 Seed-and-extend approach
8.6 Construction of isoforms
8.7 Estimating expression level of each transcript
8.7.1 Estimating transcript abundances when every read maps to exactly one transcript
8.7.2 Estimating transcript abundances when a read maps to multiple isoforms
8.7.3 Estimating gene abundance
8.8 Summary and further reading
8.9 Exercises

9 Peak calling methods
9.1 Introduction
9.2 Techniques that generate density-based datasets
9.2.1 Protein DNA interaction
9.2.2 Epigenetics of our genome
9.2.3 Open chromatin
9.3 Peak calling methods
9.3.1 Model fragment length
9.3.2 Modeling noise using a control library
9.3.3 Noise in the sample library
9.3.4 Determination if a peak is significant
9.3.5 Unannotated high copy number regions
9.3.6 Constructing a signal profile by Kernel methods
9.4 Sequencing depth of the ChIP-seq libraries
9.5 Further reading
9.6 Exercises

10 Data compression techniques used in NGS files
10.1 Introduction
10.2 Strategies for compressing fasta/fastq files
10.3 Techniques to compress identifiers
10.4 Techniques to compress DNA bases
10.4.1 Statistical-based approach
10.4.2 BWT-based approach
10.4.3 Reference-based approach
10.4.4 Assembly-based approach
10.5 Quality score compression methods
10.5.1 Lossless compression
10.5.2 Lossy compression
10.6 Compression of other NGS data
10.7 Exercises

References

Index
Preface

Next-generation sequencing (NGS) is a recently developed technology enabling
us to generate hundreds of billions of DNA bases from the samples. We can
use NGS to reconstruct the genome, understand genomic variations, recover
transcriptomes, and identify the transcription factor binding sites or the
epigenetic marks.
The NGS technology radically changes how we collect genomic data from
the samples. Instead of studying a particular gene or a particular genomic
region, NGS technologies enable us to perform unbiased genome-wide studies.
Although more raw data can be obtained from sequencing machines, we face
computational challenges in analyzing such big datasets. Hence, it is
important to develop efficient and accurate computational methods to analyze or
process such datasets. This book is intended to give an in-depth introduction
to such algorithmic techniques.
The primary audience of this book is advanced undergraduate students
and graduate students from mathematics or computer science
departments. We assume that readers have some training in college-level
biology, statistics, discrete mathematics and algorithms.
This book was developed partly from the teaching material for the course
on Combinatorial Methods in Bioinformatics, which I taught at the National
University of Singapore, Singapore. The chapters in this book are classified
based on the application domains of the NGS technologies. In each chapter, a
brief introduction to the technology is first given. Then, different methods or
algorithms for analyzing such NGS datasets are described. To illustrate each
algorithm, detailed examples are given. At the end of each chapter, exercises
are given to help readers understand the topics.
Chapter 1 introduces the next-generation sequencing technologies. We
cover the three generations of sequencing, starting from Sanger sequencing
(first generation). Then, we cover second-generation sequencing, which
includes Illumina Solexa sequencing. Finally, we describe the third-generation
sequencing technologies which include PacBio sequencing and nanopore
sequencing.
Chapter 2 introduces a few NGS file formats, which facilitate downstream
analysis and data transfer. They include fasta, fastq, SAM, BAM, BED, VCF
and WIG formats. Fasta and fastq are file formats for describing the raw
sequencing reads generated by the sequencers. SAM and BAM are file formats
for describing the alignments of the NGS reads on the reference genome. BED,
VCF and WIG formats are annotation formats.
To develop methods for processing NGS data, we need efficient algorithms
and data structures. Chapter 3 is devoted to briefly describing these
techniques.
Chapter 4 studies read mappers. Read mappers align the NGS reads on
the reference genome. The input is a set of raw reads in fasta or fastq files.
The read mapper will align each raw read on the reference genome, that is,
identify the region in the reference genome which is highly similar to the read.
Then, the read mapper will output all these alignments in a SAM or BAM
file. This is the basic step for many NGS applications. (It is the first step for
the methods in Chapters 6–9.)
Chapter 5 studies the de novo assembly problem. Given a set of raw reads
extracted from whole genome sequencing of some sample genome, de novo
assembly aims to stitch the raw reads together to reconstruct the genome.
It enables us to reconstruct novel genomes such as those of plants and bacteria.
De novo assembly involves a few steps: error correction, contig assembly (de Bruijn
graph approach or base-by-base extension approach), scaffolding and gap
filling. This chapter describes techniques developed for these steps.
Chapter 6 discusses the problem of identifying single nucleotide variations
(SNVs) and small insertions/deletions (indels) in an individual genome. The
genome of every individual is highly similar to the reference human genome.
However, each genome is still different from the reference genome. On average,
there is 1 single nucleotide variation in every 3000 bases and 1 small indel in
every 1000 bases. To discover these variations, we can first perform whole
genome sequencing or exome sequencing of the individual genome to obtain
a set of raw reads. After aligning the raw reads on the reference genome, we
use SNV callers and indel callers to call SNVs and small indels. This chapter
is devoted to discussing techniques used in SNV callers and indel callers.
Apart from SNVs and small indels, copy number variations (CNVs) and
structural variations (SVs) are the other types of variations that appear in our
genome. CNVs and SVs are not as frequent as SNVs and indels. Moreover, they
are more prone to change the phenotype. Hence, it is important to understand
them. Chapter 7 is devoted to studying techniques used in CNV callers and
SV callers.
All of the above technologies are related to genome sequencing. We can also
sequence RNA. This technology is known as RNA-seq. Chapter 8 studies
methods for analyzing RNA-seq. By applying computational methods on RNA-seq,
we can recover the transcriptome. More precisely, RNA-seq enables us to
identify exons and splice junctions. Then, we can predict the isoforms of the genes.
We can also determine the expression of each transcript and each gene.
By combining chromatin immunoprecipitation and next-generation
sequencing, we can sequence genome regions that are bound by some
transcription factors or that carry epigenetic marks. Such technology is known as
ChIP-seq. The computational methods that identify those binding sites are known
as ChIP-seq peak callers. Chapter 9 is devoted to discussing computational
methods for this purpose.
As stated earlier, NGS data is huge, and NGS data files are usually
large, which makes them difficult to store and transfer. One solution is to
compress the NGS data files. A number of compression methods have
been developed, and some of the compression formats, such as BAM, bigBed and
bigWig, are used frequently in the literature. Chapter 10 aims to describe
these compression techniques. We also describe techniques that enable us to
randomly access the compressed NGS data files.
Supplementary material can be found at
http://www.comp.nus.edu.sg/~ksung/algo_in_ngs/.
I would like to thank my PhD supervisors Tak-Wah Lam and Hing-Fung Ting
and my collaborators Francis Y. L. Chin, Kwok Pui Choi, Edwin Cheung,
Axel Hillmer, Wing Kai Hon, Jansson Jesper, Ming-Yang Kao,
Caroline Lee, Nikki Lee, Hon Wai Leong, Alexander Lezhava, John Luk,
See-Kiong Ng, Franco P. Preparata, Yijun Ruan, Kunihiko Sadakane, Chialin Wei,
Limsoon Wong, Siu-Ming Yiu, and Louxin Zhang. My knowledge of NGS and
bioinformatics was enriched through numerous discussions with them. I would
like to thank Ramesh Rajaby, Kunihiko Sadakane, Chandana Tennakoon,
Hugo Willy, and Han Xu for helping to proofread some of the chapters. I
would also like to thank my parents Kang Fai Sung and Siu King Wong, my
three brothers Wing Hong Sung, Wing Keung Sung, and Wing Fu Sung, my
wife Lily Or, and my three kids Kelly, Kathleen and Kayden for their support.
Finally, if you have any suggestions for improvement or if you identify any
errors in the book, please send an email to me at [email protected]. I
thank you in advance for your helpful comments in improving the book.

Wing-Kin Sung
Chapter 1
Introduction

DNA stands for deoxyribonucleic acid. It was first discovered in 1869 by
Friedrich Miescher [58]. However, it was not until 1944 that Avery, MacLeod
and McCarty [12] demonstrated that DNA is the major carrier of genetic
information, not protein. In 1953, James Watson and Francis Crick discovered
the basic structure of DNA, which is a double helix [310]. After that, people
started to work on DNA intensively.
DNA sequencing sprang to life in 1972, when Frederick Sanger (at the
University of Cambridge, England) began to work on the genome sequence using
a variation of the recombinant DNA method. The full DNA sequence of a viral
genome (bacteriophage φX174) was completed by Sanger in 1977 [259, 260].
Based on the power of sequencing, Sanger established genomics,¹ which is the
study of the entirety of an organism’s hereditary information, encoded in DNA
(or RNA for certain viruses). Note that it is different from molecular biology
or genetics, whose primary focus is to investigate the roles and functions of
single genes.
Over the last few decades, DNA sequencing has improved rapidly. We can
now sequence the whole human genome within a day and compare multiple
individual human genomes. This book is devoted to understanding the bioinformatics
issues related to DNA sequencing. In this introduction, we briefly review DNA,
RNA and protein. Then, we describe various sequencing technologies. Lastly,
we describe the applications of sequencing technologies.

1.1 DNA, RNA, protein and cells


Deoxyribonucleic acid (DNA) is used as the genetic material (with the
exception that certain viruses use RNA as the genetic material). The basic
building block of DNA is the DNA nucleotide. There are 4 types of DNA
nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). The DNA
nucleotides can be chained together to form a strand of DNA. Each strand of
DNA is asymmetric. It begins at the 5′ end and ends at the 3′ end.

¹ The actual term “genomics” is thought to have been coined by Dr. Tom Roderick, a
geneticist at the Jackson Laboratory (Bar Harbor, ME) at a meeting held in Maryland on
the mapping of the human genome in 1986.

5′ A  C   G   T  A  G   C   T  3′
   || ||| ||| || || ||| ||| ||
3′ T  G   C   A  T  C   G   A  5′

FIGURE 1.1: The double-stranded DNA. The two strands show a complementary
base pairing.

When two opposing DNA strands satisfy the Watson-Crick rule, they can
be interwoven together by hydrogen bonds and form a double-stranded DNA.
The Watson-Crick rule (or complementary base pairing rule) requires that the
two nucleotides in opposing strands be a complementary base pair, that is,
they must be an (A, T) pair or a (C, G) pair. (Note that A=T and C≡G are
bound by two and three hydrogen bonds, respectively.) Figure 1.1
gives an example double-stranded DNA. One strand is ACGTAGCT while the
other strand is its reverse complement, i.e., AGCTACGT.
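The pairing rule can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function name reverse_complement is an illustrative choice. It applies the Watson-Crick rule to each base and reverses the result so that the opposing strand is also written from its 5’ end to its 3’ end.

```python
# Watson-Crick pairing rule: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the opposing strand, written in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("ACGTAGCT"))  # AGCTACGT
```

Applying the function twice returns the original strand, which reflects the symmetry of the pairing rule.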
The double-stranded DNAs are located in the nucleus (and mitochondria)
of every cell. A cell can contain multiple pieces of double-stranded DNA, each
of which is called a chromosome. As a whole, the collection of chromosomes is called a
genome; the human genome consists of 23 pairs of chromosomes, and its total
length is roughly 3 billion base pairs.
The genome provides the instructions for the cell to perform daily life
functions. Through the process of transcription, the machine RNA polymerase
transcribes genes (the basic functional units) in our genome into transcripts
(or RNA molecules). This process is known as gene expression. The complete
set of transcripts in a cell is denoted as its transcriptome.
Each transcript is a chain of 4 different ribonucleic acid (RNA) nucleotides:
adenine (A), guanine (G), cytosine (C) and uracil (U). The main difference be-
tween the DNA nucleotide and the RNA nucleotide is that the RNA nucleotide
has an extra OH group. This extra OH group enables the RNA nucleotide to
form more hydrogen bonds. Transcripts are usually single stranded instead of
double stranded.
There are two types of transcripts: non-coding RNA (ncRNA) and messenger
RNA (mRNA). ncRNAs are transcripts that are not translated into proteins.
They can be classified into transfer RNAs (tRNAs), ribosomal RNAs (rRNAs),
short ncRNAs (of length < 30 bp, including miRNA, siRNA and piRNA) and
long ncRNAs (of length > 200 bp; examples include Xist and HOTAIR).
mRNA is the intermediate between DNA and protein. Each mRNA con-
sists of three parts: a 5’ untranslated region (a 5’ UTR), a coding region and
a 3’ untranslated region (3’ UTR). The length of the coding region is a
multiple of 3. It is a sequence of triplets of nucleotides called codons. Each
codon corresponds to an amino acid.
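The codon structure can be illustrated with a short Python sketch. It is not from the original text: the names CODON_TABLE and translate are illustrative, and the table is the standard genetic code written in the usual compact form (bases ordered T, C, A, G; "*" marks a stop codon). The sketch assumes the input is only the coding region, without the UTRs.

```python
from itertools import product

# Standard genetic code in compact form: codons are enumerated with the
# bases in the order T, C, A, G, and "*" denotes a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_region):
    """Translate a coding region (RNA or DNA letters) codon by codon."""
    seq = coding_region.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("the coding region length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


print(translate("AUGGCCUGGUAA"))  # MAW*
```

The example input consists of the four codons AUG, GCC, UGG and UAA, which correspond to methionine, alanine, tryptophan and a stop signal.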
Through translation, the machine ribosome translates each mRNA into a
Introduction 3

protein, which is the sequence of amino acids corresponding to the sequence of
codons in the mRNA. Proteins form complex 3D structures. Each protein is
a biological nanomachine that performs a specialized function. For example,
enzymes are proteins that work as catalysts to promote chemical reactions
for generating energy or digesting food. Other proteins, called transcription
factors, interact with the genome to turn transcription on or off. Through
the interaction among DNA, RNA and protein, our genome dictates which
cells should grow, when cells should die and how cells should be structured,
and it directs the creation of the various body parts.
All cells in our body are developed from a single cell through cell division.
When a cell divides, the double-helix genome is separated into single-stranded
DNA molecules. An enzyme called DNA polymerase uses each single-stranded
DNA molecule as the template to replicate the genome into two identical
double helices. By this replication process, all cells within the same individual
will have the same genome. However, due to errors in copying, some variations
(called mutations) might happen in some cells. Those variations or mutations
may cause diseases such as cancer.
Different individuals have similar genomes, but they also have genome
variations that contribute to different phenotypes. For example, the colors of
our hair and our eyes are controlled by the differences in our genomes. By
studying and comparing genomes of different individuals, researchers develop
an understanding of the factors that cause different phenotypes and diseases.
Such knowledge ultimately helps to gain insights into the mystery of life and
contributes to improving human health.

1.2 Sequencing technologies


DNA sequencing is a process that determines the order of the nucleotide
bases. It translates the DNA of a specific organism into a format that is deci-
pherable by researchers and scientists. DNA sequencing has allowed scientists
to better understand genes and their roles within our body. Such knowledge
has become indispensable for understanding biological processes, as well as in
application fields such as diagnostic or forensic research. The advent of DNA
sequencing has significantly accelerated biological research and discovery.
To facilitate genomic studies, we need to sequence the genomes of different
species or different individuals. A number of sequencing technologies have
been developed during the last decades. Roughly speaking, the development
of the sequencing technologies consists of three phases:

First-generation sequencing: Sequencing based on chemical degradation
and gel electrophoresis.

Second-generation sequencing: Sequencing many DNA fragments in parallel.
It has higher yield, lower cost, but shorter reads.

Third-generation sequencing: Sequencing a single DNA molecule without
the need to halt between read steps.

In this section, we will discuss the three phases in detail.

1.3 First-generation sequencing


Sanger and Coulson proposed first-generation sequencing in 1975 [259,
260]. It enables us to sequence a DNA template of length 500–1000 bp within a
few hours. The detailed steps are as follows (see Figure 1.3).

1. Amplify the DNA template by cloning.

2. Generate all possible prefixes of the DNA template.

3. Separation by electrophoresis.

4. Readout with fluorescent tags.

Step 1 amplifies the DNA template. The DNA template is inserted into
the plasmid vector; then the plasmid vector is inserted into the host cells for
cloning. By growing the host cells, we obtain many copies of the same DNA
template.
Step 2 generates all possible prefixes of the DNA template. Two tech-
niques have been proposed for this step: (1) the Maxam-Gilbert technique [194]
and (2) the chain termination methodology (Sanger method) [259, 260]. The
Maxam-Gilbert technique relies on the chemical cleavage of nucleotides.
Four different chemicals are used to generate all sequences ending with A, C, G
and T, respectively. This allows us to generate all possible prefixes of the
template. This technique is most efficient for short DNA sequences. However, it
is considered unsafe because of the extensive use of toxic chemicals.
The chain termination methodology (Sanger method) is a better alter-
native. Given a single-stranded DNA template, the method performs DNA
polymerase-dependent synthesis in the presence of (1) natural deoxynu-
cleotides (dNTPs) and (2) dideoxynucleotides (ddNTPs). ddNTPs serve as
non-reversible synthesis terminators (see Figure 1.2(a,b)). The DNA synthesis
reaction is randomly terminated whenever a ddNTP is added to the growing
oligonucleotide chain, resulting in truncated products of varying lengths with
an appropriate ddNTP at their 3’ terminus.
After we obtain all possible prefixes of the DNA template, the product is
a mixture of DNA fragments of different lengths. We can separate these DNA
[Figure 1.2 diagram: (a) growing strand + dATP → strand extended by A + H+ + PPi; (b) growing strand + ddATP → strand extended by a terminating A + H+ + PPi.]

FIGURE 1.2: (a) The chemical reaction for the incorporation of dATP into
the growing DNA strand. (b) The chemical reaction for the incorporation of
ddATP into the growing DNA strand. The vertical bar behind A indicates
that the extension of the DNA strand is terminated.

[Figure 1.3 diagram: DNA template → insert into vector → insert into host cell →
cloning → cyclic sequencing → electrophoresis & readout.]
3’-GCATCGGCATATG...-5’
5’-CGTA
Terminated products, from shortest to longest:
CGTAG
CGTAGC
CGTAGCC
CGTAGCCG
CGTAGCCGT
CGTAGCCGTA
CGTAGCCGTAT
CGTAGCCGTATA
CGTAGCCGTATAC
Readout: GCCGTATAC

FIGURE 1.3: The steps of Sanger sequencing.

fragments by their lengths using gel electrophoresis (Step 3). Gel electrophore-
sis is based on the fact that DNA is negatively charged. When an electrical
field is applied to a mixture of DNA on a gel, the DNA fragments will move
from the negative pole to the positive pole. Due to friction, short DNA frag-
ments travel faster than long DNA fragments. Hence, the gel electrophoresis
separates the mixture into bands, each containing DNA molecules of the same
length.
Using the fluorescent tags attached to the terminal ddNTPs (we have
4 different colors for the 4 different ddNTPs), the DNA fragments ending
with different nucleotides will be labeled with different fluorescent dyes. By
detecting the light emitted from different bands, the DNA sequence of the
template will be revealed (Step 4).
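Steps 2 to 4 can be mimicked in a few lines of Python. This is a simplified sketch under stated assumptions, not a model of the chemistry: the function name sanger_readout is illustrative, every possible terminated product is assumed to be present exactly once, and the template and the growing strand are the ones shown in Figure 1.3.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_readout(template_3_to_5, primer):
    """Return the bases read after the primer, in the order of the gel bands."""
    # The full extension product (5'->3') pairs base by base with the
    # template when the template is read from its 3' end.
    full_product = "".join(COMPLEMENT[base] for base in template_3_to_5)
    if not full_product.startswith(primer):
        raise ValueError("the primer must pair with the 3' end of the template")
    # Step 2: each ddNTP-terminated product is a prefix of the full product.
    fragments = [
        full_product[:length]
        for length in range(len(primer) + 1, len(full_product) + 1)
    ]
    # Step 3: electrophoresis separates the mixture by fragment length.
    fragments.sort(key=len)
    # Step 4: the dye on the terminal ddNTP reveals the last base of each band.
    return "".join(fragment[-1] for fragment in fragments)


print(sanger_readout("GCATCGGCATATG", "CGTA"))  # GCCGTATAC
```

Reading the bands from the shortest fragment to the longest gives the sequence that follows the primer, which is the readout shown in Figure 1.3.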
In summary, the Sanger method can generate sequences of length 800 bp.
The process can be fully automated and hence it was a popular DNA sequencing
method in 1970–2000. However, it is expensive and the throughput is
low. It can only process a limited number of DNA fragments per unit of time.

1.4 Second-generation sequencing


Second-generation sequencing can generate hundreds of millions of short
reads per instrument run. When compared with first-generation sequencing,
it has the following advantages: (1) it uses clone-free amplification, and (2) it
can sequence many reads in parallel. Some commercially available technologies
include Roche/454, Illumina, ABI SOLiD, Ion Torrent, Helicos BioSciences
and Complete Genomics.
In general, second-generation sequencing involves the following two main
steps: (1) Template preparation and (2) base calling in parallel. The following
Section 1.4.1 describes Step 1 while Section 1.4.2 describes Step 2.

1.4.1 Template preparation

Given a set of DNA fragments, the template preparation step first gener-
ates a DNA template for each DNA fragment. The DNA template is created
by ligating adaptor sequences to the two ends of the target DNA fragment (see
Figure 1.4(a)). Then, the templates are amplified using PCR. There are two
common methods for amplifying the templates: (1) emulsion PCR (emPCR)
and (2) solid-phase amplification (Bridge PCR).
emPCR amplifies each DNA template on a bead. First of all, one piece of
DNA template and a bead are enclosed within a water drop in oil. The surface
of every bead is coated with a primer corresponding to one type of adaptor.
The DNA template will hybridize with one primer on the surface of the bead.
Then, it is PCR-amplified within the water drop. Figure 1.4(b) illustrates
the emPCR. emPCR is used by 454, Ion Torrent and SOLiD.
For bridge PCR, the amplification is done on a flat surface (say, glass),
which is coated with two types of primers, corresponding to the adaptors.
Each DNA template is first hybridized to one primer on the flat surface.
Amplification proceeds in cycles, with one end of each bridge tethered to the
surface. Figure 1.4(c) illustrates the bridge PCR process. Bridge PCR is used
by Illumina.
Although PCR can amplify DNA templates, there is amplification bias.
Experiments revealed that templates that are AT-rich or GC-rich have a lower
amplification efficiency. This limitation creates uneven sequencing of the DNA
templates in the sample.
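A simple way to spot templates that may be affected by this bias is to compute their GC content. The Python sketch below is only an illustration: the function names and the thresholds 0.3 and 0.7 are hypothetical choices, not values taken from the text or from any experiment.

```python
def gc_content(template):
    """Fraction of the bases in the template that are G or C."""
    if not template:
        raise ValueError("the template must not be empty")
    template = template.upper()
    return sum(base in "GC" for base in template) / len(template)


def flag_extreme_gc(templates, low=0.3, high=0.7):
    """Return the templates whose GC fraction lies outside [low, high].

    The default thresholds are arbitrary placeholders for illustration.
    """
    return [t for t in templates if not low <= gc_content(t) <= high]


print(flag_extreme_gc(["ATATATATAT", "ACGTACGT", "GCGCGCGC"]))
```

Here the first template is AT-rich and the last one is GC-rich, so both are flagged, while the balanced template in the middle is not.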
[Figure 1.4 panel labels: (b) templates and beads; water drop in oil; template binds to the bead; PCR for a few rounds.]

FIGURE 1.4: (a) From the DNA fragments, DNA template is created by
attaching the two ends with adaptor sequences. (b) Amplifying the template
by emPCR. (c) Amplifying the template by bridge PCR.

1.4.2 Base calling


Now we have many PCR clones of amplified templates (see Figure 1.5(a)).
This step aims to read the DNA sequences from the amplified templates in
parallel. This method is called the cyclic-array method. There are two ap-
proaches: the polymerase-mediated method (also called sequencing by syn-
thesis) and the ligase-mediated method (also called sequencing by ligation).
The polymerase-mediated method is further divided into methods based on re-
versible terminator nucleotides and methods based on unmodified nucleotides.
Below, we will discuss these approaches.

1.4.3 Polymerase-mediated methods based on reversible terminator nucleotides

A reversible terminator nucleotide is a modified nucleotide. As with
ddNTPs, DNA synthesis is terminated when a reversible terminator nucleotide
is incorporated during the DNA polymerase-dependent synthesis. Unlike with
ddNTPs, however, we can reverse the termination and restart the DNA
synthesis.
Figure 1.5(b) demonstrates how we use reversible terminator nucleotides
for sequencing. First, we hybridize the primer on the adaptor of the template.
Then, by DNA polymerase, a reversible terminator nucleotide is incorporated
onto the template. After that, we scan the signal of the dye attached to the
[Figure 1.5 panel labels: (a) PCR clone; (b) add reversible terminator dGTP; wash & scan; after scanning, reverse the termination; repeat the steps to sequence other bases.]

FIGURE 1.5: Polymerase-mediated sequencing methods based on reversible
terminator nucleotides. (a) PCR clones of the DNA templates are evenly
distributed on a flat surface. Each PCR clone contains many DNA templates of
the same type. (b) The steps of polymerase-mediated methods based on
reversible terminator nucleotides.

reversible terminator nucleotide by imaging. After imaging, the termination
is reversed by cleaving the dye-nucleotide linker. By repeating the steps, we
can sequence the complete DNA template.
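The cycle of incorporating, scanning and cleaving can be sketched as a loop. The Python code below is an idealized illustration, not a description of any instrument: the dye names are hypothetical placeholders, every cycle is assumed to succeed, and the template is given from its 3’ end so that it is read in the order in which the polymerase meets it.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Hypothetical dye colors, one per reversible terminator nucleotide.
DYE_OF_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_OF_DYE = {dye: base for base, dye in DYE_OF_BASE.items()}


def crt_sequence(template_3_to_5, cycles):
    """Simulate error-free four-color cyclic reversible termination."""
    read = []
    for position in range(min(cycles, len(template_3_to_5))):
        # Incorporate: only the complementary terminator is added, and
        # synthesis stops after this single base.
        incorporated = COMPLEMENT[template_3_to_5[position]]
        # Wash and scan: the dye color identifies the incorporated base.
        observed_dye = DYE_OF_BASE[incorporated]
        read.append(BASE_OF_DYE[observed_dye])
        # Cleave: removing the dye and the terminator allows the next cycle.
    return "".join(read)


print(crt_sequence("CTGCA", 5))  # GACGT
```

Because each cycle adds exactly one base, the read length equals the number of cycles, which is why the number of scanning and washing rounds grows with the read length.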
Two commercial sequencers use this approach. They are Illumina and He-
licos BioSciences.
The Illumina sequencer amplifies the DNA templates by bridge PCR.
Then, all PCR clones are distributed on the glass plate. By using the four-
color cyclic reversible termination (CRT) cycle (see Figure 1.6(b)), we can
sequence all the DNA templates in parallel.
The major error type of Illumina sequencing is substitution, with a higher
proportion of errors occurring when the previously incorporated nucleotide is
a G.
Another major limitation of Illumina sequencing is that the accuracy decreases
with increasing nucleotide addition steps. The errors accumulate due to the
failure in cleaving off the fluorescent tags or due to errors in controlling the
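[Figure 1.6 panel labels: (a) two example PCR clones with their DNA templates; (b) four-color CRT cycles; (c) one-color CRT cycles with nucleotide flows A, C, G, T, A, C, G, T.]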

FIGURE 1.6: Polymerase-mediated sequencing methods based on reversible
terminator nucleotides. (a) A flat surface with many PCR clones. In particular,
we show the DNA templates for two clones. (b) Four-color cyclic reversible
termination (CRT) cycle. Within each cycle, we extend the template of each
PCR clone by one base. The color indicates the extended base. Precisely, the
four colors, dark gray, black, white and light gray, correspond to the four
nucleotides A, C, G and T, respectively. (c) One-color cyclic reversible termination
(CRT) cycle. Each cycle tries to extend the template of each PCR clone by
one particular base. If the extension is successful, the white color lights
up.
reversible terminator nucleotides. Then, bases fail to get incorporated to the
template strand or extra bases might get incorporated [190].
Helicos BioSciences does not perform PCR amplification. It is a
single-molecule sequencing method. It first immobilizes the DNA template on the
flat surface. Then, all DNA templates on the surface are sequenced in parallel
by using a one-color cyclic reversible termination (CRT) cycle (see
Figure 1.6(c)). Note that this technology can also be used to sequence RNA
directly by using reverse transcriptase instead of DNA polymerase. However,
the reads generated by Helicos BioSciences are very short (about 25 bp). It is also
slow and expensive.

1.4.4 Polymerase-mediated methods based on unmodified nucleotides

The previous methods require the use of modified nucleotides. Actually,
we can sequence the DNA templates using unmodified nucleotides. The basic
observation is that the incorporation of a deoxyribonucleotide triphosphate
(dNTP) into a growing DNA strand involves the formation of a covalent bond
and the release of pyrophosphate and a positively charged hydrogen ion (see
Figure 1.2). Hence, it is possible to sequence the DNA template by detecting
the concentration change of pyrophosphate or hydrogen ion. Roche 454 and
Ion Torrent are two sequencers which take advantage of this principle.
The Roche 454 sequencer performs sequencing by detecting pyrophos-
phates. It is called pyrosequencing. First, the 454 sequencer uses emPCR to
amplify the templates. Then, amplified beads are loaded into an array of wells.
(Each well contains one amplified bead which corresponds to one DNA template.)
In each iteration, a single type of dNTP flows across the wells. If the
dNTP is complementary to the template in a well, polymerase will extend by
one base and release pyrophosphate. With the help of the enzymes sulfurylase
and luciferase, the pyrophosphate is converted into visible light. The CCD
camera detects the light signal from all wells in parallel. For each well, the
light intensity generated is recorded as a flowgram. For example, if the DNA template
in a well is TCGGTAAAAAACAGTTTCCT, Figure 1.7 is the corresponding
flowgram. Precisely, the light signal can be detected only when the dNTP that
flows across the well is complementary to the template. If the template has
a homopolymer of length k, the light intensity detected is k-fold higher. By
interpreting the flowgram, we can recover the DNA sequence.
However, when the homopolymer is long (say longer than 6), the detector
is not sensitive enough to report the correct length of the homopolymer.
Therefore, the Roche 454 sequencer has a higher rate of indel errors.
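The relation between a sequence and its flowgram, and the effect of a detector that saturates, can be sketched in Python. The sketch makes several assumptions that are not in the text: the function names are illustrative, the signal is an ideal integer per flow, the sequence is treated as the strand read out over the flows A, C, G, T (as the axis of Figure 1.7 suggests), and saturation is modeled crudely by clipping the signal at a hypothetical maximum.

```python
from itertools import cycle


def flowgram(read, flow_order="ACGT", n_flows=20):
    """Ideal signal per flow: the homopolymer length extended in that flow."""
    signals, position = [], 0
    flows = cycle(flow_order)
    for _ in range(n_flows):
        base = next(flows)
        run = 0
        # A flow extends the strand across the whole homopolymer at once.
        while position < len(read) and read[position] == base:
            run += 1
            position += 1
        signals.append(run)
    return signals


def decode(signals, flow_order="ACGT", max_detectable=None):
    """Rebuild the read; optionally clip signals to mimic a saturated detector."""
    flows = cycle(flow_order)
    bases = []
    for signal in signals:
        base = next(flows)
        if max_detectable is not None:
            signal = min(signal, max_detectable)
        bases.append(base * signal)
    return "".join(bases)


read = "TCGGTAAAAAACAGTTTCCT"
signals = flowgram(read)
print(signals)
print(decode(signals) == read)
print(len(decode(signals, max_detectable=5)))
```

With an ideal detector the read is recovered exactly. With the clipped detector the run of six A's is reported as five, so the decoded read is one base short, which is a deletion error of the kind described above.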
Ion Torrent was created by the same person as Roche 454. It is the first
semiconductor sequencing chip available on the commercial market. Instead of
detecting pyrophosphate, it performs sequencing by detecting hydrogen ions.
The basic method of Ion Torrent is the same as that of Roche 454. It also uses
emPCR to amplify the templates and the amplified beads are also loaded into
[Figure 1.7 plot: y-axis intensity (1 to 6); x-axis nucleotide flows ACGTACGTACGTACGTACGT.]

FIGURE 1.7: The flowgram for the DNA sequence TCGGTAAAAAACAGTTTCCT.

a high-density array of wells, and each well contains one template. In each
iteration, a single type of dNTP flows across the wells. If the dNTP is
complementary to the template, polymerase will extend by one base and release H+.
The release of H+ changes the pH of the solution in the well and an ISFET
sensor at the bottom of the well measures the pH change and converts it
into electric signals [251]. The sensor avoids the use of optical measurements,
which require a complicated camera and laser. This is the main difference
between Ion Torrent sequencing and 454 sequencing. The unattached dNTP
molecules are washed out before the next iteration. By interpreting the flow-
gram obtained from the ISFET sensor, we can recover the sequences of the
templates.
Since the method used by Ion Torrent is similar to that of Roche 454, it
also has the disadvantage that it cannot distinguish long homopolymers.

1.4.5 Ligase-mediated method


Instead of extending the template base by base using polymerase, ligase-
mediated methods use probes to check the bases on the template. ABI SOLiD
is the commercial sequencer that uses this approach. In SOLiD, the templates
are first amplified by emPCR. After that, millions of templates are placed on
a plate. SOLiD then tries to probe the bases of all templates in parallel. In
every iteration, SOLiD probes two adjacent bases of each template, i.e., it uses
two-base color encoding. The color coding scheme is shown in the following
table. For example, the DNA template ATGGA is coded as A3102.

    A  C  G  T
A   0  1  2  3
C   1  0  3  2
G   2  3  0  1
T   3  2  1  0
The primary advantage of the two-base color encoding is that it improves
the single nucleotide variation (SNV) calling. Since every base is covered by
two color bases, it reduces the error rate for calling SNVs. However, conversion
from color bases to nucleotide bases is not simple. Errors may be generated
during the conversion process.
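The color table and the conversion between color space and base space can be sketched in Python. The sketch is illustrative and its function names are not from the text. It relies on an observation about the table: if A, C, G and T are numbered 0, 1, 2 and 3, each color equals the bitwise XOR of the numbers of the two adjacent bases.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def color(first, second):
    """Color of two adjacent bases; matches the table entry by entry."""
    return BASES.index(first) ^ BASES.index(second)


def encode(sequence):
    """Keep the first base, then emit one color per adjacent pair of bases."""
    colors = (str(color(a, b)) for a, b in zip(sequence, sequence[1:]))
    return sequence[0] + "".join(colors)


def decode(color_read):
    """Convert back to bases, each one derived from the previous base."""
    bases = [color_read[0]]
    for digit in color_read[1:]:
        bases.append(BASES[BASES.index(bases[-1]) ^ int(digit)])
    return "".join(bases)


print(encode("ATGGA"))   # A3102
print(decode("A3102"))   # ATGGA
print(encode("ATCGA"))   # A3232: a real SNV changes two adjacent colors
print(decode("A3002"))   # ATTTC: one wrong color corrupts all later bases
```

The last two lines illustrate both points made above. A true single nucleotide variation changes two adjacent colors, so an isolated color change is likely a sequencing error. On the other hand, a naive conversion of a read with one wrong color yields wrong bases from that position onward.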
In summary, second-generation sequencing enables us to generate hundreds
of billions of bases per run. However, each run takes days to finish due to a
large number of scanning and washing cycles. The addition of a base in each
cycle is not 100% accurate. This causes sequencing errors. Furthermore, base
extensions of some strands may lag behind or run ahead. Hence, errors accumulate
as the reads get longer. This is the reason why second-generation sequencing
cannot produce very long reads. Furthermore, due to the PCR amplification bias,
this approach may miss some templates with high or low GC content.

1.5 Third-generation sequencing


Although many of us are still using second-generation sequencing, third-
generation sequencing is coming. There is no fixed definition for third-
generation sequencing yet. Here, we define it as a single molecule sequencing
(SMS) technology without the need to halt between read steps (whether enzy-
matic or otherwise). A number of third-generation sequencing methods have
been proposed. They include:

Single-molecule real-time sequencing

Nanopore-sequencing technologies

Direct imaging of individual DNA molecules using advanced microscopy techniques

1.5.1 Single-molecule real-time sequencing


Pacific BioSciences released their PacBio RS sequencing platform [71].
Their approach is called single-molecule real-time (SMRT) sequencing. It mimics
what happens in our body as cells divide and copy their DNA with the
DNA polymerase machine. Precisely, PacBio RS immobilizes DNA polymerase
molecules on an array slide. When the DNA template gets in touch with the
DNA polymerase, DNA synthesis happens with four fluorescently labeled nu-
cleotides. By detecting the light emitted, PacBio RS reconstructs the DNA
sequences. Figure 1.8 illustrates the SMRT sequencing approach.
PacBio RS sequencing requires no prior amplification of the DNA template.
Hence, it has no PCR bias. It can achieve more uniform coverage and lower GC
bias when compared with Illumina sequencing [79]. It can read long sequences

FIGURE 1.8: The illustration of PacBio sequencing. On an array slide,
there are a number of immobilized DNA polymerase molecules. When a DNA
template gets in touch with the DNA polymerase (see the polymerase at the
bottom right), DNA synthesis happens with the fluorescently labeled
nucleotides. By detecting the emitted light signal, we can reconstruct the
DNA sequence.

of length up to 20,000 bp, with an average read length of about 10,000 bp.
Another advantage of PacBio RS is that it can sequence methylation status
simultaneously.
However, PacBio sequencing is more costly. It is about 3–4 times more
expensive than short read sequencing. Also, PacBio RS has a high error rate,
up to 17.9% [46]. The majority of the errors are indel errors [71]. Luckily,
the error rate is unbiased and almost constant throughout the entire read
length [146]. By repeatedly sequencing the same DNA template, we can reduce
the error rate.
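The benefit of repeated sequencing can be estimated with a rough calculation. The Python sketch below is a simplified model and not the method used by any sequencer: it assumes that passes over the same position fail independently with the same probability, that each pass is simply right or wrong at that position (indel alignment is ignored), and, pessimistically, that all wrong passes agree with each other. The function name majority_error is illustrative.

```python
from math import comb


def majority_error(per_pass_error, n_passes):
    """Probability that more than half of n independent passes are wrong.

    Intended for odd n_passes, so that a strict majority always exists.
    """
    needed = n_passes // 2 + 1
    return sum(
        comb(n_passes, k)
        * per_pass_error ** k
        * (1 - per_pass_error) ** (n_passes - k)
        for k in range(needed, n_passes + 1)
    )


for n in (1, 3, 5, 9):
    print(n, majority_error(0.179, n))
```

Under these assumptions the consensus error falls steadily as the number of passes grows, because the errors are unbiased and a majority of passes is increasingly unlikely to be wrong at the same position.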

1.5.2 Nanopore sequencing method


A nanopore is a pore of nano size on a thin membrane. When a voltage
is applied across the membrane, charged molecules that are small enough can
move from the negative well to the positive well. Moreover, molecules with
different structures will have different efficiencies in passing through the pore
and affect the electrical conductivity. By studying the electrical conductivity,
we can determine the molecules that pass through the pore.
This idea has been used in a number of methods for sequencing DNA.
These methods are called nanopore sequencing methods. Since nanopore
methods use unmodified DNA, they require an extremely small amount of input
material. They also have the potential to sequence long DNA reads efficiently
at low cost. There are a number of companies working on the nanopore se-
quencing method. They include (1) Oxford Nanopore, (2) IBM Transistor-
mediated DNA sequencing, (3) Genia and (4) NABsys.
Oxford nanopore technology detects nucleotides by measuring the ionic
current flowing through the pore. It allows the single-strand DNA sequence to
FIGURE 1.9: An illustration of the sequencing technique of Oxford
nanopore.

flow through the pore continuously. As illustrated in Figure 1.9, DNA material
is placed in the top chamber. The positive charge draws a strand of DNA
from the top chamber to the bottom chamber through the
nanopore. By detecting the difference in the electrical conductivity in the
pore, the DNA sequence is decoded. (Note that IBM’s DNA transistor is a
prototype which uses a similar idea.)
The approach has difficulty in calling the individual base accurately. In-
stead, Oxford nanopore technology will read the signal of k (say 5) bases
in each round. Then, using a hidden Markov model, the DNA base can be
decoded base by base.
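The idea of decoding bases from k-mer signals with a hidden Markov model can be sketched with a small Viterbi decoder in Python. This is a generic illustration and not the actual Oxford nanopore base caller: k is set to 3 to keep the example small, the table LEVEL of mean signal levels per k-mer is entirely hypothetical, the noise is assumed Gaussian, and all allowed transitions are assumed equally likely.

```python
from itertools import product

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
# Hypothetical signal model: every k-mer has its own mean signal level.
LEVEL = {kmer: float(index) for index, kmer in enumerate(KMERS)}


def simulate(sequence):
    """Noise-free signal: one level per k-mer as the strand moves one base."""
    return [LEVEL[sequence[i:i + K]] for i in range(len(sequence) - K + 1)]


def viterbi_decode(signal, sigma=0.5):
    """Most likely base sequence given one signal value per k-mer."""
    if not signal:
        return ""

    def log_emission(kmer, value):
        # Gaussian log-likelihood up to a constant.
        return -((value - LEVEL[kmer]) ** 2) / (2 * sigma ** 2)

    score = {kmer: log_emission(kmer, signal[0]) for kmer in KMERS}
    backpointers = []
    for value in signal[1:]:
        new_score, pointer = {}, {}
        for kmer in KMERS:
            # A predecessor must overlap this k-mer in K-1 bases.
            candidates = (base + kmer[:-1] for base in "ACGT")
            best_prev = max(candidates, key=score.__getitem__)
            new_score[kmer] = score[best_prev] + log_emission(kmer, value)
            pointer[kmer] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(score, key=score.__getitem__)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    # The first k-mer gives K bases; each later k-mer adds its last base.
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


sequence = "ACGTTGCAAC"
print(viterbi_decode(simulate(sequence)) == sequence)
```

The overlap constraint between consecutive k-mers is what lets the decoder recover individual bases even though each measurement depends on k bases at once.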
Oxford nanopore technology has announced two sequencers: MinION and
GridION. MinION is a disposable USB-key sequencer. GridION is an
expandable sequencer. Oxford nanopore technology claimed that GridION can
sequence 30x coverage of a human genome in 6 days at US$2200–$3600. It
has the potential to decode a DNA fragment of length 100,000 bp. Its cost is
about US$25–$40 per gigabase. Although it is not expensive, the error rate is
about 17.8% (4.9% insertion error, 7.8% deletion error and 5.1% substitution
error) [115].
Unlike Oxford nanopore technology, Genia suggested combining nanopore
and the DNA polymerase to sequence a single-strand DNA template. In Genia,
the DNA polymerase is tethered with a biological nanopore. When a DNA
template gets in touch with the DNA polymerase, DNA synthesis happens
with four engineered nucleotides for A, C, G and T, each attached with a
different short tag. When a nucleotide is incorporated into the DNA template,
the tag is cleaved and it will travel through the biological nanopore and an
electric signal is measured. Since different nucleotides have different tags, we
can reconstruct the DNA template by measuring the electric signals.
NABsys is another nanopore sequencer. It first chops the genome into DNA
fragments of length 100,000 bp. The DNA fragments are hybridized with a
particular probe so that specific short DNA sequences on the DNA fragments
brought into the office by a soldier, named Cope, and charged with
attacking him, with an intent to commit an unnatural crime. He
stated, that the preceding evening, about half past eleven o’clock,
he was stationed in the Park, under the wall of Marlborough-
gardens; when this gentleman came up to him, and after some
conversation, attempted to put his hand in his breeches; he
thereupon seized him, pushed him into his box, and kept him there
till the relief guard came;—that the gentleman did every thing he
could to persuade him to let him go; he offered him his watch, and
purse, containing seven or eight guineas, which he refused, and
marched him down to the guard-room. This was the charge; now,
intelligent reader, mark the defence!
On the magistrate’s asking the gentleman (who, to the disgrace of
the profession, was an eminent Conveyancer) what could bring him
into the Park at that hour, and in the most inclement evening that
was ever felt?—he answered, that he was obliged to go to
Buckingham-house to see a lady, from whom he wanted some
information respecting the subject of a letter he was writing, and
which he could not finish until he had procured it.
Sir William Addington,
“You could not have been admitted into Buckingham-house at that
time of night, especially for the purpose of visiting a lady!—have you
that letter about you?”
To which he replied he had only just begun it.
Sir William Addington.—“Then, Sir, send for it in the state you left it.”
The answer was, ‘I do not recollect where I put it.’
Sir W. A.—“But after you had quitted Buckingham-house, admitting
you had been there, how came you in that part of the Park where
this soldier was stationed?”
‘I intended to have gone through the stable yard, but having missed
it, I thought of going out at the Priory-gate.’
Sir W. A.—“Then why did you not go out there? for this soldier’s box
was considerably beyond that place!”—
Here a respectable gentleman of the profession came in, and very
judiciously put an end to the examination, by entering into a
recognizance for his client’s appearance: to whom the magistrate
observed, he never heard so clear a case of guilt, even from the
defendant’s own account.
Previous to the ensuing Westminster sessions Cope was sent on the
Windsor duty; whether by collusion, or by the common routine of
duty never transpired, further than strong suspicion upon the case.
However, from either Cause, Cope, if he had possessed the ability of
following up his charge by an indictment, had no opportunity; and,
of course, there the matter of complaint ended. But the dreadful
wretch, not content with his good fortune in escaping the Pillory, in
order to cleanse his pestiferous character, indicted Cope for falsely,
maliciously, and diabolically charging him with an attempt to commit
an unnatural crime.—Thunderstruck, when I heard of the audacious
attempt to legalize such an atrocious offence, I went, in company
with Mr. Bond, to hear the trial, if a trial it could be called:—but, to
get rid of the odious and painful relation, it is only necessary to add,
the poor ill-fated fellow was convicted, upon the testimony of the
vile old Sodomite; and, if my memory is correct, the sentence of the
Court was, that he should be confined in Tothill-fields Bridewell for
five years, and stand in the Pillory once every year during that time.
I may not be correct as to the number of times he stood in the
Pillory, but I know it was more than once.
When this phenomenon of a trial was ended, Mr. Bond, the
magistrate, and myself walked out of the court together, who
exclaimed—“what do you think of this conviction?” To which I
answered, I think as you do, that the jury ought to be taken out of
the box, and hanged at the hall door. However, what can be said?
but, alas, poor human nature! judges are but men, and juries
subject to fallability! It may be supposed, that the prosecutor
received every assistance that could be had: the late Mr. Bearcroft
appeared, to give him a character: a character for what?—why, that
he paid his debts, and that he would not pick a pocket; and that he
was never accused of committing a rape! which seems to be the
amount of character in every case, where a wretch is either
prosecutor or defendant on similar occasions. However, we will now
come to the touch-stone, which will, in a great measure, decide the
purity of this old lecher’s character. It seems the horrid sacrifice
made to purify his reputation, was not, in the opinion of sensible and
discerning men, perfectly satisfactory, whether his penchant for
breeches was not paramount to his affection for petticoats? And
even Mr. Bearcroft was not without a slender portion of scepticism
on the subject: for he remarked, in a company where some doubts
were entertained which of the two ought to have been put in the
pillory, Cope, or his prosecutor, “D—n the fellow! now I think of it, I
never remember his having a girl at college!”—But, to put all doubts
at rest, respecting this diabolical wretch’s guilt, it is only necessary
to state, that he retired into the country, and attacked a farmer’s
son; who seized, and took him before a magistrate, when he gave
bail: but, foreseeing that he had a man to deal with who was under
no military command, nor subject to any impediment for want of
money, and too inflexible for any chance of being diverted from his
steady purpose of prosecution, he wisely withdrew himself from
England—to which place he has never returned.
I shall now conclude my observations on these hideous transactions
with a relation of the unexampled oppression that Cook and his wife
have been the objects of. With respect to his wife, she could not
have participated in any transactions of her husband; she did not
live in the odious house, or even visit it; and, therefore, her
sufferings are not warranted by any plea:—and, indeed, whatever
the man’s offences might be, he has most amply purged himself
from all the criminal effects of it; and is surely entitled to the
protection of the same law to guard his innocence, that was exerted
to punish his offence. That I may be clearly understood in what I
deem acts of oppression, I must state the ground that gave pretence
to the proceedings against him.
The reader will recollect that Cook had been desirous of making a
disclosure of the transactions for which he, and the others were
convicted, with the names and rank in life of a great number of
persons implicated; not only at his own house, but at many others,
both private and public, the common depots of organized
Sodomites; and having been defeated of this intention at the
Secretary of State’s office, published the following Prospectus, in the
form of a handbill, and distributed them among his customers, both
noble and ignoble.

Handbill.
“Preparing for the Press, and speedily will be published,
“An exhibition of the gambols practised by the ancient lechers of
Sodom and Gomorrah; embellished and improved with the
modern refinements in Sodomitical practices, by the Members of
the Vere-street Coterie of detestable memory.
“The facts in this publication are given and substantiated on
oath, without regard to the rank or situation in life of the several
actors in this diabolical drama. Peers, Footmen, and Foot-soldiers,
will be held up to the indignation of mankind (aye and
womankind too) in the several characters they acted, the
hideousness of whose transactions petrify the reader.
“James Cook, the author of this pamphlet, with shame,
confesses that he was an eye and ear witness to the execrable
scenes of iniquity he relates, and, but for the damning proofs
that put to confusion every attempt at contradiction, he would
despair of gaining credit for an hundredth part of the odious
tale he has to unfold. He is now in Newgate, undergoing a
punishment for keeping the rendezvous of the miscreants; a
punishment which ought, in strict justice, to have fallen to the
lot of the characters whose infamy will shortly come under the
cognizance of the public.
“N.B. It is said that the ghost of White, the drummer boy, lately
executed for Sodomy, pays his nocturnal visits to old Moggy, the
rump-rider, Park-street, exclaiming,
“Monster! amidst the din of infernal howl
The fiends in hell will scramble for thy soul.”

The foregoing intimation caused some temporary abatement of
whiskered dalliance; and the softer passions gave way to the spirit of
revenge: the first fruits of which was a detainer for twenty-three
pounds debt, lodged against him on the eve of his enlargement,
after two year’s imprisonment, of which he had received a week’s
notice:—the attorney who issued out the writ was applied to, and
requested to give it to a Sheriff’s officer, to whom a sufficient bail-
bond should be given, although the defendant denied owing the
money. Such of my readers as are acquainted with the rules of
practice will conclude with me, that the ends of justice, and the
completion of an attorney’s duty would have been answered by this
means; or, at least, any attorney who had not been nursed by a
Siberian Tartar, would have thought himself justified by acceding to
the proposition; but it was made to a wretch,

“whose parent was a rock,
And fierce Hercanean tigers gave him birth.”

However, I see by the roll of attorneys, that the gentleman in
question stands alone; and that Providence seems satisfied with the
mischief done, by placing him there solo!—In my anxiety to procure
food for my next number of legalized cormorancy, I made some
inquiry respecting the general practice of this virtuous one, et
cetera:—the answer I received from one professional gentleman was
as satisfactory as laconic: “Did you ever see him?” I replied in the
negative:—then, said he, “his heart is as ugly as his face; and that is
as ugly as Kit Chrishop’s!”
Cook’s brother, a poor hard-working man, happened to have about
forty pounds in his possession, a sum devoted to pay for some
timber, to carry on his business as a bedstead-maker; and who,
under the agonizing affliction of a brother in such a predicament, in
order to facilitate his enlargement, deposited the amount of the
debt, and ten pounds costs, in the Sheriff’s office, until the return of
the writ, when it was intended to justify bail, and try the cause. I
will now account for Cook’s being defeated in a remedy provided
even by an Act of Parliament.
Being at liberty, and without a shilling to help himself, he had
recourse to several wretches who, in the burning frenzy of their love
fits, and to gratify the appetites of their languishing enamoratas,
ordered suppers, and run into other expences for which they were
not provided with the means of prompt payment, and which they
never thought fit to discharge during his two year’s confinement: he,
therefore, called upon some few of his equivocal gender customers,
to remind them that the dues arising out of their gallantries had not
been discharged. This led him, in company with his wife, to the
house of a Mr. Stewart, to enquire after his friend and neighbour Mr.
—, a ci-devant Reverend, who, it seems, had been unfrocked during
his probation in Newgate. Upon Cook’s name being announced, the
old hen began clucking, and caused a momentary cessation of
amorous billing with the salacious brood within. Cook, not being
able to get any satisfaction, walked away; but was followed by a
champion for the sacred rights of Sodom, who, at the distance of
two streets, came up to them, seized the woman by the arm, and,
after twisting her round, knocked her down!—My reader will be apt
to think that this was beginning at the wrong end of a manly
contest; but it must be remembered that the woman was the most
offensive object of the two; for Cook, having breeches on, possessed
the possibility of a reconciliation; but with a petticoat, no terms of
accommodation could take place: and in this case the antipathy was
widened; for Mrs. Cook is really a good-looking, nay, a pretty
woman. However J. Shenston, whether John or Jane I know not,
received from Cook a knock-down blow in (his) turn; with the token
of which clinging to his nose and mouth he ran back to Moggy
Stewart’s; and, after a fainting fit or two had been recovered,
Shenston assumed his male character, and posted away to
Marlborough-street; where, upon the usual terms, he obtained a
warrant against both Cook and his wife, for a pretended assault
committed upon him.
Reader! as the business becomes a little more serious, I must crave
a little unprejudiced attention. The warrant being executed in the
evening, I undertook to produce the parties next morning, before
the magistrates; they, of course, willingly appeared; when the hardy
villain related the particulars of the assault committed upon him, as
if he had never heard of the crime of perjury, or the punishment
annexed to it: however the two defendants were of course ordered
to find bail, which they did. And now begins a scene of oppression
sufficient to make a man believe that the actors were impatient for
their share of pillories and gibbets. When these two ill-fated people
were before the magistrates, Mr. Baker asked Shenstone what he
was?—he replied, a gentleman’s servant out of place. “How do you
get your living?” was the next question. ‘I am supported by my
friends at Birmingham:’ Cook, not chusing to let this Birmingham-
story, pass, gave his town-made description in the following
workman-like manner; “You are maintained by a set of Sodomites!
these are your friends that support you!”—Now, mark how this
business was reported in the public papers:
“Cook, the cidevant keeper of the Swan, in Vere-street, was brought
before the magistrates, for assaulting a respectable tradesman, with
an intent to extort money from him.” The account then stated the
names and places of residence of his bail; and that Cook had been
taken to Queen-square, a few days before, for a similar attempt on a
gentleman. With regard to what passed at Marlborough-street, I
have faithfully stated it; and what passed at Queen-square was
simply this, for I was present. Cook met a gentleman of some
consequence over Westminster-bridge, and used some unpleasing
language to him, which raised a mob: he thereupon threatened to
take Cook before a magistrate, who replied that he would meet him
at Queen-square; where they met, and, upon Cook’s promising not
to molest him again, he was discharged. The reporter, not satisfied
with the most shameful perversion of truth, added a malignant,
diabolical note, that extorting money is a capital offence.
I must now bestow a few observations upon the transactions at
Marlborough-street: and, first, with respect to the extortion and
respectability of the prosecutor, who, by his own account, is a
gentleman’s servant out of place—no very promising subject for
extortion; a fellow whom Cook states to be one of the lowest
attendants at Vere-street, for the most depraved purposes;—a fellow
that was obliged to borrow a great coat to appear before the
magistrates!!! Extortion! from a wretch living in a back garret, and
literally in rags, and whose breeches, from the superfluous
apertures, gave manifestation of rapturous attacks upon his virtue!
and whose finances were so far from satisfying the demands of
extortion, that the Constable was obliged to go to one of his
Birmingham friends, for money to defray the trifling expences at the
office.
I have only to observe, on the conduct of the reporter, that there
can be no act of diabolism equal to the mischief originating in the
ignorance, falshood, or that species of low bribed calumny, that is to
be procured from the necessities of a fellow, who would libel the
fairest character in the nation, for a beef-steak and pot of porter:
and it is astonishing, that magistrates will suffer such vermin to
infest the office:—at best, these hireling reporters are not to be
depended upon; and the experience of half a century enables me to
aver that I do not remember half a score correct reports, even of
what passed in the courts at Westminster. Nor does the mischief
close here: for the supplementary part of the atrocity is equally
dreadful; which consists in the conductors of the Papers almost
constantly refusing to correct their own errors, or censuring the
conduct of such pernicious scribblers.
The consequence attending the statement of which I am
complaining has been incalculable—for Cook’s poor brother, who is a
man of unblemished character, is totally ruined—no one will employ
him. Degraded in the estimation of his neighbours, his few creditors
came upon him; his landlord gave him warning to quit his house.
Such have been the effects of a poison, disseminated in a manner
that has admitted of no antidote. A similar fate, also, attended the
other bail:—Mrs. Cook turned out of her lodging at an hour’s notice,
and her necessaries stopped until she had paid rent for a certain
time she could not inhabit.
I think I shall be pardoned if I pause for a moment, and ask the
candid and liberal reader whether the Reporter of such unprincipled
and profligate falshoods is not entitled to some, and what reward?
Slitting his nose! cutting off his ears—or, exalting him to a situation,
where he may receive the kindred filth of his own pestiferous
principles. I am a friend to the liberty of the press in its fullest
extent; and have given many convincing proofs of it: but I am not so
incorrigible an offender against the law of libels, as to deny my title
to a gibbet, had I been the author of so vile, so miscreantic, and so
reptilized a calumny!
The reader will recollect that Cook had deposited the money he was
detained for, under what I call a stupid, useless Act of Parliament,
calculated for no possible convenience for a defendant; who could,
without the aid of this idiotic act, always deposit the debt and costs
with a Sheriff’s-officer until he gave a bail-bond: but, under this Act,
the debt is directed to be paid into Court, with ten pounds costs;
and when the defendant has justified bail, he must be at the
expence of a motion, to get the money restored: but here, this
miserable defendant was denied the benefit of that Act, or, indeed,
any existing law whatever. But it is some consolation to say, that he
is the only man in the kingdom that is to be told, that neither law,
justice, or humanity dare approach him:—for he has been denied the
right of submitting his case to a Jury, whether he owed the money
or not. The ten pounds costs having been deposited with the debt,
the plaintiff’s attorney has laid his hands upon the whole, and the
cause ended by a legalized robbery: for the infamous statement in
the newspaper, before mentioned, has shut out the possibility of
bailing the action and trying it.
Nor can the reptilized attorney, Wooley, who robbed them under
pretext of defending them, as stated in the beginning of this treatise
hope for a protraction of his iniquities longer than the ensuing term,
when I shall apply to the Court of King’s Bench, to appoint Cook an
Attorney; and I beg the candid reader will believe, when I assert
that I have used great economy in the expenditure of the vast sum
of Wooley’s iniquity, which the premier fiend of hell would tremble to
count over: but perhaps the hour is at no great distance, when I
shall be more profuse in the description of these person’s sufferings,
and oppression; when I undertake the task of identifying the men
with the crimes, and the crimes with the men.
In fine, the poor beggared brother has lost the money, without a
possibility of a remedy: and, as a proof that Cook, wretched as his
situation is at this day, was not, and is not, without the good opinion
of many respectable men, an eminent tradesman, as any in the city
of Westminster, offered to indemnify bail to any amount for him, or
bail him, if he could do it without becoming subject to Newspaper
calumny.
Immediately after bailing the assault, Cook was arrested the second
time, and taken to a lock-up house (for the same malignant fate
pursued him, to the exclusion of all possibility of bailing him) where
he remained until a habeas could be sued out; upon which he was
taken to the Fleet-prison, destitute of the means of procuring a
supper or a bed, and subject to the ruffianly insults of the unfeeling
and uninformed part of the prisoners.—About three days after he was
committed to the Fleet, the Sessions for Middlesex commenced,
when he and his wife were both indicted for the pretended assault.
Cook being in the Fleet, I procured bail for the wife only; I say
procured—for, during forty years’ practice I never before, in any one
instance, hired bail: and, upon her being liberated, the magistrate,
of course, granted a supersedeas to the warrant. Yet, astonishing as
it may appear, at least to every professional man, this ill-fated
woman was seized on the Saturday-night following, as she was
going into the Fleet to her husband, by a ruffianly fellow named
Creswell, a city constable; and a companion, disdaining to be
behind-hand with his master in brutality; who took her to the Poultry
Compter, where she was kept until Monday in the most shocking of
all situations, among felons and prostitutes of the most abandoned
description.
That those who were paid for this outrage should not regard the
supersedeas does not astonish me; but that the keeper of the
compter should set a legal instrument at defiance, that strictly
forbade him to molest or imprison the defendant, I do confess,
staggers every conjecture I have been able to make upon the
subject:—I hope his conduct proceeded from sheer ignorance, and I
have some reason to think it did. However, determined not to desert
these unhappy people in any extremity, I attended the Alderman at
Guildhall, who very judiciously told this Mr. Constable Creswell he
could not take notice of any thing but the supersedeas; and, of
course, the woman was discharged. Notwithstanding this
determination of the Alderman, this same Creswell took her into the
County of Middlesex and delivered her to a constable belonging to
Marlborough-street-office; and, after being wearied and tortured for
two or three hours more, she was again discharged. It seems that
at Marlborough-street (to use a trite phrase) the cat jumped out of
the bag! for there arose a violent contention between Shenstone,
the nominal prosecutor, and these upright, assiduous, ministers of
justice, respecting, first, the amount of their wages, and, secondly,
who was to pay them?—which, I am informed, ended in that species
of mutual satisfaction that thieves generally experience, in dividing
the plunder of an empty purse!
Another act of outrage was committed by this Creswell against Cook
and his wife, about a fortnight previous to the execution of the
warrant in question; when he apprehended, and took them to the
Compter, without any warrant at all! where they were kept for the
whole night. In the morning they went before Sir Matthew Bloxam,
with all the marks and bruises they had received from Creswell over-
night: but as the worthy Alderman will, in all probability, be called
upon in a court of Justice, as a material witness, I shall add no more
at present, than that he acted like himself upon the occasion; and,
thinking that Cook had as much right to his own watch as Creswell,
with a severe reprimand compelled him to return it; [62] at the same
time we shall see if Creswell’s indemnity for all these outrages will
shelter him from the wholesome scourge of criminal justice.
I hope and trust that there is not a lover of genuine liberty, but will
contribute a little to enable these oppressed people to obtain
impartial justice; not only against Constable Creswell, but against his
employers, whose abominations can have no reliance upon impunity,
but from the destruction of Cook. And here I must repeat Lord
Ellenborough’s assertion, that although the parties may not be
entitled to respect, yet the laws must be respected! therefore, the
same laws that punish crimes, must protect innocence; and I hope
brutal constable Creswell, and rapacious lawyer Woolley, [63] will
receive a lesson, that will inspire them with some small portion of
reverence for his Lordship’s opinion.
The sufferings of Cook and his wife are marked with great
aggravation: for they are sacrificed, not as an atonement for their
own crimes, but that the crimes of others shall not be atoned for!
and for myself, I should not be surprised at any act of desperation
their wrongs may drive them to;—for, when injustice is thought no
crime, revenge becomes a virtue.

FINIS.
ADDENDA.

I think it necessary to give some account of the debt of £60 for
which Cook is now confined in the Fleet, at the suit of Mr. Henry
Meux.
Cook had dealt with the house of Starkey and Jennings for many
years; who, soon after his conviction, seized upon all his property,
both at the Swan in Vere-street, and at the White Horse, Long Acre;
the latter being kept by his wife separately, out of which she was
forcibly turned; but, by what authority, remains involved in mystery
to this day. His dealings with them appear to have been exemplary
correct and just, and such as by no means could warrant the step
they had taken;—there was no money due for any beer, for that in
the cellar was over-paid:—the mortgage on the lease was reduced
from £1175 to £116; and there was no debt of any kind at the time
of the levy; for he did not owe ten shillings in the world.
Among the property seized under this levy in Vere-street, were six
butts of porter, sent in by Mr. Henry Meux, the present plaintiff
against Cook. Some days after the levy, Mrs. Cook learnt from the
men in possession that all the effects were to be sold:—she
thereupon sent to Meux, to take his beer back;—and to the Distiller,
Mr. Temple: the latter of whom succeeded in getting away his
property. But Meux’s claim was resisted; and in a manner that
would have induced me (had Mr. Meux been my client) to have
advised him to bring his action, against Starkey and not Cook; for it
must be observed, this beer constituted the only debt due to Meux;
whose cooper and servants will prove the struggle Cook made to
have the beer returned: and I have no doubt but a jury will
remunerate him sufficiently, not only to discharge Mr. Meux’s
demand, but to recompense him for his imprisonment and other
wrongs he has suffered from the transactions of Starkey’s house:—
however, let my suspicions on the subject be what they may, this is
not a time to promulgate them; as I shall reserve them for the
contents of a counsel’s brief.—I have taken a vast deal of pains to sift
the conduct of this brewing community; and the end of my research
is, that no fault attaches to the Draymen, or horses! Knavery, as
well as Folly, distinctly may accomplish much mischief; but a
combination of both, generally destroys the effect of each other;—
which may, perhaps, be illustrated in the present case.
The levy was made on the 18th July, 1810, the property all taken,
and Mrs. Cook turned into the street; and on the 26th of September
following, being a space of two months that Cook had received no
account of his property; but the evening of the day previous to his
standing in the pillory, he was visited by Mr. Batt, the fac-totum of
Starkey’s Brewhouse, and some other persons, to finally settle the
accounts with the house. It must be confessed it was an hour illy
calculated to settle accounts—more especially such accounts. Cook,
in the moment of distraction, expecting to meet a violent death in a
few hours, had neither time or spirits to expostulate; he made no
objection to any settlement; a cabbage-leaf would have been as
satisfactory to him then as the following receipt.

£381 13s. 2d.
26th Sept. 1810,
Received of Mr. Cook the sum of three hundred and eighty-one
pounds 13s. 2d. being the balance of his account, and in full of
all demands up to this day, due to Mrs. Starkey.
WILLIAM BATT.

However Mr. Batt thought the account very satisfactorily settled; and
if Cook had been murdered the next day, it might have remained
undisturbed; but, in my opinion, this receipt will prove the fruitful
mother of a monstrous progeny! It was a cunning trick for a
Brewer! It is a pity he was so scanty of a few grains (I mean of
common humanity and candor) but he is a Brewer!—As to myself it is
conduct inexplicable! but time may mature this mis-shapen fœtus of
a mash-tub; and it may live to prove that the mutilated worm that is
trod in the earth to day, may rise a scorpion to-morrow, and sting its
oppressor to the heart.
After all, it is to be wished, for the sake of public justice and
common humanity, that Mr. Meux would, for the present, discharge a
man from prison, of whose integrity and anxiety to do him justice he
has had such convincing proofs; and I am not without a hope that
he will do it, as I see by the writ his Attorney is one of those few
men who disdain making a bill of costs out of the bowels of
wretchedness. If Mr. Meux thinks the man possessed means of
paying for a dinner when he was taken to a sponging house, he has
been criminally imposed upon: for, from the moment he arrested
him, to the present time, he has not had the means of procuring a
two-penny loaf, but from the most mortifying mendicity. I do not
wish to make any observations upon the unprovoked conduct of
arresting him, for I firmly believe Mr. Meux is wholly ignorant of his
wretched situation.
After Cook’s trial for the pretended assault, the public will be favored
with the names and residences of the parties who are the principal
objects of this publication; a great number of whom will be
compelled to attend in Court, to give evidence on particular points
connected with the trial.
I have only to add, in conclusion, that, as I begun with an apology
for writing these pages at all, I now feel an equal inclination to
apologize for having written them so ill, and so unworthy the pen of
any man laying the least claim to literary abilities;—it has been an
odious task; but my end is answered if it procures the injured man
and his wife that justice I think them entitled to; and I hope the sale
of it will afford them some relief.
HOLLOWAY.
6, Richmond Buildings, Soho.

N.B. Cook intreats me to say, that during the twelve years he was in
business he dealt with the following persons, exclusive of Starkey
and Meux; whose justice he now challenges, to say if his conduct
has not been uniformly marked with integrity.

Rickets and Hill          Distillers
Sharp and Lucas           Ale Brewers
Blackbird and Burleigh    Porter ditto
Stephens and Paget        Distillers
Felix Calvert             Porter Brewer
Brown and Parry           Ditto
Parry and Brown           Distillers
Temple                    Ditto
Warren                    Brandy Merchant
Wells                     Distillers.

Does this man deserve the treatment of a cheat, a swindler, or a
thief?

HOLLOWAY, PRINTER, ARTILLERY LANE, TOOLEY STREET.


FOOTNOTES.

[62] It is very remarkable that Creswell, the author of the outrage
just stated, should be selected for the execution of the warrant for
the assault; and it is equally unfortunate for the poor woman, that
this Parochial Pole-cat lives in the neighbourhood of the Fleet-prison,
and is continually insulting her.
[63] The Fifth Number of Strictures on the Practice of Attorneys will
soon be published; when the wide-spreading branches of this
fellow’s iniquity will appear in full bloom;—the pestiferous fumes of
which will give the law-sick fiends in Tartarus a vomit.
*** END OF THE PROJECT GUTENBERG EBOOK THE PHŒNIX OF
SODOM; OR, THE VERE STREET COTERIE ***
