
Bioinformatics Primer (An Introductory Handbook For Bioinformatics Practitioners)

This document provides an introduction to bioinformatics by covering topics such as cell biology, genetics, genomics, proteomics, model organisms, computational fundamentals, mathematical concepts, biological processes and experimental methods. It discusses DNA and protein sequencing techniques such as Sanger sequencing. It also covers genome mapping methods including genetic mapping, physical mapping and restriction mapping. The overall document serves as a primer for practitioners of bioinformatics.

Uploaded by

Sambit Nayak

Bioinformatics Primer

(An Introductory Handbook for Bioinformatics Practitioners)

Bio-Bio-1 Team

March 26, 2011


Foreword

Foreword described here ...

(by some eminent personality like Prof. Dr. Liaqat Ali...)

Preface

(Team’s introduction to the project)

Contents

I Introduction... 1
1 Introduction to Bioinformatics 5

2 Introduction to Cell Biology 15


2.1 Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Cell Structure . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Cell Cycle & Cell Division Cycle . . . . . . . . . . . . . . 18
2.2 Chromosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 DNA (DeoxyriboNucleic Acid) . . . . . . . . . . . . . . . . . . . 20
2.4 RNA (RiboNucleic Acid) . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Nucleotide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Introduction to Genetics and Genomics 25


3.1 Concept of Gene . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Discovery Chronology Revealing the Concept of Central Dogma
of Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Discovery of Gene Sequence . . . . . . . . . . . . . . . . . . . . . 27
3.4 Central Dogma of Biology . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Human Genome Project . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.7 Common Terms used in Genetics . . . . . . . . . . . . . . . . . . 31

4 Introduction to Proteomics 35
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Amino Acids . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 General properties of Amino acids . . . . . . . . . . . . . 36
4.2.2.1 Structure . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2.2 Zwitter Ion . . . . . . . . . . . . . . . . . . . . . 37
4.2.2.3 Isomerism . . . . . . . . . . . . . . . . . . . . . . 37
4.2.2.4 Classification of Amino acids . . . . . . . . . . . 38
4.3 The Structure of Proteins . . . . . . . . . . . . . . . . . . . . . . 40
4.3.1 Primary Structure . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2 Secondary Structure . . . . . . . . . . . . . . . . . . . . . 42


4.4 Amino Acid Classifications . . . . . . . . . . . . . . . . . . . . . 42


4.5 Ramachandran Plot . . . . . . . . . . . . . . . . . . . . . . . . . 42

5 Some Bioinformatics Model Organisms 43


5.1 Origin and Early Evolution . . . . . . . . . . . . . . . . . . . . . 43
5.2 Virus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.1 Use of Virus in Life Sciences and Medicine . . . . . . . . 47
5.2.2 Use of Virus in Materials Science and Nanotechnology . . 48
5.3 Bacteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.1 Importance of Bacteria in Bioinformatics . . . . . . . . . 51
5.4 Escherichia coli . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.5 Archaea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.6 Fungi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.7 Human Being . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Computing Fundamentals for Bioinformatics 55


6.1 Bioinformatics Problem Solving and Algorithm Development . . 55
6.1.1 Why Do We Need Algorithm? . . . . . . . . . . . . . . . . 56
6.1.2 How to Design an Algorithm? . . . . . . . . . . . . . . . . 57
6.1.3 How to Write Pseudocode . . . . . . . . . . . . . . . . . . 59
6.1.4 Types of Algorithm . . . . . . . . . . . . . . . . . . . . . 59
6.2 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Concept and Usage of Database . . . . . . . . . . . . . . . . . . . 59
6.4 Computational Model . . . . . . . . . . . . . . . . . . . . . . . . 59
6.5 Programming Concept and Applications . . . . . . . . . . . . . . 59
6.6 World Wide Web (WWW) . . . . . . . . . . . . . . . . . . . . . 59
6.7 Web Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

7 Math Primer for Bioinformatics 61

8 Biological Processes, Experimental Methods & Machinery 63


8.1 DNA Cloning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.2 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.3 Gel electrophoresis . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.4 DNA Cloning in Plasmid Vector . . . . . . . . . . . . . . . . . . 63
8.5 Sanger Method for DNA Sequencing . . . . . . . . . . . . . . . . 63
8.6 DNA Shotgun Sequencing . . . . . . . . . . . . . . . . . . . . . . 63
8.7 DNA Microarray . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.8 Recombinant DNA Technology . . . . . . . . . . . . . . . . . . . 63
8.9 Constructing Genomic and cDNA Libraries . . . . . . . . . . . . 63

II Introduction to Bioinformatics Problems 65


9 DNA & Protein Sequencing 69
9.1 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

9.2 History of DNA Sequencing . . . . . . . . . . . . . . . . . . . . . 70


9.3 Methods of DNA Sequencing . . . . . . . . . . . . . . . . . . . . 70
9.4 DNA Sequencing Process . . . . . . . . . . . . . . . . . . . . . . 71
9.5 DNA Sequencing in Real Time . . . . . . . . . . . . . . . . . . . 73
9.6 Next Generation DNA Sequencing . . . . . . . . . . . . . . . . . 73
9.7 Complete Genome Sequencing . . . . . . . . . . . . . . . . . . . . 73
9.8 Challenges of DNA Sequencing . . . . . . . . . . . . . . . . . . . 74
9.9 Usage of DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . 75
9.10 DNA Sequencing: Where to Next . . . . . . . . . . . . . . . . . . 76
9.11 Case Study: Human Genome Project . . . . . . . . . . . . . . . . 76
9.12 Protein Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . 76

10 Genome Mapping 79
10.1 Genetic Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
10.1.1 Landmarks of Genetic Maps . . . . . . . . . . . . . . . . . 81
10.1.2 Linkage Analysis . . . . . . . . . . . . . . . . . . . . . . . 81
10.2 Physical Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
10.3 Restriction Mapping . . . . . . . . . . . . . . . . . . . . . . . . . 82
10.3.1 Historical Background . . . . . . . . . . . . . . . . . . . . 82
10.3.2 Restriction Map . . . . . . . . . . . . . . . . . . . . . . . 83
10.3.3 Restriction Mapping Process . . . . . . . . . . . . . . . . 84
10.3.4 Uses of Restriction Mapping . . . . . . . . . . . . . . . . 86

11 Sequence Alignment 91
11.1 DNA & Protein Sequences Comparison and Alignment . . . . . . 91
11.1.1 Sequence Alignment: . . . . . . . . . . . . . . . . . . . . . 92
11.1.2 Motivation for Sequence Alignment . . . . . . . . . . . . . 92
11.1.3 Similarity and Homology of Sequences . . . . . . . . . . . 93
11.1.4 Type of Sequence Alignment . . . . . . . . . . . . . . . . 94
11.1.5 Computational Methods & Models for Sequence Alignment 96
11.1.5.1 Dot Matrix . . . . . . . . . . . . . . . . . . . . . 97
11.1.5.2 Dynamic Programming . . . . . . . . . . . . . . 98
11.1.6 Importance of Sequence Alignment . . . . . . . . . . . . . 99
11.1.7 Sequence Alignment Tools . . . . . . . . . . . . . . . . . . 100
11.2 Multiple Sequence Alignment . . . . . . . . . . . . . . . . . . . . 100
11.2.1 Methods for Multiple Sequence Alignment . . . . . . . . . 102
11.2.1.1 Dynamic Programming based Models . . . . . . 102
11.2.1.2 Statistical Methods and Probabilistic Models . . 102
11.2.2 Usage of Multiple Sequence Alignment . . . . . . . . . . . 103
11.2.3 Tools for Multiple Sequence Alignment . . . . . . . . . . . 103
11.3 Regulatory Motif Finding . . . . . . . . . . . . . . . . . . . . . . 103
11.3.1 Gene-Regulation & Regulatory Motif . . . . . . . . . . . . 104
11.3.2 Motif Discovery Methods . . . . . . . . . . . . . . . . . . 104
11.3.3 Tools for Motif Finding . . . . . . . . . . . . . . . . . . . 108

12 Gene Prediction 109


12.1 Introduction to Genome Annotation & Gene Prediction . . . . . 109
12.1.1 Gene Finding Principles and Guidelines . . . . . . . . . . 110
12.1.2 Gene Prediction Approaches . . . . . . . . . . . . . . . . 113
12.1.2.1 Extrinsic approaches . . . . . . . . . . . . . . . . 113
12.1.2.2 Ab-initio Gene Prediction . . . . . . . . . . . . . 114
12.1.2.3 Comparative Gene Prediction . . . . . . . . . . 114
12.1.2.4 Homology-based Methods . . . . . . . . . . . . . 114
12.1.3 Gene Prediction Tools . . . . . . . . . . . . . . . . . . . . 115

13 Genome Analysis 117

14 Phylogenetic Analysis 119


14.1 Introduction of Phylogeny . . . . . . . . . . . . . . . . . . . . . . 119
14.2 Concept of Evolution & Evolutionary Model . . . . . . . . . . . . 120
14.3 Phylogenetic Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
14.4 Types of Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . 123
14.5 Approaches in Phylogenetic Analysis . . . . . . . . . . . . . . . . 125
14.5.1 Phenetic (or Clustering) Approach . . . . . . . . . . . . . 125
14.5.2 Cladistic Approach . . . . . . . . . . . . . . . . . . . . . . 126
14.5.3 Evolutionary Systematic Approaches . . . . . . . . . . . . 126
14.6 Methods for Phylogenetic Tree-Construction . . . . . . . . . . . . 126
14.6.1 Distance-based Methods . . . . . . . . . . . . . . . . . . . 126
14.6.1.1 Unweighted Pair Group Method with Arithmetic
Mean (UPGMA) . . . . . . . . . . . . . . . . . . 126
14.6.1.2 Neighbor Joining Algorithm (NJ) . . . . . . . . . 129
14.6.1.3 Fitch-Margoliash (FM) Method . . . . . . . . . 129
14.6.1.4 Minimum Evolution (ME) Method . . . . . . . . 130
14.6.2 Character-based Method . . . . . . . . . . . . . . . . . . . 130
14.6.2.1 Maximum Parsimony (MP) Method . . . . . . . 130
14.6.2.2 Maximum Likelihood (ML) Method . . . . . . . 131
14.7 Phylogenetic Analysis Tools . . . . . . . . . . . . . . . . . . . . . 132

15 Protein Folding 133


15.1 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
15.2 Protein Classification . . . . . . . . . . . . . . . . . . . . . . . . . 134
15.3 Protein Folding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
15.4 Protein Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
15.4.1 Primary Structure . . . . . . . . . . . . . . . . . . . . . . 134
15.4.2 Secondary Structure . . . . . . . . . . . . . . . . . . . . . 134
15.4.2.1 α-helix . . . . . . . . . . . . . . . . . . . . . . . 134
15.4.2.2 β-sheets . . . . . . . . . . . . . . . . . . . . . . . 135
15.4.3 Tertiary Structure . . . . . . . . . . . . . . . . . . . . . . 135
15.4.4 Quaternary Structure . . . . . . . . . . . . . . . . . . . . 135
15.5 Experimental Techniques for Structure Determination . . . . . . 135
15.5.1 X-ray Crystallography . . . . . . . . . . . . . . . . . . . . 135

15.5.2 Nuclear Magnetic Resonance spectroscopy (NMR) . . . . 136


15.5.3 Electron Microscopy/Diffraction . . . . . . . . . . . . . . 136
15.5.4 Free electron lasers . . . . . . . . . . . . . . . . . . . . . . 136
15.6 Protein Structure Classification . . . . . . . . . . . . . . . . . . . 136
15.6.1 Two types of algorithms . . . . . . . . . . . . . . . . . . . 136
15.7 Protein Structure Prediction . . . . . . . . . . . . . . . . . . . . 137
15.7.1 Stages of Protein Structure Prediction . . . . . . . . . . . 137
15.8 Secondary & Tertiary Structure Prediction Methods . . . . . . . 138
15.8.1 Ab-initio Method . . . . . . . . . . . . . . . . . . . . . . . 139
15.8.2 Statistical Method (old fashioned) . . . . . . . . . . . . . 140
15.8.3 Nearest Neighbor Approach . . . . . . . . . . . . . . . . . 140
15.8.4 Neural Network Approach . . . . . . . . . . . . . . . . . . 140
15.8.5 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . 141
15.8.6 Support Vector Machine based methods . . . . . . . . . . 141
15.9 Performance of Structure Prediction Approaches . . . . . . . . . 141
15.10 Protein Databases . . . . . . . . . . . . . . . . . . . . . . . . . . 142
15.10.1 Structural Classification Databases . . . . . . . . . . . . . 142

16 Structural Bioinformatics & Drug Discovery 143


16.1 Traditional Methods of Drug Discovery . . . . . . . . . . . . . . 144
16.2 Modern Methods of Drug Discovery . . . . . . . . . . . . . . . . 144
16.3 Structural Bioinformatics . . . . . . . . . . . . . . . . . . . . . . 145
16.4 Bioinformatics and Drug Discovery Pipeline . . . . . . . . . . . . 145
16.4.1 Target Identification and Selection . . . . . . . . . . . . . 145
16.4.1.1 Types of Targets . . . . . . . . . . . . . . . . . . 146
16.4.2 Target Validation . . . . . . . . . . . . . . . . . . . . . . . 146
16.4.3 Assay Development . . . . . . . . . . . . . . . . . . . . . . 146
16.4.4 Lead Identification . . . . . . . . . . . . . . . . . . . . . . 146
16.4.5 Lead Development . . . . . . . . . . . . . . . . . . . . . . 146
16.4.6 Screening and Hits to Leads . . . . . . . . . . . . . . . . . 146
16.4.7 Lead Optimization . . . . . . . . . . . . . . . . . . . . . . 146
16.4.8 Drug Development . . . . . . . . . . . . . . . . . . . . . . 147
16.4.9 Drug Testing . . . . . . . . . . . . . . . . . . . . . . . . . 147
16.4.10 Preclinical Development . . . . . . . . . . . . . . . . . . . 147
16.4.11 Drug Toxicology . . . . . . . . . . . . . . . . . . . . . . . 147
16.4.12 Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . 147
16.4.13 NDA and New Drug to Market . . . . . . . . . . . . . . . 147
16.5 High-Throughput Screening (HTS) . . . . . . . . . . . . . . . . . 147
16.6 Ligand-based Drug Design . . . . . . . . . . . . . . . . . . . . . . 147
16.7 Computer Aided Drug Design (CADD) . . . . . . . . . . . . . . 147
16.8 Quantitative Structure Activity Relationships (QSAR) . . . . . . 148
16.9 Individual Drug Discovery . . . . . . . . . . . . . . . . . . . . . . 148

III Introduction to Bioinformatics Computations 149

17 Statistical and Probabilistic Methods in Bioinformatics 153


17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
17.2 Concept of Randomness and Variability . . . . . . . . . . . . . . 154
17.3 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 154
17.4 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 154
17.5 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . 154
17.6 Naive Bayes Classification . . . . . . . . . . . . . . . . . . . . . . 154

18 Computational Methods in Bioinformatics 155


18.1 Exhaustive Search . . . . . . . . . . . . . . . . . . . . . . . . . . 155
18.2 Discrete-State Models . . . . . . . . . . . . . . . . . . . . . . . . 155
18.3 Evolutionary Computation . . . . . . . . . . . . . . . . . . . . . 155
18.4 Greedy Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 155
18.5 String Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 155
18.6 Hybrid Computational Methods . . . . . . . . . . . . . . . . . . . 155

19 Bioinformatics Data Mining 157


19.1 What is Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . 157
19.2 Data Mining Task . . . . . . . . . . . . . . . . . . . . . . . . . . 158
19.3 Association Rules Mining . . . . . . . . . . . . . . . . . . . . . . 161
19.4 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
19.5 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
19.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
19.7 Fuzzy Classification . . . . . . . . . . . . . . . . . . . . . . . . . 161
19.8 Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . 161
19.9 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . 161
19.10 Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 161
19.11 Machine Learning Approaches . . . . . . . . . . . . . . . . . . . . 161

20 Some Algorithms in Bioinformatics 163


20.1 BLAST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.2 FASTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.3 CLUSTALW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.4 PHD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.5 Predator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.6 TRILOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.7 Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
20.8 DALI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

IV Some Widely Used Methods & Models in Bioinformatics 165
21 Dynamic Programming And Bioinformatics 169
21.1 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . 169
21.1.1 Concept of Dynamic Programming . . . . . . . . . . . . . 169
21.1.2 Dynamic Programming Algorithm and Sequence Alignment 171
21.1.3 Algorithm in Pseudocode . . . . . . . . . . . . . . . . . . 173
21.1.4 Global Alignment & DP . . . . . . . . . . . . . . . . . . . 174
21.1.5 Local Alignment & DP . . . . . . . . . . . . . . . . . . . 174
21.1.6 Alignment with Gap Penalty . . . . . . . . . . . . . . . . 174
21.1.7 Multiple Alignment & DP . . . . . . . . . . . . . . . . . . 176
21.1.8 Other Applications of Dynamic Programming . . . . . . . 176

22 Neural Network And Bioinformatics 179


22.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 179
22.1.1 Why Machine Learning in Bioinformatics . . . . . . . . . 180
22.2 Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . 181
22.3 Neural Network Architecture . . . . . . . . . . . . . . . . . . . . 182
22.3.1 Feed-Forward Neural Networks . . . . . . . . . . . . . . . 182
22.3.2 Training of Feed-Forward Neural Networks . . . . . . . . 183
22.4 Neural Network Learning Algorithms . . . . . . . . . . . . . . . . 184
22.4.1 Supervised Learning Neural Networks . . . . . . . . . . . 184
22.4.2 Unsupervised Learning Neural Networks . . . . . . . . . . 185

23 Hidden Markov Model (HMM) And Bioinformatics 189


23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
23.2 CpG islands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
23.3 Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
23.4 Hidden Markov Models (HMM) . . . . . . . . . . . . . . . . . . . 198
23.5 HMM and Pair wise Sequence Alignment . . . . . . . . . . . . . 199
23.6 Profile HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
23.7 HMM and Multiple Sequence Alignment . . . . . . . . . . . . . . 199
23.8 Advantages and Disadvantages using HMM . . . . . . . . . . . . 199
23.9 Other Application of HMM in Bioinformatics . . . . . . . . . . . 200
23.10 Gene Finding using HMM . . . . . . . . . . . . . . . . . . . . . . 200

24 Genetic Programming And Bioinformatics 201


24.1 Genetic Programming . . . . . . . . . . . . . . . . . . . . . . . . 201
24.2 Analogy of Genetic Programming to Biology . . . . . . . . . . . 202
24.3 Steps of Genetic Programming . . . . . . . . . . . . . . . . . . . 202
24.4 Basic Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . 203
24.5 Some Points on Genetic Programming . . . . . . . . . . . . . . . 204
24.5.1 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
24.5.2 Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
24.5.3 Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

24.6 Parameters of Genetic Programming . . . . . . . . . . . . . . . . 207


24.7 Genetic Algorithm Performance . . . . . . . . . . . . . . . . . . . 207
24.8 Genetic Algorithm for Sequence Alignment . . . . . . . . . . . . 207
24.9 Applications of Genetic Algorithm . . . . . . . . . . . . . . . . . 211
24.10 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . 211

V Bioinformatics Tools 213


25 Python - Primer Programming Language for Bioinformatics 217

26 Python And Bioinformatics 219

27 Tools and Libraries for Bioinformatics 221

VI Bioinformatics : Current & Future 223


28 Prominent Research Areas in Bioinformatics 227

29 Endless Horizon of Bioinformatics: Future Directions 229

30 The Crazy Corner with ALL WILD Imaginations 231

A Bioinformatics Terminologies 233

B Amino Acid Lists 235

C Book Layout 237


List of Figures

2.1 A Typical Animal (Eukaryotic) Cell . . . . . . . . . . . . . . . . 16


2.2 Protein Synthesis is Started in the Ribosome . . . . . . . . . . . 17
2.3 Cell Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Nucleotide by Parts . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Structure of A, T, C, G and U . . . . . . . . . . . . . . . . . . . 21
2.6 Phosphate Molecule . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 Pentose Sugar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 Nucleotide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.9 Nucleotide Sequence . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.10 DNA Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1 Central Dogma of Biology . . . . . . . . . . . . . . . . . . . . . . 27


3.2 Figure to be Drawn . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Synthesis of a Protein from mRNA by Translation (part-a) . . . . 28
3.4 Synthesis of a Protein from mRNA by Translation (part-b) . . . . 29
3.5 Normal Human Karyotype . . . . . . . . . . . . . . . . . . . . . . 33

4.1 Structure of Amino Acid (INCOMPLETE...) . . . . . . . . . . . 37


4.2 Protein formation from amino acids . . . . . . . . . . . . . . . . 37
4.3 Structure of Zwitter Ion (INCOMPLETE...) . . . . . . . . . . . . 38
4.4 Amino Acids Classification . . . . . . . . . . . . . . . . . . . . . 39
4.5 Amino Acids Location Distribution . . . . . . . . . . . . . . . . . 40
4.6 Single- and three-letter codes for amino acids of a primary sequence 41
4.7 The primary sequences of human and sperm whale myoglobin . . 41

5.1 Bacteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.1 Problem Solving Strategy . . . . . . . . . . . . . . . . . . . . . . 56


6.2 DNA Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

9.1 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


9.2 DNA Sequencing Process . . . . . . . . . . . . . . . . . . . . . . 72

10.1 Cruciform structure of Restriction Site . . . . . . . . . . . . . . . 84


10.2 Single and Double Digest . . . . . . . . . . . . . . . . . . . . . . 85


10.3 Cloning the Plasmids . . . . . . . . . . . . . . . . . . . . . . . . . 87


10.4 Aligning Overlapping Clones . . . . . . . . . . . . . . . . . . . . 88
10.5 A Generic Physical Map . . . . . . . . . . . . . . . . . . . . . . . 89

11.1 Sequence Alignment (Not Yet Drawn) . . . . . . . . . . . . . . 93


11.2 Global Alignment (Not Yet Drawn) . . . . . . . . . . . . . . . 94
11.3 Local Alignment (Not Yet Drawn) . . . . . . . . . . . . . . . . 95
11.4 Pairwise Sequence Alignment (Not Yet Drawn) . . . . . . . . 96
11.5 Multiple Sequence Alignment (Not Yet Drawn) . . . . . . . . 96
11.6 Dot Matrix (Not Yet Drawn) . . . . . . . . . . . . . . . . . . . 97
11.7 Multiple Sequence Alignment - Evolutionary Tree . . . . . . . . . 101
11.8 Gene Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.9 Motif Logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.10 Motif - Schematic Diagram . . . . . . . . . . . . . . . . . . . . . 106
11.11 Motif Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

12.1 Gene Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110


12.2 Gene Coding Density . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.3 Gene Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
12.4 Open Reading Frames . . . . . . . . . . . . . . . . . . . . . . . . 113
12.5 Gene Prediction Methodologies . . . . . . . . . . . . . . . . . . . 113

14.1 DNA Sequence Evolution . . . . . . . . . . . . . . . . . . . . . . 120


14.2 Homology & Similarity . . . . . . . . . . . . . . . . . . . . . . . . 121
14.3 Gene Duplication-Deletion & Speciation . . . . . . . . . . . . . . 121
14.4 Gene Duplication-Deletion & Speciation Example . . . . . . . . . 122
14.5 Phylogenetic Tree Description . . . . . . . . . . . . . . . . . . . . 123
14.6 Rooted Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
14.7 Unrooted Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
14.8 Distance between Two Sequences . . . . . . . . . . . . . . . . . . 127
14.9 Distance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
14.10 Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
14.11 Choice of Maximum Parsimony . . . . . . . . . . . . . . . . . . 130

19.1 What is Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . 158

21.1 Edit Graph (This is a Temp-Pic, New To be Drawn Later) . . . 172


21.2 Alignment Operations (This is a Temp-Pic, New To be Drawn
Later) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
21.3 ASM Matrix Building (This is a Temp-Pic, New To be Drawn
Later) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
21.4 ASM Matrix Cell Building (This is a Temp-Pic, New To be
Drawn Later) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
21.5 Dynamic Scoring Function (This is a Temp-Pic, New To be Drawn
Later) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
21.6 BLOSUM50 Matrix (This is a Temp-Pic, New To be Drawn Later) 175

21.7 DP Algo (This is a Temp-Pic, New To be Drawn Later) . . . . . 176
21.8 Local Alignment 1 (This is a Temp-Pic, New To be Drawn Later) 177
21.9 Local Alignment 2 (This is a Temp-Pic, New To be Drawn Later) 178
21.10 Gap Penalty (This is a Temp-Pic, New To be Drawn Later) . . . 178
21.11 Multiple Alignment (This is a Temp-Pic, New To be Drawn Later) 178

22.1 Biological Neural Network . . . . . . . . . . . . . . . . . . . . . . 181


22.2 Multi-Layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . 183
22.3 Self-Organizing Map . . . . . . . . . . . . . . . . . . . . . . . . . 186

23.1 C − G pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190


23.2 CG dinucleotide . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
23.3 CpG islands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
23.4 DNA Methylation . . . . . . . . . . . . . . . . . . . . . . . . . . 191
23.5 A Sequence from Human Genome with CG-dinucleotides . . . . . 191
23.6 Markov Chain for Two States . . . . . . . . . . . . . . . . . . . . 192
23.7 Complete Markov Chain Template . . . . . . . . . . . . . . . . . 194
23.8 Markov Chain for Dinucleotides . . . . . . . . . . . . . . . . . . . 195
23.9 Markov Chain for Dinucleotides from CpG-islands . . . . . . . . 195
23.10 Markov Chain for Dinucleotides from non-CpG islands . . . . . 196
23.11 Combined Model for CpG & non-CpG islands . . . . . . . . . 198

24.1 Binary Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 204


24.2 Permutation Encoding . . . . . . . . . . . . . . . . . . . . . . . . 205
24.3 Permutation Encoding . . . . . . . . . . . . . . . . . . . . . . . . 205
24.4 Tree encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
24.5 Single Point Crossover . . . . . . . . . . . . . . . . . . . . . . . . 206
24.6 Multiple Point Crossover . . . . . . . . . . . . . . . . . . . . . . . 206
24.7 Uniform Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . 206
24.8 Arithmetic Crossover . . . . . . . . . . . . . . . . . . . . . . . . . 206
24.9 Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
24.10 Crossing over: Need to Draw . . . . . . . . . . . . . . . . . . . . 211
24.11 Mutation: Need to Draw . . . . . . . . . . . . . . . . . . . . . . . 211
24.12 Sequence Alignment Using Genetic Algorithm: Under Construction 212
List of Tables

14.1 Number of Possible Rooted & Unrooted Phylogenetic Trees for
Different Numbers of OTUs . . . . . . . . . . . . . . . . . . . . . 125

16.1 List of Modulators . . . . . . . . . . . . . . . . . . . . . . . . . . 146

Part I

Introduction...


Introduction . . .
Chapter 1

Introduction to
Bioinformatics

—Fokhruzzaman
Four technologies that will create major disruptions of our current realities are
Information Technology (IT), Biotechnology (Bio), Nanotechnology (Nano), and
Neurotechnology (Neuro). All four have already shown tremendous potential to
influence our future. Although IT now cuts across all scientific disciplines, its
impact on Biotechnology has been so large that a new discipline, Bioinformatics,
has emerged. As the NCBI defines it, Bioinformatics is the field of science in
which Biology, Computer Science, and Information Technology merge into a single
discipline. The ultimate goal of the field is to enable the discovery of new
biological insights and to create a global perspective from which unifying
principles in Biology can be discerned. Artificial Intelligence techniques have
thus become an integral part of Bioinformatics. The three major objectives of
Bioinformatics are: (1) to analyze the very large and growing volume of
biological data, (2) to develop smarter tools that can handle its increasing
complexity, and (3) to interpret the results of both wet-lab and in-silico
experiments. Some of the major applications of Bioinformatics are: (a) mapping
information about different biomolecules, (b) comparing DNA, RNA, and protein
sequences, (c) predicting the 3-D structures of gene products and proteins,
(d) predicting the functions of gene products and proteins, and (e) designing
primers.

Our future ability to make new biological discoveries will depend strongly on
our ability to combine and correlate diverse data sets along multiple dimensions
and scales, rather than on a continued effort focused on traditional areas.
Sequence data will have to be integrated with structure and function data, with
gene expression data, with pathway data, with phenotypic and clinical data,
and so forth. Basic research within bioinformatics will have to deal with these
issues of systems and integrative biology while the amount of data is growing
exponentially. The large amounts of data create a critical need for theoretical,
algorithmic, and software advances in storing, retrieving, networking,
processing, analyzing, navigating, and visualizing biological information. In
turn, biological systems have inspired computer science advances with new
concepts, including genetic algorithms, artificial neural networks, computer
viruses and synthetic immune systems, DNA computing, artificial life, and hybrid
VLSI-DNA gene chips. This cross-fertilization has enriched both fields and will
continue to do so in the coming decades. In fact, the boundaries between
carbon-based and silicon-based information processing systems, whether
conceptual or material, have begun to blur.

Computational tools for classifying sequences, detecting weak similarities,
separating protein coding regions from non-coding regions in DNA sequences,
predicting molecular structure, post-translational modification and function, and
reconstructing the underlying evolutionary history have become an essential
component of the research process. This is essential to our understanding of life
and evolution, as well as to the discovery of new drugs and therapies. Bioinfor-
matics has emerged as a strategic discipline at the frontier between biology and
computer science, impacting medicine, biotechnology, and society in many ways.
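
As a small illustration of separating candidate protein-coding regions from
the rest of a DNA sequence, the sketch below scans the forward strand for open
reading frames (a start codon ATG followed by codons up to the first in-frame
stop codon). It is only a toy under stated assumptions: the function name
find_orfs and the min_codons threshold are illustrative choices, and real gene
finders use much richer statistical models and also scan the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ORFs on the forward strand.

    An ORF here runs from an ATG up to and including the first in-frame
    stop codon; `end` is exclusive. `min_codons` is an assumed threshold
    (counting the start and stop codons) used to skip very short hits.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return sorted(orfs)
```

For example, find_orfs("CCATGAAATAGGG") reports a single candidate region
covering the substring ATGAAATAG.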

Large databases of biological information create both challenging data mining
problems and opportunities, each requiring new ideas. In this regard,
conventional computer science algorithms have been useful, but are increasingly
unable to address many of the most interesting sequence analysis problems. This
is due to the inherent complexity of biological systems, brought about by
evolutionary tinkering, and to our lack of a comprehensive theory of life's
organization at the molecular level. Machine-learning approaches (e.g. neural
networks, hidden Markov models, support vector machines, belief networks), on
the other hand, are ideally suited for domains characterized by the presence of
large amounts of data, "noisy" patterns, and the absence of general theories.
The fundamental idea behind these approaches is to learn the theory
automatically from the data, through a process of inference, model fitting, or
learning from examples. Thus they form a viable complementary approach to
conventional methods.

Introduction to Biological Systems (Claude-Henry Volmar, Nikunj Patel, Amita
N. Quadros, Daniel Paris, Venkatarajan S. Mathura, and Michael Mullan. In: V.S.
Mathura, P. Kangueane, Bioinformatics: A Concept-Based Introduction, DOI
10.1007/978-0-387-84870-9_1, Springer Science+Business Media, LLC 2009.)

1. Molecules of Life
Biochemical molecules such as deoxyribonucleic acid (DNA), ribonucleic acid
(RNA), proteins, carbohydrates, and lipids are fundamental for cellular
organization, and their complex interplay with each other dictates various
aspects of living things. They enable a systematic execution of numerous
biological processes in a defined manner to maintain life at the cellular level
(Kitano, 2002; Noble, 2002). The genetic materials (DNA and RNA) are tightly
regulated in organisms. At any given moment, organisms have to deal with
different pressures (internal or external) by controlling various biochemical
molecules, thus maintaining a balance or, in other terms, homeostasis.

The proper function of biochemical molecules is crucial to the survival of any
given organism. Since mutations and other modifications caused by selective
pressures are sometimes irreparable, organisms often have to adapt in order to
survive and pass on their genes to the next generation. Organisms, therefore,
evolve. The mechanisms involved in such a difficult task as maintaining the
basic life of an organism are very complex. Regulation at the molecular level is
essential for the maintenance of life at the cellular level. Problems at the molec-
ular level often result in physiological alterations that in turn affect homeostasis
of the whole organism.

The life of an organism is mapped in its genome, a long sequence of nucleic
acids that consists of the entire set of chromosomes of the organism. A gene is
a stretch of nucleic acids that represents a functional unit of the genome.
Each gene codes for a limited set of proteins. The same genes may be found in
very distant animals such as a cow and a jellyfish, but their regulation (control
of the activity of those genes) may be different and appears to be of utmost
importance. Cellular processes such as apoptosis (programmed cell death) are
encoded within the genome of individual organisms in a complex manner. With
the sequencing of the human genome, mankind has for the first time the
opportunity to attempt to understand the involvement of genes in the sequence
of events involved in the development of an organism (ontogenesis) as well as
in the etiology of various diseases. RNA is the product of the transcription of
DNA and is then translated into a polypeptide, which folds into a functional
form called a protein.
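
The flow from DNA to RNA to polypeptide described above can be sketched in a
few lines of Python. This is a minimal illustration, assuming the input is the
coding strand of the DNA (so transcription amounts to replacing T with U) and
using the standard genetic code; the names transcribe, translate, and
CODON_TABLE are illustrative and do not come from any particular library.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(coding_dna):
    """Transcription (simplified): coding-strand DNA to mRNA."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translation: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, transcribe("ATGGCCTAA") gives the mRNA AUGGCCUAA, and translating
that mRNA gives the two-residue peptide MA (methionine, alanine).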

Any mutation in DNA, if not repaired by the cell's repair machinery (including
various polymerases), may result in the transcription of faulty RNA, which in
turn is translated into an altered protein lacking its original activity. This
may cause major problems such as protein aggregation and misfolded proteins,
which can resist degradation and may result in fatal diseases. Naturally
occurring single nucleotide polymorphisms (SNPs) in the human population may
influence gene function and expression in individuals. Functional variants,
that is, genetic changes such as SNPs that alter amino acids in proteins, gene
expression, or gene splicing, are of great interest.

The first step of regulation is to fix problems at the DNA level. The next
step is to mend at the RNA level through splicing, and then at the protein
level via the proteasome / ubiquitin pathways. In eukaryotes (higher-level
organisms), DNA is transcribed to RNA (pre-mRNA) that consists of introns and
exons. The exons carry the code that will be translated into protein, whereas
the introns are eventually cut out through splicing. The resulting RNA is
referred to as messenger RNA (mRNA). This messenger RNA may or may not get
translated into a peptide that folds into a functional protein.
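
Splicing can be pictured as keeping the exon segments of the pre-mRNA and
joining them in order. The sketch below assumes the exon coordinates are already
known and given as (start, end) pairs with an exclusive end; the function name
splice and the toy sequence are hypothetical, and real splice-site recognition
is far more involved.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a pre-mRNA, dropping the introns.

    `exons` is a list of (start, end) index pairs, end exclusive; the
    coordinates are assumed to be known and non-overlapping.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

# Toy pre-mRNA: exon 1, one intron, exon 2 (sequence made up for illustration).
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "UUUUAA"
mrna = splice(pre_mrna, [(0, 6), (18, 24)])
```

Here mrna is AUGGCCUUUUAA: the twelve intron bases between the two exons have
been removed.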

2. Nucleic Acids: DNA Versus RNA


Nucleic acids are long chains of nucleotides, each consisting of a nitrogenous
base, a sugar moiety, and a phosphate group; neighboring nucleotides are joined
by phosphodiester bonds. Deoxyribonucleic acid (DNA) is a sequence of
nucleotides that exists as a double helix. It contains the nitrogenous bases
adenine (A), thymine (T), cytosine (C), and guanine (G) (Watson and Crick,
1953). Adenine and guanine are purines, whereas thymine and cytosine are
pyrimidines. In the DNA double helix, purines always pair with pyrimidines
through weak hydrogen bonds. This pairing follows the Watson and Crick
complementarity of A-T and G-C. It results in a double helix with a constant
diameter of 20 Angstrom (Å), with a complete helical turn every 34 Å and
10 base pairs per turn. Each strand of the DNA double helix consists of a
stretch of nucleotides (nitrogenous bases attached to a sugar-phosphate
backbone). The two strands are held together by hydrogen bonds between the
purines and pyrimidines on opposite strands. This knowledge was crucial in
understanding the process of heredity.
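
The pairing rule and the helix dimensions quoted above lend themselves to a
short worked example. The sketch below derives the partner strand of a DNA
sequence from the A-T and G-C rule and estimates the length of a helix from the
34 Å per turn and 10 base pairs per turn given in the text (3.4 Å per base
pair). The function names are illustrative.

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")   # Watson-Crick pairs: A-T, G-C

def reverse_complement(strand):
    """Return the partner strand, read in its own 5' to 3' direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]

def helix_length_angstrom(n_base_pairs, pitch=34.0, bp_per_turn=10):
    """Length of a double helix, using the figures quoted in the text."""
    return n_base_pairs * pitch / bp_per_turn
```

For instance, the partner of ATGC is GCAT, and a stretch of 100 base pairs
spans 10 helical turns, or about 340 Å.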

DNA replicates semi-conservatively (Meselson and Stahl, 1958). The double
helix opens up (in a fork-like fashion) and each strand serves as a parental
template for replication of the DNA. DNA polymerase synthesizes the new strand
in the 5' to 3' direction. Each daughter strand ends up being the complement of
a parental strand. Consequently, each replicated DNA molecule has one parental
strand and one daughter strand, hence the term semi-conservative. The genetic
make-up of an individual is termed its genotype. Most DNA sequence is conserved
among individuals, but genetic variation in about 0.1% of the DNA influences
disease risk, metabolic activity, and drug response. It is important to map the
occurrence of variation in the human genome, which can help to identify allelic
polymorphisms that result in disease. Computational techniques that can rapidly
compare entire genomes and genes will help to identify polymorphisms in a
population. Comparative genomics is a field in which DNA sequences across
several genomes are compared to understand evolutionary aspects of biological
processes.
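
Comparing sequences to locate variation can be sketched very simply when the
two sequences are already aligned and of equal length. The functions below are
a minimal illustration under that assumption (no insertions or deletions);
their names are hypothetical, and genome-scale comparison requires proper
alignment methods.

```python
def variant_positions(reference, sample):
    """List (position, reference_base, sample_base) where two aligned
    sequences of equal length differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]

def fraction_different(reference, sample):
    """Fraction of aligned positions at which the sequences differ."""
    return len(variant_positions(reference, sample)) / len(reference)
```

At the rate of about 0.1% quoted above, two aligned stretches of 1,000 bases
would be expected to differ at roughly one position.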

RNA consists of the nitrogenous bases adenine (A), uracil (U), cytosine (C),
and guanine (G) and can fold into a complex tertiary structure with hair-pin
bends that have unpaired bases. Recurring RNA structural motifs have been
observed and attributed to biological function. Some of the conformationally
recurring motifs include GNRA-like tetraloop, S1, S2, kink turns. Comparative
Algorithm to Discover Recurring Elements of Structure (COMPADRES) is an
automated approach to identify such recurrent motifs (Wadley and Pyle, 2004).
Some of these motifs may contact residues in proteins that are essential for bi-
ological function, for example, a pi-turn motif is found on RNA that interacts
with ribosomal protein L2.

3. Understanding Proteins: Sequence-Structure-Function


Understanding a protein involves understanding its sequence, structure, and
function. The primary sequence of a protein can be represented with an alphabet
of 20 letters, one per amino acid. Individual properties and standard residue
codes for each of these amino acids can be obtained from the following website:
http://www.imb-jena.de/IMAGE_AA.html#Properties. Studies have shown that amino
acids can be exchanged with each other without compromising the structure
(Azarya-Sprinzak et al., 1997; Benner et al., 1994; Gonnet et al., 1992; Johnson
and Overington, 1993; Jones et al., 1992; Naor et al., 1996). Such exchanges
are possible because amino acids share similar physicochemical properties, and
changes within similar groups are tolerated (Taylor, 1986). The degree of
substitution at a particular residue position depends on the functional role
and the environmental location of the residue in the folded form of the protein
(Azarya-Sprinzak et al., 1997). Because of this, a number of slightly different
sequences may adopt similar structure and function (divergent evolution). If
the sequence, structure, and function of a set of related proteins are already
known, then inference rules can be derived. These rules can be applied to
classify a new sequence for which no structure or function is known. Such an
inference rule can be a set of conserved residues, such as a sequence motif or
structural motif, that is present in all the members of a related set of
proteins (Falquet et al., 2002; Guruprasad and Shivaprasad, 2000; Hofmann et
al., 1999; Hutchinson and Thornton, 1996). Making effective use of the sequence
information coming out of genome projects will require assigning structure and
function. Protein sequences that have evolved from a common parent share
similar structure and function. If the parent protein structure is known, then
one can apply comparative modeling techniques to obtain the geometric
information for the unknown protein. Hence, relating protein sequences to their
structural parent or to a known fold using computational techniques will be
critical to handling biological information effectively.
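
The idea that substitutions within a group of physicochemically similar amino
acids tend to be tolerated can be illustrated as follows. The grouping used
here is a simplified assumption made for this sketch, not the classification
from Taylor (1986) or any substitution matrix, and the function names are
illustrative.

```python
# Simplified physicochemical groups (an assumption for illustration only).
GROUPS = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "CGP",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}

def is_conservative(a, b):
    """True if both residues fall in the same assumed group."""
    return GROUP_OF[a] == GROUP_OF[b]

def classify_substitutions(seq1, seq2):
    """Count identical, conservative, and non-conservative positions
    in two aligned protein sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"identical": 0, "conservative": 0, "non_conservative": 0}
    for a, b in zip(seq1, seq2):
        if a == b:
            counts["identical"] += 1
        elif is_conservative(a, b):
            counts["conservative"] += 1
        else:
            counts["non_conservative"] += 1
    return counts
```

Comparing MKLD with MRIK under these groups gives one identical position, two
conservative exchanges (K to R, L to I), and one non-conservative exchange
(D to K).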

4. Biological Systems, Signals, and Pathways


Many genes are regulated at a given time inside a cell. Regulatory proteins
switch these genes on or off based on internal or external cues. A complex
network of proteins, small organic molecules, and ions facilitates this
regulatory process. For a cell to receive stimuli from the surrounding
environment and to devise appropriate responses, signaling pathways are
essential. Biological systems have evolved robust dynamic responses to a wide
variety of stimuli. Extensive positive and negative feedback networks
orchestrate this complex function. If one considers the multitude of factors
capable of eliciting cellular responses, it is not surprising that cellular
signaling pathways are extremely complex and diverse. Proteins present in the
extracellular environment can come from different sources. Those that are
secreted by cells surrounding the "recipient" cell are known as paracrine
signals, those that are released by organs distant from the recipient cell are
known as endocrine signals, and those released by the cell itself are known as
autocrine signals. The proteins in the extracellular milieu can be divided into
three broad categories, based upon their effects in the cell: (a) those causing
an immediate change in cellular metabolism, (b) those eliciting changes in gene
transcription, and (c) those causing fluctuations in electrical conductivity
across the plasma membrane.

One key aspect of protein binding to the cell surface is specificity. Since the
particular molecules binding to the cell surface are intended to elicit specific
responses, they must be very selective in the pathways that they initiate. At
the same time, it is equally important to consider interactions between different
cellular pathways, as the cell must respond collectively to a variety of stimuli
at any one time. Protein signaling is extremely broad and encompasses many
different signal transduction pathways. For example, the Notch signaling pathway
is critical in developmental processes coordinated by signal transducer proteins
and transcriptional activators, leading to changes at the gene transcription
level. The Notch pathway is activated upon contact with neighboring cells
expressing Notch ligands. In humans, the ligands capable of activating Notch
belong to the Delta and Serrate (Jagged) families. These ligands are membrane
bound; therefore, close cell proximity is required for activation of the Notch
pathway. In some ways, Notch signaling can be considered a "classical" pathway:
the binding of a Notch ligand to a Notch receptor ultimately results in the
translocation of the Notch intracellular domain to the nucleus, where it
affects gene transcription. Notch ligands are single-pass transmembrane
proteins that contain multiple epidermal growth factor (EGF)-like repeats in
the extracellular domain. There are several such signaling pathways that are
responsible for widely observed biological processes.

A large catalog of such signaling pathways is available at the BioCarta pathway
listing, http://www.biocarta.com/. Metabolic systems, immune response systems,
protein transport, the cell cycle, and development are some of the robust
processes that are fundamental to complex biological systems; they are
facilitated by macromolecular interactions occurring in different compartments
or organelles of the cell. One of the crucial events during evolution that was
responsible for the formation of a cell was the development of an outer
membrane. With further evolution and selection, the cells of the present day all
have a plasma membrane composed mainly of phospholipids. Cells are classified as
prokaryotes or eukaryotes based on the absence or presence of a functional
nucleus that contains DNA. Eukaryotic cells have a plasma membrane and
organelles such as the Golgi apparatus, endoplasmic reticulum, nucleus, and
mitochondria.

The first organisms on earth were unicellular, such as bacteria and protozoa.
The question that arises is what led to the evolution of multicellular
organisms. With our current knowledge of biology, we can explain the origin and
importance of cell-cell interactions. Cell-cell interactions are crucial and are
part of every aspect of the cell in eukaryotes. These interactions were
responsible for the evolution of multicellular organisms. By cell-cell
interaction we mean the communication between cells for division,
differentiation, reproduction, migration, apoptosis, contact inhibition, etc.
There are over 200 types of cells in the human body, broadly classified on the
basis of the tissue they are present in, namely epithelia, connective tissue,
nervous tissue, and muscle. Cooperation among cellular processes is required
for the induction of an antibody response in B cells as well as for the
sensitization of T cells. In addition, the action of activated T cells on
target cells is a cellular interaction. Macrophages are an essential
participant in some of these interactions. The body replaces dead cells and
repairs damage by two distinct processes, namely regeneration of injured tissue
by parenchymal cells and replacement by connective tissue.

The mechanism in both of these processes involves cell growth and differen-
tiation as well as cell-matrix interactions. Several proteins control the timing
of the events in the cell cycle, which is tightly regulated to ensure that cells di-
vide only when necessary. The loss of this regulation is the hallmark of cancer,
which is also due to loss of control in contact inhibition. Major control switches
of the cell cycle are cyclin-dependent kinases. Each cyclin-dependent kinase
forms a complex with a particular cyclin, a protein that binds and activates
the cyclin-dependent kinase. The kinase part of the complex is an enzyme that
adds a phosphate to various proteins required for progression of a cell through
the cycle. These added phosphates alter the structure of the protein and can
activate or inactivate the protein, depending on its function. There are specific
cyclin-dependent kinase/cyclin complexes at the entry points into the G1, S,
and M phases of the cell cycle, as well as additional factors that help prepare
the cell to enter S phase and M phase. Normal mammalian cells show contact
inhibition; that is, they respond to contact with other cells by ceasing cell divi-
sion. Therefore, cells can divide to fill in a gap, but they stop dividing as soon
as there are enough cells to fill the gap. This characteristic is lost in cancer
cells, which continue to grow after they touch other cells, causing a large mass
of cells to form.

5. The Outline of this Primer


In recent years, biological science has seen two fields emerge, genomics and
proteomics. These fields are advancing rapidly, and bioinformatics provides the
tools to analyze and interpret the vast amount of data they generate. Handling
and analyzing biological information is the subject of computational biology and
bioinformatics. Computational tasks faced in biology can be broadly divided
into selection and classification problems. Classification involves assigning a
member to a set or subset that has some defined properties. For example, given
a DNA sequence, a classification algorithm attempts to determine whether the
protein it codes for is a tyrosine kinase. Selection algorithms, on the other
hand, are involved in data mining, for example identifying a DNA repair-related
protein or a serine protease from genomic sequence. To deal with these types of
problems, we first need data sets that are accumulated and curated by experts.
Next, we need algorithms and specific software tools that can search, score,
and analyze biological information.
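
The difference between classification and selection can be made concrete with
a toy example. In the sketch below, classification answers a yes/no question
about one sequence, while selection mines a collection for all members that
qualify. The signature pattern is a hypothetical toy motif chosen only for
illustration; it is not a curated motif for tyrosine kinases, serine proteases,
or any other real protein family.

```python
import re

# Hypothetical toy signature, not a curated motif from any database.
TOY_SIGNATURE = re.compile(r"GK[ST]")

def classify(protein):
    """Classification: does this one sequence belong to the toy family?"""
    return bool(TOY_SIGNATURE.search(protein))

def select(proteins):
    """Selection: mine a collection and return the names of the members
    that carry the signature. `proteins` maps name to sequence."""
    return [name for name, seq in proteins.items() if classify(seq)]
```

Given a small collection such as {"p1": "MAGKSLL", "p2": "MAAAL"}, classify
labels each sequence individually, whereas select returns only the name p1.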

This practical introductory handbook contains 6 parts: Part 1: Introduction,
Part 2: Introduction to Bioinformatics Problems, Part 3: Introduction to
Bioinformatics Computations, Part 4: Some Widely Used Methods & Models in
Bioinformatics, Part 5: Bioinformatics Tools, Part 6: Bioinformatics: Current
and Future.

Part 1 is the introduction, with 8 chapters. Chapter 1 is the present
introductory chapter. Chapter 2 briefly introduces Cell Biology, Chapter 3
briefly introduces Genetics and Genomics, Chapter 4 introduces Proteomics,
Chapter 5 describes some important Model Organisms, Chapter 6 describes the
necessary Computing Fundamentals for Bioinformatics, Chapter 7 elaborates the
Math Primer for Bioinformatics, and Chapter 8 covers Biological Processes,
Experimental Methods & Machinery.

II Introduction to Bioinformatics Problems


9 DNA & Protein Sequencing
10 Genome Mapping
11 Sequences Alignment
12 Gene Prediction
13 Genome Analysis
14 Phylogenetic Analysis
15 Protein Folding
16 Structural Bioinformatics & Drug Discovery

III Introduction to Bioinformatics Computations


17 Statistical and Probabilistic Methods in
18 Computational Methods in Bioinformatics
19 Bioinformatics Data Mining
20 Some Algorithms in Bioinformatics
20.1 BLAST
20.2 FASTA
20.3 CLUSTALW
20.4 PHD
20.5 Predator
20.6 TRILOGY
20.7 Gibbs Sampler
20.8 DALI

IV Some Widely Used Methods & Models in Bioinformatics


21 Dynamic Programming And Bioinformatics
22 Neural Network And Bioinformatics
23 Hidden Markov Model (HMM) And Bioinformatics
24 Genetic Programming And Bioinformatics

V Bioinformatics Tools
25 Python - Primer Programming Language for Bioinformatics
26 Python And Bioinformatics
27 Tools and Libraries for Bioinformatics

VI Bioinformatics : Current & Future


28 Prominent Research Areas in Bioinformatics
29 Endless Horizon of Bioinformatics: Future Directions
30 The Crazy Corner with ALL WILD Imaginations

Thought of Mind
Chapter Layout-
• Definitions / Background / History
• Preliminaries needed for getting more use of the following text in the
Bioinformatics Primer

– Molecular Biology Preliminary: Amino Acids & Proteins; Structures
of Proteins
– Computer Science Preliminary
– Probability Preliminary

• Overview of the following chapters ...


Chapter 2

Introduction to Cell Biology

—Farjana Khatun

2.1 Cell
Cell: The cell is the building block of an organism.

The cell theory, developed by Matthias Jakob Schleiden and Theodor Schwann in
1839, states that all organisms are composed of one or more cells and that all
cells come from preexisting cells. Vital functions of an organism occur within
cells, and all cells contain the hereditary information necessary for regulating
cell functions and for transmitting information to the next generation of cells.

The cell is the building block of all living organisms. It is the functional
unit of life. There are many different types of cells. Some organisms, such as
amoebae and bacteria, consist of a single cell. The human body consists of
different types of cells: brain cells, skin cells, liver cells, stomach cells
etc. All these cells share common features but perform different functions.
According to structure, there are two types of cells: eukaryotic cells
(examples: fungi (including mushrooms), mammals, birds, fish, invertebrates,
plants etc.) and prokaryotic cells (examples: bacteria, cyanobacteria etc.).
Among prokaryotes, the most widely studied organism is E. coli. Eukaryotic
cells have a well-organized, membrane-bound nucleus, whereas prokaryotic cells
have no defined nucleus. On the basis of function, there are two types of
cells: somatic cells (forming the body of the organism) and germ cells (giving
rise to sperm and eggs, i.e. involved in reproduction).

1 The chapter preamble will be written later.

2.1.1 Cell Structure


Animal cells and plant cells possess similar organelles, with some differences
(plant cells have a rigid cell wall and chloroplasts, whereas animal cells lack
them). Organelles are small structures that help to carry out the functions of
the cell.

Figure 2.1: A Typical Animal (Eukaryotic) Cell

There are different types of organelles in the cell. The most important among
them are:

Nucleolus- the region within the nucleus where ribosomes are assembled.

Nucleus- the brain or information centre of the cell. It is spherical in shape,
is separated from the rest of the cell by the nuclear membrane, and carries the
chromosomes. It is the place where DNA is replicated and where RNA (mRNA) is
synthesized through the process of transcription.

Ribosome- the protein production machine. It is an organelle composed of a
complex of RNA and protein molecules, often found attached to the endoplasmic
reticulum. Each ribosome has two subunits, one small and one large, which remain
separate in the cytoplasm until they join to begin the process of translation.

Figure 2.2: Protein Synthesis is Started in the Ribosome

Vesicle- transports and secretes hormones, neurotransmitters etc. that are
packaged in the Golgi apparatus.

Rough Endoplasmic Reticulum (ER)- called rough ER because of the numerous
ribosomes on its surface. It acts as a transport network for molecules
(proteins) targeted for specific modification or destinations throughout the
cell.

Golgi Apparatus- usually a stack of membrane vesicles found near the rough
endoplasmic reticulum. Its primary function is to process and package the
macromolecules (proteins and lipids) that are synthesized in the cell. It plays
a vital role in the processing of proteins.

Cytoskeleton- an organized network of protein filaments, responsible for the
shape of the cell, the internal movement of cell organelles, cell locomotion,
and muscle fibre contraction.

Smooth Endoplasmic Reticulum- called smooth ER because it lacks ribosomes on
its surface. It plays different roles depending on the type of cell. Its
functions are mostly linked to lipid and hormone synthesis.

Mitochondria- self-replicating organelles and the power house of the cell.

Vacuole- most commonly found in plant cells. Its function is to store
nutrients, waste products etc.

Cytoplasm- the jelly-like material that holds all the organelles of the cell.

Lysosome- the cellular digestive system, found in animal cells but rare in
plant cells. It contains digestive enzymes that break down worn-out organelles
as well as engulfed viruses and bacteria.

Centrosome- organizes the cytoskeleton of the cell and plays an important role
during cell division by forming the mitotic spindle that separates the
replicated chromosomes.

All these organelles are surrounded by the cell membrane, which is composed of
phospholipids and proteins and is semi-permeable in nature. Mitochondria and
chloroplasts (found in plants, where they produce food through the process of
photosynthesis) have their own circular genomes, which are separate and
distinct from the nuclear genome of the cell.

Tissue A tissue is a collection of cells. Similar types of cells work
together to perform a specific function.

Organ Organs are the next level above tissues. An organ is a structure that
contains at least two different types of tissue functioning together to serve
a specific purpose. Examples of organs in the body are the liver, kidneys,
heart, and even the skin. In fact, skin is the largest organ in the human body.
Organ systems (cardiovascular system, skeletal system, digestive system etc.)
are groups of organs that provide a common function.

Organism An organism gets its structural form from the combination of
different organ systems.

The above describes the hierarchy of an organism's basic structure. But how
does an organism originate from a single cell? The mechanism can be understood
by following the development of a human being from a single zygote cell.

2.1.2 Cell Cycle & Cell Division Cycle


The cell cycle is the series of events that takes place in a cell leading to
its duplication and division. In prokaryotic cells, division occurs through
binary fission (amitosis), which is quite simple; in eukaryotic cells, division
occurs through the more complex process known as mitosis. In eukaryotes, the
cell cycle can be divided into two periods:

• Interphase: During this period, the cell grows, accumulates the nutrients
needed for mitosis, and duplicates its genetic material (DNA). Interphase can
be divided into 3 sub-periods:

– G1 phase: the cell prepares for DNA synthesis
– S phase: DNA replication occurs in this phase
– G2 phase: a checkpoint that ensures the cell is ready to enter M
phase and divide

Figure 2.3: Cell Cycle

• Mitosis (M phase): During this phase, the cell splits itself into two
distinct cells (daughter cells). Mitosis process can be described by five
consecutive steps known as

– Prophase
– Prometaphase
– Metaphase
– Anaphase
– Telophase

There is also a resting phase (G0 phase), in which the cell leaves the cell
cycle and stops dividing. The cell-division cycle is a vital process by which a
single-celled fertilized egg develops into a mature organism, as well as the
process by which hair, skin, blood cells, and some internal organs are renewed.

(There are two types of cell division in eukaryotic cells: mitosis and meiosis.
Mitosis is the process by which a cell divides into two new cells. The genetic
material in the new cells is identical to that of the mother cell. That is why
mitosis is also called equational division. In meiosis, on the other hand, four
new cells are formed from one cell, and each carries 50% of the genetic material
of the mother cell. That is why meiosis is known as reductional division.)
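
The bookkeeping behind equational and reductional division can be written down
as a tiny sketch. It tracks only the chromosome number of each daughter cell
and ignores everything else about the two processes; the function names are
illustrative, and the human count of 46 is used simply as an example input.

```python
def mitosis(chromosomes):
    """Equational division: two daughter cells, each with the full count."""
    return [chromosomes, chromosomes]

def meiosis(chromosomes):
    """Reductional division: four daughter cells, each with half the count."""
    if chromosomes % 2:
        raise ValueError("expects an even (diploid) chromosome number")
    return [chromosomes // 2] * 4
```

For a cell with 46 chromosomes, mitosis yields two cells with 46 chromosomes
each, while meiosis yields four cells with 23 each.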

2.2 Chromosome
A chromosome is the carrier of genetic information from one generation to
another. In eukaryotes it is present in the nucleus of the cell as a thread-like
structure made of DNA and protein. The shape and number of chromosomes vary
widely among organisms. For example, (i) eukaryotic cells have large linear
chromosomes whereas prokaryotic cells contain small circular chromosomes, and
(ii) human beings have 46 chromosomes whereas apes and fruit flies have 48 and 8
chromosomes respectively. The 46 human chromosomes form 23 pairs. In each pair,
one chromosome is inherited from the mother and the other from the father. One
of the 23 pairs determines sex: XY for male and XX for female.
Chromosomes are visible under a light microscope when stained with certain
dyes that reveal a pattern of light and dark bands (karyotype analysis).
Chromosomes can be distinguished from each other on the basis of differences in
size and banding pattern. Major chromosomal abnormalities such as missing or
extra copies, gross breaks and rejoining etc. can be detected by karyotype
analysis. Karyotype analysis reveals, for example, that
Down Syndrome is due to the presence of an extra (third) copy of chromosome 21.
Turner Syndrome is due to the loss of one of the two sex chromosomes.
Recombination and manipulation of chromosomes therefore play a pivotal role in
genetic diversity.

2.3 DNA (DeoxyriboNucleic Acid)


DNA is present in the chromosomes of a cell nucleus. The mystery of life is coded
in DNA. It defines an individual’s overall characteristics such as sex, physical
appearance (whether someone will be tall or short, the color of skin, hair and
eyes, and many others), behavior etc. It is capable of self-replication (making
a copy of itself). It is composed of two long chains (double strands) of
nucleotides twisted into a double helix (like a spiral staircase).

2.4 RNA (RiboNucleic Acid)


RNA is mainly transcribed from DNA. Like DNA, it is composed of nucleotides, but
it usually has a single chain (single strand). Both DNA and RNA are known as
nucleic acids. There are also some structural differences between DNA and RNA,
which are discussed in the section on nucleotides.

2.5 Nucleotide
Nucleotides are the building blocks (monomers) of DNA and RNA. Three components
are essential to form a nucleotide. These are shown in the diagram below.

Figure 2.4: Nucleotide by Parts

Nitrogen Base: There are five types of nitrogen bases: adenine (A),
guanine (G), cytosine (C), thymine (T) and uracil (U). A, T, C and G are
found in DNA, and A, U, C and G are found in RNA.

[S] means Sugar

Figure 2.5: Structure of A, T, C, G and U

In DNA, adenine is bonded with thymine by two hydrogen bonds and
cytosine is bonded with guanine by three hydrogen bonds.

A=T    C≡G
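The pairing rule can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the original text: the names `PAIR`, `BONDS`, `complement` and `hydrogen_bonds`, and the example strand, are assumptions chosen for the illustration.

```python
# Watson-Crick pairing in DNA: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds per base pair: A=T has two, C≡G has three.
BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def complement(strand):
    """Return the base that pairs with each base of `strand`, position by position."""
    return "".join(PAIR[base] for base in strand)


def hydrogen_bonds(strand):
    """Count the hydrogen bonds that hold `strand` to its complementary strand."""
    return sum(BONDS[base] for base in strand)


if __name__ == "__main__":
    seq = "GATGCTAGGCA"  # hypothetical example strand
    print(complement(seq))
    print(hydrogen_bonds(seq))
```

Because C≡G pairs carry three hydrogen bonds, a stretch rich in C and G is held together by more hydrogen bonds than an A and T rich stretch of the same length.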

Phosphate group: A nucleotide contains from one to three phosphate groups.
Nucleotides are named on the basis of the number of phosphate groups (for
example, adenosine mono-, di- and triphosphate: AMP, ADP and ATP).

Figure 2.6: Phosphate Molecule

Figure 2.7: Pentose Sugar

Pentose Sugar: The pentose sugar is one of the three components of a
nucleotide and has five carbon atoms (C1’ to C5’). In RNA the sugar is
ribose, in which C2’ carries an OH group, whereas in DNA C2’ lacks the
oxygen atom and carries only a hydrogen atom. Because of the missing
oxygen atom at the C2’ position, the sugar in DNA is called "deoxyribose
sugar". The nitrogen base (A, T, C, G) is attached at the C1’ position of
the sugar molecule and the phosphate group is attached at the C5’
position. The basic structure of a nucleotide is shown in Figure 2.8.
From the structure of a nucleotide, we see that it has two ends, named
the 3’ end and the 5’ end. When another nucleotide is attached to it, the
5’ phosphate of the new nucleotide attaches to its 3’ end.
In this way, nucleotides are attached to one another and form nucleotide
sequences. A sequence of deoxyribonucleotides is known as a DNA sequence.

Figure 2.8: Nucleotide

Figure 2.9: Nucleotide Sequence

To understand the sequential arrangement of nucleotides in DNA, let us
imagine the structure of a ladder. To make a ladder, we need two lengths
of wood or metal (the backbone of the ladder) that are joined together by
steps.
The sugar and phosphate groups of the nucleotides make the backbone
(strand) of a DNA sequence, like the wood or metal of the ladder. The
nitrogen base of one strand forms a base pair by bonding with the base of
the opposite strand. These base pairs (A=T and C≡G) act like the steps of
the ladder. Now we have to twist the ladder to imagine the 3-D structure
of DNA.

Figure 2.10: DNA Structure

In a sequence of nucleotides, the sugar and phosphate groups (which form
the backbone of the sequence) are the same in every nucleotide; only the
nitrogen bases (A, T, C, G, U) change. Two positions remain (the 3’ end and
the 5’ end) where a new nucleotide can attach, and that is why sequences
are written as

3’...GATGCTAGGCA...5’

5’...CTACGATCCGT...3’
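The two strands written above run in opposite directions, so the partner of a strand is obtained by pairing every base and then reversing the result. A minimal sketch of this, assuming the hypothetical helper name `reverse_complement`:

```python
# Watson-Crick pairing in DNA.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand_5_to_3):
    """Return the partner strand, also written in the 5' to 3' direction."""
    # Pair each base, walking the input backwards so the result reads 5' -> 3'.
    return "".join(PAIR[base] for base in reversed(strand_5_to_3))


if __name__ == "__main__":
    lower = "CTACGATCCGT"  # the 5'...3' strand from the text
    partner = reverse_complement(lower)
    # Reversing the partner shows it in the 3' -> 5' orientation used in the text.
    print("3'..." + partner[::-1] + "...5'")
    print("5'..." + lower + "...3'")
```

Applying the function twice returns the original strand, which is a quick sanity check on any implementation of this rule.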

More to Come!!!
• How a Cell Becomes a Human?
• Central Dogma of Life or Biology
• Protein Synthesis Process

• Cell Division
Chapter 3

Introduction to Genetics
and Genomics

—Farjana Khatun


3.1 Concept of Gene


Darwin used the term gemmule to describe a microscopic unit of inheritance.
What later came to be known as chromosomes had been observed separating out
during cell division by Wilhelm Hofmeister as early as 1848.

The idea that chromosomes are the carriers of inheritance was expressed in
1883 by Wilhelm Roux.

The modern concept of the gene originated with the nineteenth-century
Augustinian monk Gregor Mendel, who systematically studied heredity in pea
plants (Pisum sativum) and hypothesized a factor that conveys traits from parent
to offspring. He spent over 10 years of his life on one experiment. Although he
did not use the term gene, he explained his results in terms of inherited
characteristics. Mendel was also the first to hypothesize

• Independent assortment,

• Distinction between dominant and recessive traits,

• Distinction between a heterozygote and a homozygote, and

• Difference between genotype (the genetic material of an organism) and
phenotype (the visible traits of that organism).

Although Mendel’s work was largely unrecognized after its first publication
in 1866, it was rediscovered in 1900 by three European scientists, Hugo de Vries,
Carl Correns, and Erich von Tschermak, who had reached similar conclusions
from their own research.

Danish botanist Wilhelm Johannsen coined the word "gene" in 1909 to describe
these fundamental physical and functional units of heredity, while the
related word genetics was first used by William Bateson in 1905. The word
was derived from Hugo de Vries’ 1889 term pangen for the same concept, itself
a derivative of the word pangenesis coined by Darwin (1868). The word
pangenesis is made from the Greek words pan (a prefix meaning "whole",
"encompassing") and genesis ("birth") or genos ("origin").

3.2 Discovery Chronology Revealing the Concept of Central Dogma of Life
• In 1910, Thomas Hunt Morgan showed that genes reside on specific
chromosomes. He later showed that genes occupy specific locations on the
chromosome. With this knowledge, Morgan and his students began the
first chromosomal map of the fruit fly Drosophila.

• In 1928, Frederick Griffith showed that genes could be transferred.

• In 1941, George Wells Beadle and Edward Lawrie Tatum showed that
mutations in genes caused errors in specific steps in metabolic pathways.
This showed that specific genes code for specific proteins, leading to the
”one gene, one enzyme” hypothesis.

• Oswald Avery, Colin Munro MacLeod, and Maclyn McCarty showed in
1944 that DNA holds the gene’s information.

• In 1953, James D. Watson and Francis Crick demonstrated the molecular
structure of DNA.

Together, these discoveries established the central dogma of molecular
biology, which states that proteins are translated from RNA, which is
transcribed from DNA.

3.3 Discovery of Gene Sequence


In 1972, Walter Fiers and his team at the Laboratory of Molecular Biology of
the University of Ghent (Belgium) were the first to determine the sequence
of a gene: the gene for Bacteriophage MS2 coat protein.

Richard J. Roberts and Phillip Sharp discovered in 1977 that genes can be
split into segments. This led to the idea that one gene can make several
proteins. More recent results (as of 2003-2006) have made the notion of a gene
appear more slippery. In particular, genes do not seem to sit side by side on
DNA like discrete beads. Instead, regions of the DNA producing distinct proteins
may overlap, so that the idea emerges that "genes are one long continuum".

3.4 Central Dogma of Biology


Following the discovery of the molecular structure of DNA, it became clear that
DNA is transcribed into RNA, and that proteins are then synthesized from RNA.
The processes involved are replication, transcription and translation.

Figure 3.1: Central Dogma of Biology

Replication: DNA can create its own copy through the process of replication
or copying mechanism.

Transcription: It is the process through which a single-stranded RNA molecule
named mRNA (messenger RNA) is produced from a DNA strand. In eukaryotes,
transcription occurs in the nucleus of the cell. Transcription is carried out
by an enzyme known as RNA polymerase. To initiate transcription, the polymerase
first recognizes and binds a promoter region of the gene. Then it reads the
template strand in the 3’ to 5’ direction and synthesizes the RNA in the 5’ to
3’ direction. The nucleotide sequence of the RNA is complementary to that of the
DNA strand from which it is synthesized, which is known as the template strand.

The RNA molecule produced by the polymerase is known as the primary transcript
and must undergo post-transcriptional modifications before being exported to
the cytoplasm for translation. The primary transcript undergoes splicing, which
removes the non-coding regions known as introns and joins the coding segments
known as exons to form mRNA. The DNA strand whose sequence matches that of the
mRNA (with T in place of U) is known as the coding strand.
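The two steps described above, copying the template strand into RNA and then removing introns, can be sketched as follows. This is a toy illustration under stated assumptions: the template sequence and the exon coordinates are hypothetical, and the function names `transcribe` and `splice` are not from the original text. Real splicing is carried out by cellular machinery that recognizes sequence signals, not by fixed coordinates.

```python
# DNA template base -> RNA base added opposite it by RNA polymerase.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5):
    """Build the RNA (read 5' -> 3') from a template strand written 3' -> 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


def splice(primary_transcript, exons):
    """Join the exon segments; everything outside them (introns) is dropped.

    `exons` is a list of (start, end) index pairs, with `end` exclusive.
    """
    return "".join(primary_transcript[start:end] for start, end in exons)


if __name__ == "__main__":
    template = "TACCGGCATTTCACT"  # hypothetical template, written 3' -> 5'
    primary = transcribe(template)
    # Hypothetical exon coordinates: positions 6..11 are treated as an intron.
    mrna = splice(primary, [(0, 6), (12, 15)])
    print(primary)
    print(mrna)
```

Note that the RNA uses U wherever the coding DNA strand would have T.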

Figure 3.2: Figure to be Drawn

Translation: After transcription in the nucleus, the mRNA moves to the
cytoplasm. There it binds to the ribosomal subunits and acts as a template for
synthesizing a new protein. The genetic code is read three consecutive
nucleotides (a codon) at a time via interaction with specialized RNA molecules
called tRNA (transfer RNA). Each tRNA has three unpaired bases, known as the
anticodon, that are complementary to the codon it reads. The tRNA is also
covalently attached to the amino acid specified by that codon. When the tRNA
binds to its complementary codon in an mRNA strand, the ribosome ligates its
amino acid cargo to the growing polypeptide chain. This process of synthesizing
a protein from mRNA is known as translation. During and after its synthesis, the
protein must fold into its active three-dimensional structure before it can
carry out its cellular function.
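Reading an mRNA codon by codon can be sketched with a lookup table of the standard genetic code. This is a simplified illustration, not a model of the ribosome: the function name `translate`, the choice to start at the first AUG, and the example sequences are assumptions made for the sketch.

```python
from itertools import product

# Standard genetic code with codons enumerated in U, C, A, G order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to the first stop codon.

    Returns the protein as a string of one-letter amino acid codes.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUGA"))         # hypothetical mRNA
    print(translate("GGAUGUUUGGGUAGCC"))  # hypothetical mRNA with leading bases
```

Each tRNA in the text corresponds to one lookup in `CODON_TABLE`: the codon selects the amino acid that is added to the chain.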

Figure 3.3: Synthesis of a Protein from mRNA by Translation(part-a)

3.5 Human Genome Project


The Human Genome Project (HGP) was funded by the National Institutes of Health
and the Department of Energy. Its goal was to sequence the entire human genome,
i.e. the full set of genomic material found in the chromosomes. Begun in 1990,
the project was completed in 2003, with 99 percent of the genome listed as
finished, high-quality sequence (a complete sequence of nucleotides with no gaps
or ambiguities and an error rate of less than one base per 10,000). The final
HGP papers were published in 2006.

Figure 3.4: Synthesis of a Protein from mRNA by Translation(part-b)

The first step in accessing the human genome was the preparation of a map
of the individual chromosomes by karyotyping. Each chromosome has a
characteristic banding pattern when stained with special dyes. The patterns are
useful as reference points for the preparation of more detailed genetic maps.
Abnormal patterns are characteristic of some genetic disorders and several
cancers.

The following are the highlights of the Human Genome Project:

• All the chromosomes (23 pairs) have been completely sequenced.

• The total number of genes is now estimated as 25,000-30,000. Almost
20,000 protein-coding genes are confirmed and an additional 2188 are
predicted on the basis of DNA segments.

Defining a gene is not straightforward. For example, small genes can easily
be overlooked in a nucleotide sequence, a gene may code for more than one
protein, and two or more genes can overlap.

• Although more than 99 percent of human nucleotide bases are the same in
all people, there are 1.4 million single-base differences (Single Nucleotide
Polymorphisms, SNPs). Some of these SNPs are associated with specific
diseases.
• Roughly 10,000 different single gene disorders have been described. Most
are rare but collectively they may affect 1 in every 200 births. Over 900
of these disorders have been mapped on the genome. Genetic screening
and diagnostic tests are now performed to determine genetic abnormalities
and disorders linked to them.
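The idea of single-base differences between two otherwise identical sequences can be sketched as a position-by-position comparison. The two short sequences below are hypothetical and the function name `single_base_differences` is an assumption for this illustration; real SNP discovery works on aligned genome-scale data and on population frequencies.

```python
def single_base_differences(seq_a, seq_b):
    """List (position, base_a, base_b) wherever two aligned sequences differ.

    Positions are counted from 0. Both sequences must have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


if __name__ == "__main__":
    person_1 = "ATGCCGTAAGCT"  # hypothetical sequence
    person_2 = "ATGCTGTAAGCA"  # hypothetical sequence
    differences = single_base_differences(person_1, person_2)
    identity = 1 - len(differences) / len(person_1)
    print(differences)
    print(f"identical at {identity:.0%} of positions")
```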

The completion of the human genome sequence has stimulated new approaches
for diagnosing diseases and predicting disease susceptibility. In 2006, the
Genes, Environment and Health Initiative (GEI) was launched as a joint
collaboration of the National Institute of Environmental Health Sciences (NIEHS)
and the National Human Genome Research Institute (NHGRI) to understand the
link between genes and environment and why certain individuals develop diseases.
It conducts genetic studies of individuals with specific diseases and of their
personal exposure to environmental factors such as sun and chemicals, diet and
physical activity.

The Cancer Genome Atlas (TCGA) was launched in 2006 by the NHGRI and the
National Cancer Institute with the immediate goal of compiling an atlas of
genome changes (mutations) in three tumors: brain cancer (glioblastoma),
lung cancer and ovarian cancer.

3.6 Genome
The total complement of genes in an organism or cell is known as its genome.
Genes that appear together on one chromosome of one species may appear on
separate chromosomes in another species. The study of genome is known as
Genomics.

Cells or organisms with only one copy of each chromosome are called hap-
loid; those with two copies are called diploid; and those with more than two
copies are called polyploid.

Pseudogenes: These are sequences that are related to functional genes but
cannot be translated into functional proteins.

Genome Maps: A genome map is a graphical representation of a genome that
provides information about the location of genes: their order along a
chromosome and the distances between them. There are two types of genome maps:
genetic maps and physical maps.

• Paralogs
• Orthologs
• Xenologs
• Genetic Code

3.7 Common Terms used in Genetics


Chromosomes contain DNA, and genes are functional segments of DNA. Every
nucleated somatic cell in our body carries copies of the original 46
chromosomes that we had as a zygote. Those chromosomes and their component genes
constitute our genotype.

Genotype: It is the internally coded, inheritable genetic information of an
organism, which determines the hereditary potentials and limitations of an
individual from the embryonic stage to adult life. Each individual has a unique
genotype (except for identical twins, who are derived from the same fertilized
egg).

Our genotype is derived from the genotypes of our parents. Yet we are neither
an exact copy of either parent nor an easily identifiable mixture of their
characteristics.

The instructions that are present in your genotype finally determine the
anatomical (related to shape and size) and physiological (related to the
functions of different parts of the body) characteristics that make you a unique
individual. Those anatomical and physiological characteristics constitute your
phenotype, i.e. appearance (e.g. hair and eye color, skin tone, foot size etc.)
and behavior.

Phenotype: This is the "outward, physical manifestation" of the organism.
These are the physical parts, the sum of the atoms, molecules, macromolecules,
cells, structures, metabolism, energy utilization, tissues, organs, reflexes and
behaviors; anything that is part of the observable structure, function or behavior
of a living organism.

Genotyping: This is the process of elucidating the genotype of an individual
with a biological assay, also known as a genotypic assay. Techniques include
PCR, DNA fragment analysis, allele-specific oligonucleotide (ASO) probes, DNA
sequencing, and nucleic acid hybridization to DNA microarrays or beads. Several
common genotyping techniques include restriction fragment length polymorphism
(RFLP), terminal restriction fragment length polymorphism (t-RFLP), amplified
fragment length polymorphism (AFLP), and multiplex ligation-dependent probe
amplification (MLPA).

DNA fragment analysis can also be used to determine such disease-causing
genetic aberrations as microsatellite instability (MSI), trisomy or aneuploidy,
and loss of heterozygosity (LOH). MSI and LOH in particular have been
associated with cancer cell genotypes for colon, breast and cervical cancer. The
most common chromosomal aneuploidy is a trisomy of chromosome 21, which
manifests itself as Down syndrome. Current technological limitations typically
allow only a fraction of an individual’s genotype to be determined efficiently.

Homologous Chromosomes: Each somatic cell contains 23 pairs of chromosomes.
One member of each pair is contributed by the spermatozoon (father) and the
other by the ovum (mother). The two members of each pair are known as homologous
chromosomes.

Twenty two of those pairs are called autosomal chromosomes. Most of the
genes of autosomal chromosomes affect the somatic characteristics such as the
hair color, skin pigmentation etc. The chromosomes of the 23rd pair are called
sex chromosomes; one of their functions is to determine whether the individual
is male or female.

Figure 3.5: Normal Human Karyotype



Locus: The two chromosomes in a homologous autosomal pair have the same
structure and carry genes that affect the same traits. Suppose that one member
of the pair contains three genes in a row, the first gene determining hair
color, the second eye color and the third skin pigmentation. The other
chromosome carries genes that affect the same traits, and the genes are in the
same order and located at equivalent positions on their respective chromosomes.
The position of a gene on a chromosome is called its locus.

Allele: The two chromosomes in a pair may not carry the same form of each
gene. The various forms of a given gene are called alleles. These alternate forms
of gene determine the precise effect of the gene on phenotype.

Homozygous: If the two chromosomes of a homologous pair carry the same
allele of a particular gene, then you are called homozygous for the trait affected
by the gene. For example, if you receive a gene for curly hair from your father
and a gene for curly hair from your mother, you will be homozygous for curly
hair. About 80 percent of an individual’s genome consists of homozygous alleles.

Heterozygous: The chromosomes of a homologous pair have different origins,
one paternal and the other maternal. They do not necessarily carry the
same alleles. When you have different alleles for the same gene, you are
heterozygous for the trait determined by the gene. The phenotype that results
from a heterozygous genotype depends on the nature of the interaction between
the corresponding alleles. For example, if you received a gene for curly hair
from your father and a gene for straight hair from your mother, the type of your
hair will depend on the relationship between the alleles. Your hair may be curly
or straight or even wavy.

Interaction between Alleles

Strict Dominance: An allele that is dominant will be expressed in the
phenotype. An allele that is recessive will be expressed in the phenotype only
if that same allele is present on both chromosomes of a homologous pair.
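Strict dominance can be illustrated by enumerating the offspring of a cross (a Punnett square). The sketch below assumes a common textbook convention that is not stated in the original text: the dominant allele is written as an uppercase letter and the recessive allele as the same letter in lowercase. The function names `cross` and `phenotype` are likewise hypothetical.

```python
from collections import Counter
from itertools import product


def cross(parent_1, parent_2):
    """Count offspring genotypes for two parents, e.g. cross("Aa", "Aa").

    Each parent passes on one of its two alleles with equal chance.
    """
    offspring = Counter()
    for allele_1, allele_2 in product(parent_1, parent_2):
        # Sort so that "aA" and "Aa" are counted as the same genotype.
        offspring["".join(sorted(allele_1 + allele_2))] += 1
    return offspring


def phenotype(genotype):
    """Under strict dominance a single dominant (uppercase) allele is enough."""
    return "dominant" if any(a.isupper() for a in genotype) else "recessive"


if __name__ == "__main__":
    genotypes = cross("Aa", "Aa")  # both parents heterozygous
    print(dict(genotypes))
    traits = Counter()
    for genotype, count in genotypes.items():
        traits[phenotype(genotype)] += count
    print(dict(traits))
```

For two heterozygous parents the sketch gives three offspring with the dominant phenotype for every one with the recessive phenotype, because the recessive trait appears only in the homozygous recessive genotype.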

Incomplete Dominance: Heterozygous alleles produce a phenotype
that is distinct from the phenotypes of individuals who are homozygous for
either allele. For example, a gene that affects the shape of the red blood cells.
one allele. For example, a gene that affects the shape of the red blood cells.
Individuals with homozygous

Co-dominance:
Chapter 4

Introduction to Proteomics

—Farjana Khatun


4.1 Introduction
The field of proteomics is much wider than genomics. Primarily, the term
"proteomics" means analysis of the protein profile of tissues. The proteome
refers to all the proteins expressed in a cell, tissue or organism. The genome
is a constant feature of an organism, whereas the proteome varies with the
nature of the tissue, state of development, disease or effect of drugs, and so
it varies with time. The primary structure of a protein (its sequence of amino
acids) is determined from genomic data (the sequence of nucleotides) through the
processes of transcription (synthesis of mRNA from DNA) and translation (protein
synthesis from mRNA). The primary structure then folds into its secondary and
finally its 3-D structure, and the protein may undergo post-translational
modification. Post-translational processing has a pivotal role in determining
the destination and function of all synthesized proteins. Determining the
primary structure of a protein is relatively easy, but how a protein attains its
3-D structure, and what that 3-D structure is, remain difficult problems for
researchers. Because it reveals ............

Benefit what we get from proteomics???


To understand proteomics, it is necessary to learn about amino acids; the
basic structure of proteins; the primary, secondary and tertiary structure of
proteins; and the factors that lie behind their structural conversion.

4.2 Protein
The genetic code consists of triplets of bases (nucleotides) in a DNA
sequence, which specify the linear sequence of amino acids (known as the
primary structure of a protein). The primary structure of a protein determines
how it can fold and how it interacts with other molecules in the cell to perform
its function. The primary structures of all the diverse proteins are built
from 20 amino acids arranged in a linear sequence determined by the genetic
code.

4.2.1 Amino Acids


Amino acids are critical to life, and have many functions in metabolism. One
particularly important function is to serve as the building blocks of proteins,
which are linear chains of amino acids. Amino acids can be linked together
in varying sequences to form a vast variety of proteins. Twenty amino acids
are naturally incorporated into polypeptides and are called proteinogenic or
standard amino acids. Amino acids are classified into two groups (essential and
nonessential) on the basis of nutritional requirements. Essential amino acids
[isoleucine, lysine, leucine, methionine, phenylalanine, threonine, tryptophan,
valine, (histidine, arginine - essential only for children)] are not synthesized
in the body and must be supplied in the diet, whereas the nonessential amino
acids can be synthesized in the body as needed. Lack of essential amino acids
leads to various abnormalities due to abnormal protein synthesis. Proteins of
animal origin, such as egg, meat and fish, contain all the essential amino acids.

4.2.2 General properties of Amino acids


4.2.2.1 Structure

Each of the amino acids used for protein synthesis has the same general
structure. A carboxylic acid group, an amino group, a hydrogen atom and a
chemical group called a side chain, which is different for each amino acid, are
attached to the α-carbon.
In a protein, these amino acids are joined into a linear polymer called a
polypeptide chain through peptide bonds between the carboxyl group of one amino
acid and the amino group of the next amino acid.

Amino Group: NH2 - C - COOH

Figure 4.1: Structure of Amino Acid (INCOMPLETE...)

Figure 4.2: Protein formation from amino acids

4.2.2.2 Zwitter Ion


In solution, a free amino acid exists as a zwitterion (an ion in which the amino
group is positively charged and the carboxyl group is negatively charged).
Zwitterion formation depends upon the pH of the medium in which the amino acid
is present.

4.2.2.3 Isomerism
Of the standard α-amino acids, all but glycine can exist in either of two optical
isomers, called L or D amino acids, which are mirror images of each other.
While L-amino acids represent all of the amino acids found in proteins during
translation in the ribosome, D-amino acids are found in some proteins produced
by enzymatic post-translational modifications after translation and
translocation to the endoplasmic reticulum, as in exotic sea-dwelling organisms
such as cone snails. They are also abundant components of the peptidoglycan cell
walls of bacteria, and D-serine may act as a neurotransmitter in the brain.

NH3+ - C - COO−

Figure 4.3: Structure of Zwitter Ion (INCOMPLETE...)

4.2.2.4 Classification of Amino acids


The twenty amino acids have different side chains and display different
physico-chemical properties (polarity, acidity, basicity, ability to form
hydrogen bonds). They are classified according to the chemical properties of
their side chains (R).

Amino acids can be classified into two categories on the basis of polarity.

1. Non-polar or Hydrophobic Amino Acids: They have an equal number
of amino and carboxyl groups and are neutral. These amino acids are
hydrophobic and do not have a charge on the 'R' group. The amino acids
which belong to this group are alanine, valine, leucine, isoleucine,
phenylalanine, glycine, tryptophan, methionine and proline.

2. Polar amino acids: These are hydrophilic amino acids and can be
subcategorized as follows.
Polar amino acids with no charge: These amino acids do not
have a charge on the 'R' group. They take part in the hydrogen
bonding of protein structure. The amino acids which belong to this group
are serine, threonine, tyrosine, cysteine, glutamine and asparagine.
Polar amino acids with positive charge (Basic amino acids):
These amino acids have more amino groups than carboxyl groups, which
makes them basic. They carry a positive charge on the 'R' group. The
examples are lysine, arginine and histidine.
Polar amino acids with negative charge (Acidic amino acids):
These amino acids have more carboxyl groups than amino groups, which
makes them acidic. They carry a negative charge on the 'R' group and are
also called dicarboxylic mono-amino acids. They are aspartic acid and
glutamic acid.
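The four groups listed above can be captured in a small lookup table keyed by the one-letter amino acid code. This is a sketch of the grouping exactly as given in this section; other textbooks draw the boundaries slightly differently. The names `SIDE_CHAIN_CLASS` and `composition`, and the example peptide, are assumptions for the illustration.

```python
from collections import Counter

# One-letter codes grouped as in the classification above.
GROUPS = [
    ("AVLIFGWMP", "non-polar"),
    ("STYCQN", "polar, uncharged"),
    ("KRH", "polar, positive (basic)"),
    ("DE", "polar, negative (acidic)"),
]

# Map every one-letter code to the name of its group.
SIDE_CHAIN_CLASS = {code: label for codes, label in GROUPS for code in codes}


def composition(protein):
    """Count how many residues of `protein` fall into each side-chain group."""
    return Counter(SIDE_CHAIN_CLASS[residue] for residue in protein)


if __name__ == "__main__":
    peptide = "AEESSKAVKYYT"  # hypothetical peptide in one-letter code
    for label, count in composition(peptide).items():
        print(f"{label}: {count}")
```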

Figure 4.4: Amino Acids Classification

The graph in Figure 4.5 shows the location of the 20 amino acids in
different regions of a protein tertiary structure. The vertical axis shows the
fraction of amino acid residues that are highly buried within the protein core
(inaccessible to water), while the horizontal axis shows the amino acid names
in one-letter code. There is a very small fraction of buried charged residues,
while for the non-polar amino acids the fraction is very high.

Figure 4.5: Amino Acids Location Distribution

The propensity of amino acid residues to be (or not to be) in contact with polar
solvent largely controls the distribution of each of the 20 amino acids within the
volume of a protein structure. Thus, most protein molecules have a hydrophobic
core, which is not accessible to solvent and is built up by hydrophobic amino
acids. On the other hand, polar and charged amino acids preferentially cover the
surface of the molecule and are in contact with the solvent. Very often they also
interact with each other: positively and negatively charged amino acids form
so-called salt bridges between each other, while polar amino acid side chains
get involved in hydrogen bonding with side chains or main chain atoms and with
water. Since these interactions are crucial for the stabilization of the protein
tertiary structure, they are normally conserved within a protein family.
(http://www.proteinstructures.com/Structure/Structure/amino-acids.html).

4.3 The Structure of Proteins


Proteins are formed from chains of amino acids, and the nature of the amino
acid side chains has significant influence on the topography of the protein. The
bonds between amino acid side chains generate a complex protein structure,
which is considered in four stages: primary, secondary, tertiary, and quaternary.

4.3.1 Primary Structure


The primary structure of a protein refers to the sequence of amino acids that
make up the protein. The bonds considered in the primary structure are the
peptide bonds between each amino acid.

The primary structure is the linear order of amino acid residues along the
polypeptide chain. It arises from covalent linkage of individual amino acids
via peptide bonds.

Ala-Glu-Glu-Ser-Ser-Lys-Ala-Val-Lys-Tyr-Tyr-Thr-...
A-E-E-S-S-K-A-V-K-Y-Y-T-...

Figure 4.6: Single- and three-letter codes for amino acids of a primary sequence
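The conversion between the two notations in Figure 4.6 is a simple table lookup. A minimal sketch, assuming hyphen-separated three-letter codes as input; the names `THREE_TO_ONE` and `to_one_letter` are hypothetical.

```python
# Standard three-letter and one-letter codes of the 20 amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_sequence):
    """Convert e.g. "Ala-Glu-Ser" into "AES"."""
    residues = three_letter_sequence.split("-")
    return "".join(THREE_TO_ONE[residue] for residue in residues)


if __name__ == "__main__":
    print(to_one_letter("Ala-Glu-Glu-Ser-Ser-Lys-Ala-Val-Lys-Tyr-Tyr-Thr"))
```

The one-letter form is the one normally stored in sequence files and used for sequence comparison, because it keeps one character per residue.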

Every protein is defined by a unique sequence of residues, and all subsequent
levels of organization (secondary, super secondary, tertiary and quaternary) rely
on this primary level of structure. Some proteins are related to one another,
leading to varying degrees of similarity in primary sequences. For example,
myoglobin, an oxygen storage protein found in a wide range of organisms, shows
similarities between human and whale in its 153-residue sequence. Most of the
sequence is identical, so it is easier to spot the differences. When a change
occurs in the primary sequence it frequently involves two closely related
residues. For example, at position 118 the human variant has a lysine residue
whilst whale myoglobin has an arginine residue. Both arginine and lysine are
amino acids that contain a positively charged side chain, and this change is
called a conservative substitution. In contrast, in a few positions there are
very different amino acid residues. Consider position 145, where asparagine (N)
is replaced by lysine (K). This substitution is not conservative; the small,
polar side chain of asparagine is replaced by the larger, charged side chain of
lysine. Regions, or residues, that never change are called invariant.

Figure 4.7: The primary sequences of human and sperm whale myoglobin

In the above picture, the regions in yellow show conserved substitutions
whilst the red regions show non-conservative changes.
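The distinction between conservative and non-conservative changes can be sketched by comparing two aligned sequences residue by residue. As a simplifying assumption, a change is treated here as conservative when both residues belong to the same side-chain group of Section 4.2.2.4; real analyses usually rely on substitution matrices instead. The two fragments below are hypothetical and are not the myoglobin sequences.

```python
# Side-chain groups as listed in Section 4.2.2.4 (one-letter codes).
GROUPS = [
    ("AVLIFGWMP", "non-polar"),
    ("STYCQN", "polar, uncharged"),
    ("KRH", "polar, positive"),
    ("DE", "polar, negative"),
]
CLASS_OF = {code: label for codes, label in GROUPS for code in codes}


def compare(seq_a, seq_b):
    """Report each differing position (counted from 1) and the kind of change."""
    changes = []
    for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a != b:
            same_group = CLASS_OF[a] == CLASS_OF[b]
            kind = "conservative" if same_group else "non-conservative"
            changes.append((position, a, b, kind))
    return changes


if __name__ == "__main__":
    fragment_1 = "GKVNA"  # hypothetical aligned fragment
    fragment_2 = "GRVKA"  # hypothetical aligned fragment
    for change in compare(fragment_1, fragment_2):
        print(change)
```

Positions that produce no entry in the output are identical in both sequences.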

4.3.2 Secondary Structure


Secondary structure is the local conformation of the polypeptide chain, i.e.
the spatial relationship of amino acid residues that are close together in the
sequence, stabilized by hydrogen bonds. There are three common shapes: α-helix,
β-pleated sheet, and triple helix. All three shapes are very regular and exist
as a result of hydrogen bonds between backbone amide (N-H) and carbonyl (C=O)
groups that occur at regular intervals along the chain.

4.4 Amino Acid Classifications


• Tiny
• Small
• Polar
• Charged

• Positive
• Aromatic
• Aliphatic

• Hydrophobic
• etc...
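These classes overlap: one amino acid can belong to several of them at once. A
minimal sketch of that idea in Python follows. The memberships below are one
commonly used assignment and are an assumption of this example; different
textbooks draw some boundaries differently (histidine in particular), and only
a few of the classes are filled in.

```python
# One possible assignment of one-letter amino acid codes to classes.
# The exact memberships are an assumption for illustration only.
AMINO_ACID_CLASSES = {
    "tiny": set("AGS"),
    "positive": set("KRH"),
    "negative": set("DE"),
    "aromatic": set("FWYH"),
    "aliphatic": set("ILV"),
}
# "Charged" is simply the union of the positive and negative classes.
AMINO_ACID_CLASSES["charged"] = (
    AMINO_ACID_CLASSES["positive"] | AMINO_ACID_CLASSES["negative"]
)


def classes_of(residue):
    """Return the sorted names of every class that contains the residue."""
    return sorted(
        name for name, members in AMINO_ACID_CLASSES.items() if residue in members
    )


for residue in "HKGL":
    print(residue, classes_of(residue))
```

Storing each class as a set keeps the overlap explicit and makes questions such
as "is this residue charged?" a simple membership test.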

4.5 Ramachandran Plot


Chapter 5

Some Bioinformatics Model
Organisms

—Farjana Khatun

5.1 Origin and Early Evolution


THIS BELOW TEXT HAS NOT YET BEEN CHANGED
FROM THE SOURCE TAKEN
START →
The ancestors of modern bacteria were single-celled microorganisms that were
the first forms of life to appear on Earth, about 4 billion years ago. For about 3
billion years, all organisms were microscopic, and bacteria and archaea were the
dominant forms of life. Although bacterial fossils exist, such as stromatolites,
their lack of distinctive morphology prevents them from being used to examine
the history of bacterial evolution, or to date the time of origin of a particular
bacterial species. However, gene sequences can be used to reconstruct the bac-
terial phylogeny, and these studies indicate that bacteria diverged first from the
archaeal/eukaryotic lineage.


Bacteria were also involved in the second great evolutionary divergence, that
of the archaea and eukaryotes. Here, eukaryotes resulted from ancient bacte-
ria entering into endosymbiotic associations with the ancestors of eukaryotic
cells, which were themselves possibly related to the Archaea. This involved
the engulfment by proto-eukaryotic cells of alpha-proteobacterial symbionts to
form either mitochondria or hydrogenosomes, which are still found in all known
Eukarya (sometimes in highly reduced form, e.g. in ancient "amitochondrial"
protozoa). Later on, some eukaryotes that already contained mitochondria also
engulfed cyanobacterial-like organisms. This led to the formation of
chloroplasts in algae and plants. There are also some algae that originated from
even later endosymbiotic events. Here, eukaryotes engulfed a eukaryotic alga
that developed into a "second-generation" plastid. This is known as secondary
endosymbiosis.

Classification seeks to describe the diversity of bacterial species by naming and
grouping organisms based on similarities. Bacteria can be classified on the basis
of cell structure, cellular metabolism or on differences in cell components such
as DNA, fatty acids, pigments, antigens and quinones. While these schemes
allowed the identification and classification of bacterial strains, it was unclear
whether these differences represented variation between distinct species or be-
tween strains of the same species. This uncertainty was due to the lack of
distinctive structures in most bacteria, as well as lateral gene transfer between
unrelated species. Due to lateral gene transfer, some closely related bacteria
can have very different morphologies and metabolisms. To overcome this uncer-
tainty, modern bacterial classification emphasizes molecular systematics, using
genetic techniques such as guanine cytosine ratio determination, genome-genome
hybridization, as well as sequencing genes that have not undergone extensive
lateral gene transfer, such as the rRNA gene. Classification of bacteria is deter-
mined by publication in the International Journal of Systematic Bacteriology,
and Bergey’s Manual of Systematic Bacteriology. The International Committee
on Systematic Bacteriology (ICSB) maintains international rules for the naming
of bacteria and taxonomic categories and for the ranking of them in the Inter-
national Code of Nomenclature of Bacteria.
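The guanine cytosine ratio mentioned above is easy to compute once a sequence
is available as text. The sketch below is a minimal illustration; the function
name gc_content and the short test sequence are assumptions of this example,
and real genomes would be read from a sequence file rather than typed in.

```python
def gc_content(sequence):
    """Return the fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)


# Hypothetical ten-base fragment used only to exercise the function.
print(gc_content("ACTCACGTAG"))
```

For the toy fragment, five of the ten bases are G or C, so the function
returns 0.5.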

Molecular systematics showed prokaryotic life to consist of two separate
domains, originally called Eubacteria and Archaebacteria but now called Bacteria
and Archaea, that evolved independently from an ancient common ancestor. The
archaea and eukaryotes are more closely related to each other than either is to
the bacteria. These two domains, along with Eukarya, are the basis of the
three-domain system, which is currently the most widely used classification
system in microbiology. However, due to the relatively recent introduction of
molecular systematics and a rapid increase in the number of genome sequences
that are available, bacterial classification remains a changing and expanding
field. For example, a few biologists argue that the Archaea and Eukaryotes
evolved from Gram-positive bacteria.

← END
THIS ABOVE TEXT HAS NOT YET BEEN CHANGED
FROM THE SOURCE TAKEN

5.2 Virus
Virology began in 1898 with the discovery of the tobacco mosaic virus by
Martinus Beijerinck. A virus is an acellular infectious agent; viruses infect
all types of organisms, including archaea, bacteria, plants and animals.

Viruses display a wide diversity of shapes and sizes. Generally viruses are much
smaller than bacteria. Most viruses that have been studied have a diameter
between 10 and 300 nanometres.

Virus particles (known as virions) consist of two or three parts:

• Genetic material, either DNA or RNA

• A protein coat known as the capsid (built from subunits called capsomers)
that protects the genetic material

• In some viruses, an envelope of lipids, obtained from the cell membrane of
the host cell, that surrounds the capsid when the virus is outside a host
cell

Viruses have no metabolism of their own and do not grow through cell division.
Instead, they use the machinery and metabolism of a host cell to produce
multiple copies of themselves, which assemble inside the cell. It is thought
that viruses played a central role in early evolution, before the
diversification of bacteria, archaea and eukaryotes, at the time of the last
universal common ancestor of life on Earth. Viruses are still one of the largest
reservoirs of unexplored genetic diversity on Earth.

Each type of virus can infect only a limited range of hosts, and many are
species-specific. Some, such as smallpox virus, can infect only one species (in
this case humans) and are said to have a narrow host range. Other viruses, such
as rabies virus, can infect different species of mammals and are said to have a
broad host range.

Viruses show greater genomic diversity than plants, animals, archaea or
bacteria. The genomic diversity of viruses is summarised in the table below.

Each entry below gives a property, its parameters and a description.

Property: Nucleic Acid
Parameters:
• DNA
• RNA
• Both DNA and RNA (at different stages of the life cycle)
Description: A virus with a DNA genome is called a DNA virus and a virus with
an RNA genome is called an RNA virus. The vast majority of viruses have RNA
genomes. Genome replication generally takes place in the nucleus of the host
cell for DNA viruses and in the cytoplasm for RNA viruses. RNA viruses use
their own RNA replicase enzymes to create copies of their genomes.

Property: Shape
Parameters:
• Linear (e.g. Adenovirus)
• Circular (e.g. Polyomavirus)
• Segmented
Description: Among RNA viruses and certain DNA viruses, the genome is often
divided into separate parts, in which case it is called segmented. For RNA
viruses, each segment often codes for only one protein, and the segments are
usually found together in one capsid.

Property: Strand
Parameters:
• Single
• Double
• Double strand with regions of single strand (e.g. Hepadnaviridae)
Description: Plant viruses tend to have single-stranded RNA genomes and
bacteriophages tend to have double-stranded DNA genomes.

Property: Size
Description: Genome size varies greatly between species. RNA viruses generally
have smaller genomes than DNA viruses because of a higher error rate when
replicating.
• Smallest genomes: the ssDNA circoviruses code for only two proteins and have
a genome size of about 2 kb.
• Largest genomes: mimiviruses have genome sizes of over 1.2 megabases and code
for over one thousand proteins.

Property: Sense
Parameters:
• Positive (+)
• Negative (-)
• Ambisense (+/-)
Description: Most viruses with RNA genomes, and some with single-stranded DNA
(ssDNA) genomes, can be classified on the basis of sense. If the genomic strand
has the same sense as the mRNA, it is called positive sense (plus strand). As a
result, the host cell can translate at least part of the genome immediately. On
the other hand, negative-sense (antisense) viral RNA is complementary to the
mRNA and must first be copied into positive-sense RNA by the enzyme
RNA-dependent RNA polymerase before translation.
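The relationship between the two senses can be made concrete with a short
sketch. A negative-sense RNA strand is complementary to the mRNA, so the
translatable strand can be written down as its reverse complement. The function
name and the six-base fragment below are assumptions made for this
illustration; the code models only base pairing, not the polymerase itself.

```python
# Watson-Crick pairing for RNA: A pairs with U, and G pairs with C.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def complementary_strand(rna):
    """Return the complementary RNA strand, written 5' to 3'."""
    return rna.upper().translate(RNA_COMPLEMENT)[::-1]


# Hypothetical negative-sense fragment (not taken from any real virus).
negative_sense = "AGCCAU"
positive_sense = complementary_strand(negative_sense)
print(positive_sense)
```

For this toy fragment the positive-sense strand is AUGGCU, which begins with
the start codon AUG, while the negative-sense strand itself contains no AUG.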

Bacteriophages are viruses that infect bacteria. Their genes can contribute to
the expression of the host's phenotype. Bacteria protect themselves from
bacteriophages by producing enzymes called restriction endonucleases, which
destroy bacteriophage DNA by cutting it.
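A restriction endonuclease recognises a short, specific DNA sequence. As a
rough sketch, the Python below searches a sequence for such recognition sites.
The default site GAATTC is the one recognised by the enzyme EcoRI; the function
name and the test fragment are assumptions of this example, and only one strand
is searched.

```python
def find_sites(dna, recognition_site="GAATTC"):
    """Return the 0-based start index of every recognition site in dna."""
    dna = dna.upper()
    positions = []
    start = dna.find(recognition_site)
    while start != -1:
        positions.append(start)
        # Continue one base further on so that overlapping sites are found.
        start = dna.find(recognition_site, start + 1)
    return positions


# Hypothetical 20-base fragment containing two GAATTC sites.
phage_like = "TTGAATTCAAGGGAATTCTT"
print(find_sites(phage_like))
```

For the toy fragment the sites start at indices 2 and 12, so an enzyme with
this specificity would cut the fragment in two places.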

5.2.1 Use of Virus in Life Sciences and Medicine


The study and use of viruses have provided valuable information about aspects
of cell biology. For example, viruses have been useful in the study of genetics
and have helped in understanding the basic mechanisms of molecular genetics,
such as DNA replication, transcription, RNA processing, translation, protein
transport, and immunology. Viruses are used as vectors to introduce genes into
cells. In a similar fashion, virotherapy uses viruses as vectors to treat
various diseases, as they can specifically target cells and DNA. It shows
promise in the treatment of cancer and in gene therapy.

5.2.2 Use of Virus in Materials Science and Nanotechnology
Current trends in nanotechnology promise to make much more versatile use of
viruses. From the viewpoint of a materials scientist, viruses can be regarded
as organic nanoparticles. Their surface carries specific tools designed to cross
the barriers of their host cells. The size and shape of viruses, and the number
and nature of the functional groups on their surface, is precisely defined. As
such, viruses are commonly used in materials science. The powerful techniques
developed by life sciences are becoming the basis of engineering approaches to-
wards nanomaterials, opening a wide range of applications far beyond biology
and medicine.

Plant virus particles or virus-like particles (VLPs) have applications in both
biotechnology and nanotechnology, because their capsids have simple and robust
structures and can be produced in large quantities. Plant virus particles can be
modified genetically and chemically to encapsulate foreign material and can be
incorporated into supramolecular structures for use in biotechnology.

Currently, the full-length genome sequences of 2408 different viruses
(including smallpox) are publicly available in an online database maintained by
the National Institutes of Health.

5.3 Bacteria
In 1676 the microbiologist Antonie van Leeuwenhoek first observed bacteria,
using a single-lens microscope of his own design. He called them "animalcules".
In 1838, Christian Gottfried Ehrenberg introduced the name bacterium. Bacteria
are microscopic, single-celled prokaryotes surrounded by a lipid cell membrane.
They have neither a membrane-bound nucleus nor organelles such as mitochondria,
chloroplasts, the Golgi apparatus or the endoplasmic reticulum. Some bacterial
cells contain micro-compartments, such as the carboxysome, that are enclosed by
protein shells and help in metabolism. Bacterial cell walls are made of
peptidoglycan (carbohydrate chains cross-linked by short peptides). On the basis
of cell wall structure there are two types of bacteria, Gram-positive and
Gram-negative. These names originate from the reaction of the cells to the Gram
stain (crystal violet, with safranin as counterstain).

• Gram-positive bacteria possess a thick cell wall made of many layers of
peptidoglycan and teichoic acids, and retain the Gram stain even after
washing with alcohol or acetone.

• Gram-negative bacteria have a relatively thin cell wall consisting of a few
layers of peptidoglycan surrounded by a second lipid membrane containing
lipopolysaccharides and lipoproteins, and are unable to retain the stain
after washing with alcohol or acetone.

Figure 5.1: Bacteria

Bacteria are a few micrometers in length and have a wide range of shapes
(spheres, rods, spirals). They are found in soil, water, air, radioactive waste
and the deep crust of the earth, in organic materials, and in the living bodies
of plants and animals. Human beings carry different types of bacteria in
different parts of the body, with the largest numbers on the skin and in the
gut flora. The majority of the bacteria in the body are rendered harmless by
the protective effect of the immune system. A few bacteria are beneficial. For
example, the more than 1,000 bacterial species in the normal human gut flora
contribute to gut immunity, synthesize vitamins such as folic acid, vitamin K
and biotin, convert milk sugars to lactic acid, and ferment complex
indigestible carbohydrates. The presence of this gut flora also inhibits the
growth of potentially pathogenic bacteria. A few bacteria are pathogenic and
cause infectious diseases (cholera, anthrax, leprosy, syphilis, plague).

The genetic material of bacteria is typically a single circular chromosome
located in the cytoplasm in an irregularly shaped body known as the nucleoid.
The nucleoid contains the chromosome together with associated proteins and RNA.
Chromosomes can range in size from 160,000 bp to 12,200,000 bp (the latter in a
soil-dwelling bacterium). The genes in bacterial genomes are usually a single
continuous stretch of DNA; several different types of introns do exist in
bacteria, but they are much rarer than in eukaryotes. Bacteria may also contain
plasmids, which are small extra-chromosomal DNAs that may contain genes for
antibiotic resistance.

Bacteria are asexual organisms: cell division occurs through binary fission
(amitosis), so they inherit identical copies of their parent's genes. Under
optimal conditions, bacteria can grow and divide rapidly, and bacterial
populations can double as quickly as every 9.8 minutes. However, all bacteria
can evolve by selection on changes to their DNA caused by genetic recombination
or mutations. Mutations come from errors made during the replication of DNA or
from exposure to mutagens (agents that cause mutation). Mutation rates vary
widely among different species of bacteria. Genetic changes in bacterial genomes
come from either random mutation during replication or "stress-directed
mutation", where genes involved in a particular growth-limiting process have an
increased mutation rate.
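The effect of such a short doubling time can be estimated with the usual
doubling formula, N(t) = N0 * 2^(t / T), where T is the doubling time. The
sketch below assumes idealised, unlimited growth (no nutrient limits or cell
death), which holds only briefly in practice; the function and parameter names
are choices made for this example.

```python
def population_after(minutes, initial_cells=1, doubling_time=9.8):
    """Idealised population size after the given time, assuming constant doubling."""
    return initial_cells * 2 ** (minutes / doubling_time)


# 98 minutes is ten doubling times, so one cell becomes about 2**10 = 1024 cells.
print(round(population_after(98)))
```

The same function can be called with a longer doubling_time to compare slower
growing species.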

Some bacteria also transfer genetic material between cells. This can occur in
three main ways.

• Transformation: bacteria take up exogenous DNA from their environment

• Transduction: integration of a bacteriophage introduces foreign DNA into
the chromosome

• Bacterial conjugation: DNA is transferred through direct cell contact

These types of gene acquisition are known as horizontal gene transfer and are
common in nature. Because of gene transfer, it is difficult to determine the
original sequences of bacteria. As an example of work on a bacterial genome: to
find the genes that Mycoplasma genitalium needs in order to live, scientists at
the J. Craig Venter Institute systematically disrupted its genes one by one (by
insertion mutagenesis) to observe which are essential to life and which are
dispensable. They concluded that only 381 of its 485 protein-encoding genes are
essential to life.

Bacterial growth, staining and morphology can be assayed by culture techniques,
in which samples such as sputum, stool, blood, urine or spinal fluid are
cultured on selective media. These techniques are designed to promote the
growth of particular bacteria and to identify them. Diagnostics using DNA-based
tools, such as the polymerase chain reaction, are increasingly popular due to
their specificity and speed compared to culture-based methods. These methods
also allow the detection and identification of "viable but nonculturable" cells
that are metabolically active but non-dividing. However, even using these
improved methods, the total number of bacterial species is not known and cannot
even be estimated with any certainty. Following present classification, there
are fewer than 9,000 known species of bacteria (including cyanobacteria), but
attempts to estimate the true level of bacterial diversity have ranged from
10^7 to 10^9 species.

5.3.1 Importance of Bacteria in Bioinformatics


Because they grow rapidly and are easy to manipulate, bacteria are among the
most important subjects of study in molecular biology, genetics and
biochemistry as well as bioinformatics. By making mutations in bacterial DNA and
examining the results, scientists can determine the function of genes, enzymes
and metabolic pathways in bacteria, and the resulting hypotheses are then
applied to more complex organisms. Studies such as the enzyme kinetics of a cell
or mathematical models of the gene expression data of an entire organism can be
carried out using well-studied bacterial models.
THIS BELOW TEXT HAS NOT YET BEEN CHANGED
FROM THE SOURCE TAKEN
START →
By combining nanotechnologies with landscape ecology complex habitat land-
scapes can be generated with details at the nanoscale. On such synthetic ecosys-
tems evolutionary experiments with E. coli have been performed in order to
study the spatial biophysics of adaptation in an island biogeography on-chip.
← END
THIS ABOVE TEXT HAS NOT YET BEEN CHANGED
FROM THE SOURCE TAKEN

5.4 Escherichia coli


E. coli is among the most thoroughly studied of all organisms. E. coli K12 was
one of the first organisms to have its entire genome sequenced; the sequence
was published in 1997. Its 4,639,221 base pairs of DNA encode 4,377 genes. It
is usually harmless and lives in the human colon. However, water and
undercooked food contaminated with the O157:H7 strain can cause severe,
sometimes fatal, infection. E. coli is used as a model organism because
cultivated strains (e.g. E. coli K12) are well adapted to the laboratory
environment.

The use of plasmids and restriction enzymes in E. coli to produce recombinant
DNA, by Stanley Norman Cohen and Herbert Boyer, is a foundation of
biotechnology and biological engineering. The production of therapeutic
proteins (insulin, growth factors, antibodies, enzymes) is one of the useful
applications of recombinant DNA.

5.5 Archaea
Another group of prokaryotes, the archaea, share the prokaryotic features
described for bacteria but differ from bacteria in their evolutionary history.
A major step forward in the study of bacteria was the recognition in 1977 by
Carl Woese that archaea have a separate line of evolutionary descent from
bacteria. This new phylogenetic taxonomy was based on the sequencing of 16S
ribosomal RNA, and divided prokaryotes into two evolutionary domains, Bacteria
and Archaea, as part of the three-domain system (bacteria, archaea and
eukaryotes).

5.6 Fungi
Fungi are eukaryotic organisms that include microorganisms such as yeasts and
molds, as well as mushrooms. They are classified as a kingdom that is separate
from plants, animals and bacteria. The study of fungi is called mycology, which
is often regarded as a branch of botany, but genetic studies have shown that
fungi are more closely related to animals than to plants.

Advances in molecular genetics have opened the way for DNA analysis to be in-
corporated into taxonomy, which has sometimes challenged the historical group-
ings based on morphology and other traits. Phylogenetic studies published in
the last decade have helped reshape the classification of Kingdom Fungi, which
is divided into one subkingdom, seven phyla, and ten subphyla.

Several pivotal discoveries in biology have been made possible by researchers
using fungi as model organisms (fungi that grow and reproduce sexually rapidly
in the laboratory).

• The one gene-one enzyme hypothesis was formulated from experiments on the
mold Neurospora crassa.

• Cell cycle regulation, chromatin structure and gene regulation in eukaryotic
cell biology and genetics have been studied using Aspergillus nidulans,
Saccharomyces cerevisiae and Schizosaccharomyces pombe.

• Other important model fungi are Candida albicans (an opportunistic human
pathogen), Magnaporthe grisea (a plant pathogen) and Pichia pastoris (a
yeast widely used for eukaryotic protein expression).

With fungi as model organisms, researchers are trying to solve specific
biological problems relevant to medicine, plant pathology and industrial uses.

5.7 Human Being


There are different types of cells in the human body. Each cell belongs to a
specific tissue and organ and performs a specific function. There are four
types of tissue in the human body:

1. Epithelial Tissue - The cells of epithelial tissue pack tightly together
and form continuous sheets that serve as linings in different parts of the
body. Epithelial tissues serve as membranes lining organs and helping
to keep the body's organs separate, in place and protected. Examples of
epithelial tissue are the outer layer of the skin, the inside of the mouth
and stomach, and the tissue surrounding the body's organs.

2. Connective Tissue - There are many types of connective tissue in the
body. Connective tissue adds support and structure to the body. Most
types of connective tissue contain fibrous strands of the protein collagen
that add strength to connective tissue. For example: inner layers of skin,
tendons, ligaments, cartilage, bone and fat tissue. Blood is also considered
a form of connective tissue.

3. Muscle Tissue - Muscle tissue is a specialized tissue that can contract.
Muscle tissue contains the specialized proteins actin and myosin. Examples
of muscle tissue are contained in the muscles throughout your body.

4. Nerve Tissue - Nerve tissue contains two types of cells: neurons and
glial cells. Nerve tissue has the ability to generate and conduct electrical
signals in the body. These electrical messages are managed by nerve tissue
in the brain and transmitted down the spinal cord to the body.

Organ systems are composed of two or more different organs. There are 10
major organ systems in the human body:

• Skeletal System: The main role of the skeletal system is to provide
support for the body, to protect delicate internal organs and to provide
attachment sites for the organs. Major organs are bones, cartilage, tendons
and ligaments.

• Muscular System: The main role of the muscular system is to provide
movement. Muscles work in pairs to move limbs and provide the organism with
mobility. Muscles also control the movement of materials through some
organs, such as the stomach and intestine, and the heart and circulatory
system.

• Circulatory System: The main role of the circulatory system is to
transport nutrients, gases (such as oxygen and CO2), hormones and wastes
through the body. Major organs are the heart, blood vessels and blood.

• Nervous System: The main role of the nervous system is to relay
electrical signals through the body. The nervous system directs behaviour
and movement and, along with the endocrine system, controls physiological
processes such as digestion and circulation. Major organs are the brain,
spinal cord and peripheral nerves.

• Respiratory System: The main role of the respiratory system is to
provide gas exchange between the blood and the environment. Primarily,
oxygen is absorbed from the atmosphere into the body and carbon dioxide
is expelled from the body. Major organs are the nose, trachea and lungs.

• Digestive System: The main role of the digestive system is to break down
and absorb nutrients that are necessary for growth and maintenance.
Major organs are the mouth, esophagus, stomach, and small and large
intestines.

• Excretory System: The main role of the excretory system is to filter out
cellular wastes, toxins and excess water or nutrients from the circulatory
system. Major organs are the kidneys, ureters, bladder and urethra.

• Endocrine System: The main role of the endocrine system is to relay
chemical messages through the body. In conjunction with the nervous
system, these chemical messages help control physiological processes such
as nutrient absorption and growth. Many glands in the body secrete
endocrine hormones. Among these are the hypothalamus, pituitary, thyroid,
pancreas and adrenal glands.

• Reproductive System: The main role of the reproductive system is to
manufacture cells that allow reproduction. In the male, sperm are created
to inseminate egg cells produced in the female. Major organs in the female
are the ovaries, oviducts, uterus, vagina and mammary glands; in the male
they are the testes, seminal vesicles and penis.

• Lymphatic/Immune System: The main role of the immune system
is to destroy and remove invading microbes and viruses from the body.
The lymphatic system also removes fat and excess fluids from the blood.
Lymph, lymph nodes and vessels, white blood cells, and T- and B-cells
form the immune system of the body.
Chapter 6

Computing Fundamentals
for Bioinformatics

—Zohirul Alam Tiemoon


Bioinformatics would not be possible without taking advantage of computer
hardware, software and the World Wide Web (WWW). Very fast, high-capacity
storage media help to store large volumes of biological data. Software is used
to retrieve information from storage and to analyze it in order to uncover the
mysteries inside these data.

In this chapter we go through the computing fundamentals that a learner of
bioinformatics needs. It covers algorithms, data structures, programming
concepts, computational models, databases, the World Wide Web (WWW) and web
services. All of the discussion here is at a very basic level; the details of
each topic appear throughout the rest of the book.

6.1 Bioinformatics Problem Solving and Algorithm Development
To solve a problem in computer science (software), there are some steps to
follow:

1. Understand the problem precisely.

2. Divide the problem into small pieces. Do some informal pen-and-paper
activities, such as drawing circles, rectangles and lines, to analyze the
problem. Write pseudocode (on paper), which is a set of steps describing
how to implement the logic inside the computer.

3. Finally, write code in any programming language (Python, C/C++, Java,
Perl).

Figure 6.1: Problem Solving Strategy

An algorithm is a set of instructions organized by a person (a computer
scientist or programmer) for solving a particular problem in an efficient and
accurate way. Step 2 above includes the algorithm activities. An algorithm is
usually expressed as pseudocode or as a flowchart.

Suppose you ask a programmer to write a program that finds the number of
Adenines (A) in a DNA sequence.

For the above problem, the pseudocode is the following cryptic lines. (If you
are from a biology background, don't worry. We will go through each line of it.)

Algorithm 1 GetNoOfAdenine(DNA_Sequence)

1: NoOfAdenine ← 0
2: for index ← 0 to (DNA_Sequence.Length − 1) do
3:     if DNA_Sequence[index] = A then
4:         NoOfAdenine ← NoOfAdenine + 1
5:     end if
6:     index ← index + 1
7: end for
8: return NoOfAdenine
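For readers who would like to run the pseudocode, a direct Python translation
might look like the sketch below. It deliberately uses a while loop so that the
explicit increment of pseudocode line 6 stays visible; the Python names are
choices made for this example, and the comments refer to the pseudocode line
numbers above.

```python
def get_no_of_adenine(dna_sequence):
    """Count the Adenines (A) in a DNA sequence, following Algorithm 1."""
    no_of_adenine = 0                        # line 1
    index = 0                                # line 2: index starts at zero
    while index <= len(dna_sequence) - 1:    # line 2: repeat up to Length - 1
        if dna_sequence[index] == "A":       # line 3
            no_of_adenine = no_of_adenine + 1    # line 4
        index = index + 1                    # line 6
    return no_of_adenine                     # line 8


print(get_no_of_adenine("ACTCACGTAG"))
```

For the ten-base sequence ACTCACGTAG the function returns 3.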

6.1.1 Why Do We Need Algorithm?


To solve a mathematical problem, or a biological problem that can be modeled
mathematically, a computer runs an enormous number of instructions. A
programmer uses a high-level programming language to instruct the computer to
carry out the solution of a problem. Writing a computer program requires
knowing the details of this language, its compiler and the basic architecture
of the computer. If the programmer jumps straight to the computer just after
getting the problem, he or she may not reach an accurate and efficient
solution. Sometimes it is nearly impossible to do so if the problem is
inherently complex. So we need to design an algorithm before writing the
computer program for the problem to be solved.

6.1.2 How to Design an Algorithm?


Designing an algorithm requires some knowledge of pseudocode and of how to
organize its steps so as to get an accurate output in minimum time. Before
going into the details, we can try to understand our example for counting
Adenine in a DNA sequence.

There are eight numbered lines in our pseudocode. GetNoOfAdenine(DNA_Sequence)
is a subroutine, and all the numbered instructions are grouped together inside
this subroutine. Later, you will find that a pseudocode may comprise more than
one subroutine.

Line 1 is NoOfAdenine ← 0. Here NoOfAdenine is a variable, which means its
value can be changed. The whole line means that zero (0) is assigned to
NoOfAdenine, so its present value is now zero. The name of this variable gives
some idea of its purpose and of why we initialize it to zero: it represents the
number of Adenines, and initially that number is zero.

Before explaining the next instructions, I would like to describe how a
computer treats a DNA sequence. Consider a DNA sequence such as ACTCACGTAG as a
sample input.

Figure 6.2: DNA Sequence

The computer sees it as a sequence of keyboard characters. The positional value
(in computer science it is called the index) of the first character is zero,
and for each following character (in our case, each nucleotide) it increases by
one. So our DNA sequence, which consists of 10 nucleotides, is mapped to index
positions 0 to 9. If we want to know the 6th nucleotide of our DNA sequence, we
get it with DNA_Sequence[5].
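The same zero-based indexing can be tried directly in Python, where a string
behaves as a sequence of characters. This is only a small illustration; the
variable names are choices made for this example.

```python
dna_sequence = "ACTCACGTAG"

# Print every index next to its nucleotide: 0 A, 1 C, 2 T, and so on.
for index, nucleotide in enumerate(dna_sequence):
    print(index, nucleotide)

# The 6th nucleotide sits at index 5 because counting starts at zero.
sixth_nucleotide = dna_sequence[5]
print("6th nucleotide:", sixth_nucleotide)

# The last valid index is always the length minus one.
last_index = len(dna_sequence) - 1
print("last index:", last_index)
```

For this sequence the 6th nucleotide is C and the last index is 9.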

Now come back to our main discussion. If we look at lines 2 to 7, we find that
lines 3, 4, 5 and 6 are grouped together inside the for of line 2, which is
closed by end for in line 7:
for index ← 0 to (DNA_Sequence.Length − 1)

To repeat the task of lines 3 to 6 we use for in line 2. Line 2 is explained
next:

You will find index, a variable, just after for; its initial value is zero. The
whole line means that the inside instructions are repeated as long as index has
not gone past DNA_Sequence.Length − 1. For our DNA sequence ACTCACGTAG, the
last repetition happens when index is 9 (= 10 − 1); when index becomes 10 the
inside instructions are not executed. But how does the index value increase?
See line 6, index ← index + 1. With each repetition it increases by one.

Lines 3 and 4 are straightforward. They mean that if the present nucleotide
equals A (Adenine), NoOfAdenine is increased by one. Be careful: line 6 is not
under the condition of line 3.

The last line means that the subroutine returns NoOfAdenine.

Now consider the problem for the input DNA_Sequence ACTCACGTAG, so that
GetNoOfAdenine knows DNA_Sequence is ACTCACGTAG. Then:

Simulation

Step 1: NoOfAdenine = 0.

Step 2: As index is 0, line 3 is executed.

Step 3: index is zero in line 3, so DNA_Sequence[0] is A. The condition in
line 3 is therefore true. As line 4 is under line 3, line 4 is executed when
line 3 is true.

Step 4: Line 4 is executed and NoOfAdenine becomes 1 (0 + 1).

Step 5: Line 6 is executed and index becomes 1.

Step 6: As we have reached the last of the lines grouped by for, the program
goes back to line 2. As index is 1 and has not yet reached 10, the inner
instructions of for start again.

Step 7: Now index is 1, so DNA_Sequence[1] is C. The condition in line 3 is
not true (C is not equal to A). As a result line 4 is skipped, so NoOfAdenine
remains the same.

Step 8: Line 6 is executed and index becomes 2.

Step 9: Again the program goes to line 2 and finds that index is 2, so for
repeats.
In this way, the program executes until index has passed 9. Let us look at what
happens in the last few steps.

Suppose we are just before line 6 and index is 8. After line 6, index will be
9. The program goes to line 2 and finds that index is 9, which is less than 10,
so it repeats. In line 3 the condition is false, as DNA_Sequence[9] is G, and
line 4 is skipped. In line 6 index becomes 10; the program again goes to line 2
and finds that index has reached 10, so for is not repeated and the program
moves to line 8, where it returns NoOfAdenine.

For our problem NoOfAdenine will be 3.

If you have understood everything up to this point, you have made great progress towards understanding algorithms.
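The simulation above can also be tried on a computer. The following is a minimal sketch in Python, assuming the pseudocode has the shape described in the text (initialize the counter, loop over the index, test for A, increase the index, return the counter). The Python names get_no_of_adenine, dna_sequence and no_of_adenine are our own renderings of the pseudocode names, not part of the original subroutine.

```python
def get_no_of_adenine(dna_sequence):
    """Count the occurrences of 'A', following the pseudocode step by step."""
    no_of_adenine = 0                    # NoOfAdenine starts at 0
    index = 0
    while index != len(dna_sequence):    # stop once index has passed Length - 1
        if dna_sequence[index] == "A":   # the condition of pseudocode line 3
            no_of_adenine += 1           # executed only when the condition is true
        index += 1                       # not under the condition: runs every time
    return no_of_adenine


print(get_no_of_adenine("ACTCACGTAG"))   # prints 3
```

The explicit while loop is used instead of a more compact Python idiom so that each statement can be matched against a step of the simulation.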

6.1.3 How to Write Pseudocode


6.1.4 Types of Algorithm

6.2 Data Structure


6.3 Concept and Usage of Database
6.4 Computational Model
6.5 Programming Concept and Applications
6.6 World Wide Web (WWW)
6.7 Web Service
Rough Top-View of Contents-
• Algorithm (performance etc)

• Data Structure (graph, tree etc)


Chapter 7

Math Primer for Bioinformatics

—Zohirul Alam Tiemoon

Chapter 8

Biological Processes,
Experimental Methods &
Machinery

—Farjana Khatun

8.1 DNA Cloning


8.2 DNA Sequencing
8.3 Gel electrophoresis
8.4 DNA Cloning in Plasmid Vector
8.5 Sanger Method for DNA Sequencing
8.6 DNA Shotgun Sequencing
8.7 DNA Microarray
8.8 Recombinant DNA Technology
8.9 Constructing Genomic and cDNA Libraries

Part II

Introduction to
Bioinformatics Problems

Chapter 9

DNA & Protein Sequencing

—Saddam Hossain

This chapter describes how the sequence of nucleotides of a species is obtained.

9.1 DNA Sequencing


The single mother of all humans is Eve and the single father of all humans is Adam: how could we explore this hypothesis without knowing the code of the human being, which is encoded in DNA and undergoes evolutionary change in the form of mutations? DNA is representative of a species. Nowadays it has become a routine job for any biomolecular lab to read some DNA to find out its code, which is called the DNA sequence [1].

DNA Sequencing: DNA sequencing is the process of determining the complete ordered sequence of nucleotides (A, T, C and G) of a complete or partial DNA of an organism.

Polio Virus DNA Sequence

Part of Human Genome Sequence


Figure 9.1: DNA Sequencing

Graph for DNA Sequencing Cost over Years

9.2 History of DNA Sequencing


Prior to the mid-1970s there was no direct method for sequencing DNA. During that period, scientists relied on their knowledge and experience and on a form of reverse genetics, in which the amino acid sequence of the gene product of interest is back-translated into a nucleotide sequence based upon the appropriate codons. Given the degeneracy of the genetic code, this process is tricky at best, and it is neither accurate nor direct. Around the mid-1970s, scientists invented direct methods for DNA sequencing and the revolution started. The DNA of many viruses was sequenced in the early years. The first free-living organism to have its complete genome sequenced was Haemophilus influenzae, in 1995; the genome of E. coli, which is often cited in this context, was published by Fred Blattner's group in 1997. Both organisms are prokaryotes. With faster automated sequencing mechanisms and computational models, the Human Genome Project set the greatest landmark so far in the field of bioinformatics by sequencing the complete human genome in 2003. About 1,000 more eukaryotic genomes are currently in production.
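To see why back-translation is tricky, it helps to count how many different nucleotide sequences could encode the same peptide. The sketch below is only an illustration: it assumes the standard genetic code, and the function name count_back_translations and the example peptides are our own choices.

```python
# Number of codons that encode each amino acid in the standard genetic code
# (one-letter amino acid codes; together they cover the 61 sense codons).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_back_translations(peptide):
    """Number of distinct nucleotide sequences that encode the peptide."""
    total = 1
    for amino_acid in peptide:
        total *= CODON_COUNTS[amino_acid]
    return total


print(count_back_translations("MKL"))    # 1 * 2 * 6 = 12
print(count_back_translations("MSLRG"))  # 1 * 6 * 6 * 6 * 4 = 864
```

Even a peptide of five amino acids can correspond to hundreds of candidate nucleotide sequences, which is why a direct sequencing method was needed.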

9.3 Methods of DNA Sequencing


The major challenge of DNA sequencing is that no machine has been invented that can sequence a long DNA strand in a single stretch. All existing machines and processes can sequence only about 500 to 5,000 bases at a time, which is very small compared with the complete genome of any eukaryote.

The very first methods of DNA sequencing were the Maxam-Gilbert method and the Sanger method. The Maxam-Gilbert method, a chemical cleavage method, was developed by Allan Maxam and Walter Gilbert in 1977. At around the same time Fred Sanger devised a dideoxynucleotide chain termination method, known as the Sanger method [Chapter 08]. These were the two competing methods of determining a DNA sequence in the early days of bioinformatics. As the Maxam-Gilbert method was once predominant but is no longer frequently used, it is not described in detail in this book; one of its advantages, however, is that it permits direct sequencing of small fragments. The Sanger method is the commonly used one, and even the Human Genome Project was carried out with the Sanger method and shotgun sequencing. The Sanger method is carried out using gel electrophoresis, and it comes in two common forms: i) the manual Sanger method and ii) the automated Sanger method. Shotgun sequencing takes maximum advantage of the speed and low cost of automated sequencing, but relies totally on software to assemble a jumble of sequence reads into a coherent and accurate contig. There are many success stories for shotgun sequencing, which is why it is one of the most favored sequencing methods to date. The Institute for Genomic Research (TIGR) has demonstrated the power and utility of the shotgun approach by determining the complete genomic sequences of Haemophilus influenzae, Methanococcus jannaschii, and Mycoplasma genitalium.

The technology of DNA labeling has changed in the last fifteen years, so there are now many more options. In the modern era there are more automated methods, such as Automated Fluorescence Sequencing and Radioactivity-based Dye Termination Sequencing, which are more versatile and offer much higher throughput than the earlier methods. There are other methods as well, such as Cycle Sequencing and Capillary Electrophoresis. Newer and promising technologies include Computational Fragment Assembly, Pyrosequencing and Single Molecule Methods. Ultimately, all these techniques and methods provide the order of the nucleotides in a given DNA.

9.4 DNA Sequencing Process


DNA sequencing is a very long process. It comprises the following steps, which lead to a sequenced DNA fragment. These steps are followed in several parts of the bioinformatics industry.

Extraction of Genomic DNA: The first step is to extract high-quality DNA from the organism to be sequenced. Different kits and protocols are available for extracting clean genomic DNA efficiently from the respective organism.

Genome Mapping: The genome map is an essential prerequisite for the DNA sequencing task. The region of the genome to be sequenced is first identified from the genome map. A set of clones from this region is then selected; these are the mapped clones. This region is then amplified.

Library Creation: From the selected mapped clones, sets of smaller clones are made through cloning. This pool of clones acts as a clone library for further sequencing work.

Template Preparation: In this stage DNA is purified from the smaller clones, and the wet-lab set-up for the sequencing chemistry (using any of the methods discussed above) is done.

Gel Electrophoresis: The sequences of the smaller clones are determined here using gel electrophoresis.

Pre-finishing: Special sequencing techniques are used to produce high-quality sequences. This is a crucial step, because the DNA sequence clean-up is done here.

Finishing: This is the stage where the final product, the sequenced DNA, is obtained. The sequenced DNA is now ready to be processed as DNA sequence data for further use.

Data Editing / DNA Annotation: To make the sequenced DNA available for subsequent bioinformatics research, it needs to be stored in a library or genome bank. Before submission to public databases, some steps of quality assurance, verification and biological annotation are needed.

Genome Mapping → Library Creation → Template Preparation → Gel Electrophoresis → Pre-finishing → Finishing → Data Editing / DNA Annotation

Figure 9.2: DNA Sequencing Process



COLLECT PICTURE FOR DNA SEQUENCING LAB


IN BANGLADESH IF POSSIBLE

9.5 DNA Sequencing in Real Time


Imagine that you visit a doctor, and the doctor prescribes some diagnostic blood tests along with another test to find the DNA sequence of a specific region of your genome. The doctor may need the sequence of that region to reach a conclusion about the presence or absence of a DNA pattern there. A day will come when sequencing a fragment of DNA is almost routine work in clinical practice. The primary requirement for this is high-speed, ultra-fast DNA sequencing. Several bioinformatics research institutes, academic groups and companies are working towards this end. Such future techniques must differ substantially from the current and probable next-generation methods of sequencing, and must offer very high throughput. One technique that may open the door to this future is the nanopore sequencing approach. In this approach the nucleic acids are driven through a nanopore (either a biological membrane protein such as alpha-hemolysin or a synthetic pore). Fluctuations in DNA conductance through the pore or, potentially, the detection of interactions of individual bases with the pore are used to infer the nucleotide sequence. Although progress has been made in achieving early proof-of-concept demonstrations with such methods, major technical challenges remain along the path to a truly practical nanopore-based sequencing platform.

9.6 Next Generation DNA Sequencing


The expected and proposed next generation of DNA sequencing must be able to run thousands of individual mini-sequencing reactions on a single plate, so that millions of base pairs can be sequenced in a single run. Sequence capture arrays will become available to focus sequencing on genes of interest, which will help in the comprehensive analysis of genes in full length. This is likely to be expensive in terms of instruments and reagents.

9.7 Complete Genome Sequencing


Complete genome sequencing refers to sequencing the DNA of an organism end to end. Whole genome sequencing strategies are based on Genome Mapping (i.e., Physical Mapping), Primer Walking, Shotgun Sequencing, etc. The genome map is the primary tool for starting or planning a complete DNA sequencing process. As no machine can sequence a large DNA molecule (> 5000 bp) in a single stretch, the current principle of complete genome sequencing is to break the genome into smaller pieces, sequence the smaller DNA fragments, and reconstruct the complete sequence from these sequenced fragments. The genome is fragmented on the basis of the genome map. The fragments are then cloned to build a clone library, and these DNA fragments are sequenced (using the methodologies discussed previously). After that, computational models and software packages are used to identify overlapping clones with common restriction fragments and to assemble them into contigs. These contigs are edited and aligned to the genome map. Gaps between clones are filled in this step with other clones (such as fosmids), or by generating PCR products from BAC clones or genomic DNA. In this way the contigs are assembled into the complete genome.
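The idea of reconstructing a sequence from overlapping fragments can be illustrated with a toy program. The sketch below uses a simple greedy strategy (repeatedly merging the two fragments with the largest overlap) on error-free, same-strand fragments. It is not the algorithm of any particular assembly package, which must also deal with sequencing errors, repeats and reverse strands; the function names, the min_len parameter and the example fragments are our own assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


fragments = ["ACTCACG", "CACGTAG", "GTAGGCT"]  # made-up fragments for illustration
print(greedy_assemble(fragments))              # ['ACTCACGTAGGCT']
```

When no overlap of at least min_len bases remains, the function returns several contigs, which corresponds to the gaps that have to be filled with additional clones or PCR products.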

FIGURE FOR CLONE TO CONTIG TO GENOME

Based on the principles discussed above, there are three major strategies for complete genome sequencing. i) Hierarchical or clone-by-clone sequencing, where the genome is broken into many long pieces, each long piece is mapped onto the genome, and each piece is then sequenced with the shotgun method. This strategy has been applied to yeast, worm, human, rat, etc. ii) Walking, which is the online version of (i): the genome is broken into many long pieces, shotgun sequencing of each piece is started, and the map is constructed as sequencing proceeds. The rice genome has been sequenced in this manner. iii) Whole genome shotgun, in which one large shotgun pass is made over the whole genome. The genomes of many organisms, such as Drosophila, human (Celera), Neurospora, mouse, rat and Fugu, have been sequenced using this strategy.

FIGURE FOR STRATEGIES

9.8 Challenges of DNA Sequencing


The main challenge of DNA sequencing is that no machine takes a long DNA molecule as input and gives the complete sequence as output. The available methods can only sequence around 500 letters (base pairs) at a time, so increasing the sensitivity of current instruments (in terms of sequence length) is essential. In the chemistry lab, there is a need for additional fluor combinations to enable reaction multiplexing, which can save time and money. Lowering the cost of sequencing is another challenge ahead of us, along with increasing the throughput. Historically, most cost decreases have been incremental rather than monumental, but large cost decreases are needed here, and these may require revolutionary approaches. Making the applications and related instruments widely available is another concern in DNA sequencing. One statistic suggests that a current laboratory set-up for DNA sequencing (e.g., a 3100 Genetic Analyzer) can sequence on the order of 100 to 200 samples in one day: 16-20 samples make up one run, a plate holds 6-10 runs, and 2 plates can be loaded at once. The average daily sequencing capacity is therefore about 200 samples. This throughput is very low compared with the current and upcoming demand for sequenced DNA. A well-maintained machine is also vital to a successful sequence.

9.9 Usage of DNA Sequencing


By some estimates there are about 100 million species, and each individual has different DNA. Even within an individual, some cells have different DNA (for example, cancer cells). How many sequences are there? They need to be sequenced before a population can be studied from any direction. If we want to explore which genes are switched on or off, when, and in which cell, we need to know the sequence first. Where do molecules bind to DNA? This study also needs sequenced DNA. The study of Single Nucleotide Polymorphisms (SNPs) will not be successful unless we have correctly sequenced DNA, which represents the correct order of nucleotides on the DNA strand. To solve the "mystery" of the DNA sequence, DNA needs to be sequenced first. Sequenced DNA can be used to characterize the genetic differences between affected and unaffected individuals, and between diseased and normal cells. Having sequenced DNA in hand is also the primary requirement for developing diagnostic or prognostic assays for disease. The following are some summarized applications of sequenced DNA.
• Designing Preventive Medicine: The sequenced DNA of individ-
ual human genome can be used as a component for designing preventive
medicine.
• Genetic Hypothesis Testing: Rapid hypothesis testing for genotype-
phenotype associations.
• Gene-Expression Profiling: Application of DNA sequencing is neces-
sary for in vitro and in-situ gene expression profiling at all stages in the
development of a multicellular organism.
• Cancer Research: DNA sequencing is needed in cancer research, for example to determine comprehensive mutation sets for individual clones.
• Pathogen Identification: Identification of known and new pathogens, and development of biowarfare sensors.
• Detailed Genome Annotation: Sequenced DNA is the key to a genome annotated in detail.
• Study of Evolution: Evolution can be studied in detail, down to the explanation of single nucleotide polymorphisms (SNPs) and other types of mutation and speciation, using correctly sequenced DNA.
• etc.
Put in the shortest possible sentence, DNA sequencing is the doorway to the endless possibilities of bioinformatics.

9.10 DNA Sequencing: Where to Next


The importance and impact of DNA sequencing in everyday life keep growing, and its use in daily life will demand efficient, cost-effective and fast methods for DNA sequencing. Population studies and personalized drugs are two future needs that DNA sequencing must serve. It is roughly expected that by 2020 we will be able to completely sequence a million individuals. Future variants of sequencing may include resequencing of humans, microbial and environmental sequencing, cancer genome sequencing, etc.

Opportunities for discovery are virtually endless, from complex diseases to pa-
leogenomics and museomics (analysis of ancient DNA), from searching for new
organisms in the deep ocean and volcanoes to manipulating valuable traits in
livestock and molecular plant breeding. This is where the challenges as well as
major opportunities lie in the future.

9.11 Case Study: Human Genome Project


The Human Genome Project was a large, federally funded collaborative project completed in 2003. It was a $3 billion project to sequence the human genome of 3 billion nucleotides. The project developed from an idea discussed at scientific meetings in 1984 and 1985, and a pilot project, the Human Genome Initiative, was begun by the Department of Energy (DOE) in 1986. National Institutes of Health funding of the project began in 1987 under the Office of Genome Research. The project was then constituted as the National Human Genome Research Initiative. In 1998, a new commercial venture under the leadership of Craig Venter was formed to sequence the majority of the human genome with intensive computer processing of data; it has also completed the Drosophila and mouse genome sequences. Both groups jointly announced a working draft of the human genome in 2000, and the Human Genome Project was declared complete in 2003. Officially the project ran from 1990 to 2003. It was the largest single project ever undertaken in bioinformatics research. Although the first human genome was sequenced at a cost of $3 billion and about 13 years of labour, by 2005 the cost had come down to $20-30 million, with a timeline of 6 months. The cost has since come down to about $50,000-100,000, but this is still a very high price for sequencing a human genome.
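The figures quoted above can be turned into a cost per nucleotide with a few lines of arithmetic. The sketch below uses only the numbers given in this section; taking the midpoint of each quoted range is our own simplification, and the labels are informal.

```python
GENOME_SIZE = 3_000_000_000  # nucleotides in the human genome, as quoted above

# (label, total cost in US dollars); midpoints stand in for the quoted ranges
costs = [
    ("Human Genome Project", 3_000_000_000),
    ("around 2005", 25_000_000),
    ("later figure", 75_000),
]


def cost_per_base(total_cost, genome_size=GENOME_SIZE):
    """Average cost of sequencing one nucleotide."""
    return total_cost / genome_size


for label, cost in costs:
    print(f"{label}: ${cost_per_base(cost):.6f} per nucleotide")
```

The first line of output corresponds to one dollar per nucleotide, which makes the scale of the later cost reductions easy to appreciate.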

9.12 Protein Sequencing


Protein sequencing is the determination of the amino acid sequence of a protein. Discovering the structures and functions of proteins in living organisms is important for understanding cellular processes, and allows drugs that target specific metabolic pathways to be invented more easily. The two major direct methods of protein sequencing are Mass Spectrometry and the Edman Degradation Reaction. It is also possible to generate an amino acid sequence from the DNA or mRNA sequence encoding the protein. The Edman degradation is a very important reaction for protein sequencing, because it allows the ordered amino acid composition of a protein to be discovered. Automated Edman sequencers are now in widespread use, and are able to sequence peptides up to approximately 50 amino acids long.

The other major direct method by which the sequence of a protein can be
determined is Mass Spectrometry. This method has been gaining popularity in
recent years as new techniques and increasing computing power have facilitated
it. Mass spectrometry can, in principle, sequence any size of protein, but the
problem becomes computationally more difficult as the size increases. Peptides
are also easier to prepare for mass spectrometry than whole proteins, because
they are more soluble. One method of delivering the peptides to the spectrome-
ter is electrospray ionization, for which John Bennett Fenn won the Nobel Prize
in Chemistry in 2002.
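Besides the two direct methods, the text mentions that an amino acid sequence can be generated from the DNA or mRNA sequence that encodes the protein. A minimal sketch of that indirect route is given below. It assumes the standard genetic code and a coding DNA sequence that starts in the correct reading frame; the names BASES, AMINO_ACIDS, CODON_TABLE and translate are our own, and an mRNA sequence would first need its U letters replaced by T.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the standard codon table: codons are enumerated with T, C, A, G
# in each of the three positions, and '*' marks a stop codon.
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA sequence, reading codons from position 0."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":   # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGAAATGGTAA"))  # ATG AAA TGG TAA gives MKW
```

The result uses one-letter amino acid codes; a sequence obtained in this way is a prediction from the DNA, not a direct measurement of the protein.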

EXAMPLE OF SEQUENCED PROTEIN - PICTURE


Chapter 10

Genome Mapping

—Saddam Hossain


Genome Mapping
A genome map is the guide to the genetic highway. Imagine that one of your best friends has moved to Dhaka, and you are on your way to meet her at her home. You are driving down the highway to visit her. Your favorite tunes are playing on the radio, and you haven't a care in the world. You stop to check your maps and realize that all you have are inter-division highway maps, not a single street map of the area. How will you find your friend's house? It is going to be difficult, but eventually you may stumble across the right house.

This scenario is similar to the situation facing scientists searching for a specific gene somewhere within the vast genome. They have available to them two broad categories of maps: genetic maps and physical maps. Both genetic and physical maps provide the likely order of items along a chromosome. However, a genetic map, like an inter-division highway map, provides an indirect estimate of the distance between two items and is limited to ordering certain items. One could say that genetic maps serve to guide a scientist toward a gene, just as an inter-division map guides a driver from city to city. Physical maps, on the other hand, mark an estimate of the true distance, measured in base pairs, between items of interest. To continue our analogy, physical maps are similar to street maps, where the distance between two sites of interest may be defined more precisely in terms of city blocks or street addresses. Physical maps therefore allow a scientist to home in more easily on the location of a gene. A closer look at how each of these maps is constructed may be helpful in understanding how scientists use them to traverse the genetic highway commonly referred to as the "genome".

Genome mapping is the process of assigning a specific gene to a particular region of a chromosome and determining the locations of, and relative distances between, genes on the chromosome. There are two types of maps: the genetic map and the physical map. The genetic map shows the arrangement of genes and genetic markers along the chromosomes as calculated from the frequency with which they are inherited together. The physical map is a representation of the chromosomes that provides the physical distance between landmarks on the chromosomes, ideally measured in nucleotide bases. Physical maps can be divided into three general types: chromosomal or cytogenetic maps, radiation hybrid (RH) maps, and sequence maps. The ultimate physical map is the complete sequence itself.

A genome map helps scientists navigate around the genome. Like road maps and other familiar maps, a genome map is a set of landmarks that tells people where they are and helps them get where they want to go. The landmarks on a genome map might include short DNA sequences, regulatory sites that turn genes on and off, and the genes themselves. Genome maps are often used to help scientists find new genes. Road maps chart well-known territory surveyed with astonishing precision, but a genome map is a map of a new frontier. In that sense, a genome map is more like the maps of Bangladesh made when the Portuguese were just beginning to explore the subcontinent. Some parts of the genome have been mapped in great detail, while others remain relatively uncharted territory. It may turn out that a few landmarks on current genome maps appear in the wrong place or at the wrong distance from other landmarks. But over time, as scientists continue to explore the genome frontier, the maps will become more accurate and more detailed. Genome mapping is a work in progress.

"Genome mapping" refers to the mapping of genes to specific locations on chromosomes. It is a critical step in the understanding of genetic diseases. There are two types of genome mapping:

Genetic Mapping: Genetic mapping uses linkage analysis to determine the relative position of two genes on a chromosome.

Physical Mapping: Physical mapping determines the absolute position of a gene on a chromosome.

The ultimate goal of genome mapping is to clone genes, especially disease genes.
Once a gene is cloned, we can determine its DNA sequence and study its protein
product.

10.1 Genetic Mapping


10.1.1 Landmarks of Genetic Maps
Just as inter-division maps have cities and towns that serve as landmarks, genetic maps have landmarks known as genetic markers, or "markers" for short. The term marker is used very broadly to describe any observable variation that results from an alteration, or mutation, at a single genetic locus. A marker may be used as a landmark on a map if, in most cases, that stretch of DNA is inherited from parent to child according to the standard rules of inheritance. Markers can be within genes that code for a noticeable physical characteristic such as eye color, or a not so noticeable trait such as a disease. DNA-based reagents can also serve as markers. These types of markers are found within the non-coding regions of genes and are used to detect unique regions on a chromosome. DNA markers are especially useful for generating genetic maps when there are occasional, predictable mutations that occur during meiosis (the formation of gametes such as egg and sperm) that, over many generations, lead to a high degree of variability in the DNA content of the marker from individual to individual.

10.1.2 Linkage Analysis


Genetic mapping is based on the linkage between "loci" (locations of genes). If two loci are usually inherited together, they are said to be "linked". Two loci on different chromosomes are not linked because they are usually separated by independent assortment. A locus (singular of loci) may have different sequences, referred to as alleles. Consider two loci A and B, each having two alleles (one from the mother, another from the father). A1 and A2 are the two alleles of locus A; B1 and B2 are the two alleles of locus B. Initially A1 and B1 are located on the same chromosome, and A2 and B2 are located on a different chromosome. During recombination two pairs of sister chromatids align. DNA crossover leads to recombination if the chiasma is located between the two loci, and does not lead to recombination if the chiasma is not located between the two loci.

DNA crossover may thus cause recombination of loci A and B, so that A1 and B2 (or A2 and B1) are located on the same chromosome. The recombination frequency depends on the distance between the two loci and the position of the crossover (the chiasma). The closer the loci are, the less likely recombination is to occur, because recombination occurs only when the chiasma is located between the two loci. To apply this basic principle to map a disease gene, we need to analyze the pedigree and estimate the recombination frequency.
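Estimating a recombination frequency comes down to counting. The sketch below assumes that we can observe, for each offspring chromosome, which allele of locus A and which allele of locus B it carries, with A1-B1 and A2-B2 as the parental combinations described above. The counts are hypothetical, made up purely for illustration, and the conversion to centimorgans (1% recombination is roughly 1 cM for closely linked loci) is a standard convention that is not discussed in the text above.

```python
def recombination_frequency(offspring):
    """Fraction of offspring chromosomes carrying a recombinant allele pair.

    Each item is a pair such as ("A1", "B1"). The parental combinations are
    A1-B1 and A2-B2; A1-B2 and A2-B1 are recombinant.
    """
    parental = {("A1", "B1"), ("A2", "B2")}
    recombinant = sum(1 for pair in offspring if pair not in parental)
    return recombinant / len(offspring)


# Hypothetical counts: 88 parental and 12 recombinant chromosomes.
offspring = (
    [("A1", "B1")] * 45 + [("A2", "B2")] * 43
    + [("A1", "B2")] * 7 + [("A2", "B1")] * 5
)

rf = recombination_frequency(offspring)
print(f"recombination frequency = {rf:.2f}")            # 12 / 100 = 0.12
print(f"approximate map distance = {rf * 100:.0f} cM")
```

A small recombination frequency indicates that the two loci lie close together on the chromosome, which is the principle used to place a disease gene relative to known markers.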

10.2 Physical Mapping


10.3 Restriction Mapping
Within the 3 billion or so nucleotides that make up the human genome, identifying one particular gene composed of only several thousand nucleotides is a really challenging task. The task has been simplified somewhat by creating maps of the DNA. Intuitively, a map, like a country map or a road map, is a drawing or layout of the landmarks of a city or a road that keeps their relative positions and distances. A DNA map is no different: it is a map that describes the relative positions of different DNA landmarks. DNA landmarks can be thought of as different genes or traits, which are nothing but fragments of the DNA sequence expressed as strings of nucleotides. These landmarks may also be called fingerprints or markers.

10.3.1 Historical Background


The history begins with the genetic map. Genetic maps were first constructed in the early 20th century and preceded all other map-construction methods. In the early days of genetics, it was believed that the phenotypes of organisms, such as eye color in flies, are inscribed in the chromosomes and are heritable by their descendants. As a result, the early maps started with the mapping of chromosome landmarks, which were treated as genes. Through the gradual progress of the life sciences and the revolutionary discovery of DNA, the map finally evolved into a map of DNA markers. Although we are in the era of DNA, the first crude genetic linkage maps still have many practical uses, especially in agriculture and in human disease models.

The contemporary counterpart of the genetic map is the physical map. It is relatively new and has advanced rapidly in the last decade because of advances in clone manipulation, high-throughput automation and efficient computational models. In a physical map, the distance between DNA landmarks is expressed quantitatively as a number of bases or nucleotides (kilobases, kb, or megabases, Mb). Physical mapping is now a central technology in deriving a finished genome DNA sequence for many genomes, especially the human genome and the genomes of many model organisms. The physical map has a higher resolution than the old genetic map: the genetic map resolves down to the level of genes, whereas the physical map resolves down to the individual nucleotide bases of the DNA. That is why the physical map is the core tool for relating a phenotype to the genotype responsible for it (the corresponding DNA sequence). In fact, the ultimate physical map of a DNA, with distances given in numbers of bases and with sequenced fragments, is its complete sequence. There are three general categories of techniques for building a physical map: (1) Cytogenetic Characterization, (2) Radiation Hybrid Mapping and (3) Restriction Mapping. Restriction mapping is the most widely used of these, because of its rich and accurate biological model and its efficient computational model. That is why we would like to start our journey into the Bioinformatics Problems from here.

10.3.2 Restriction Map


A Restriction Map is a description of the restriction enzyme cleavage sites within a piece of DNA or on a complete DNA. Restriction Mapping is the process of determining the structural information of a target DNA molecule, in the form of the approximate positions of its restriction sites, by the use of restriction enzymes. Restriction mapping is the first step in characterizing and sequencing an unknown DNA sequence, and a prerequisite to DNA manipulation for other purposes. There are several protocols for restriction mapping. Before getting into the details of restriction mapping, let us have an overview of Restriction Enzymes (Endonucleases).

Restriction Enzyme: The discovery of restriction enzymes is a wonderful lesson
on the unexpected outcomes of very basic research, in this case research on the
infection of Bacteria by Viruses. Bacteria protect themselves against viruses
using certain enzymes, and the concept of restriction enzymes evolved from here
as a bacterial defense against DNA Bacteriophages. DNA invading a bacterial
cell defended by these enzymes is digested into small, non-functional pieces.
The name “Restriction Enzyme” comes from the enzyme’s function of restricting
the growth of viruses in the cell. A bacterium protects its own DNA from these
restriction enzymes by having another enzyme that modifies the restriction
sites by adding a Methyl Group. For example, E. coli makes the restriction
enzyme EcoRI and the Methylating Enzyme EcoRI Methylase. The methylase modifies
EcoRI sites in the bacterium’s own genome to prevent them from being digested.
The combination of restriction enzyme and methylase is termed the Restriction
Modification (RM) System. Notably, research on restriction enzymes led to a
Nobel Prize for Arber, Nathans and Smith in 1978. Several hundred restriction
enzymes have already been characterized and isolated; many are commercially
available and relatively inexpensive. The availability of restriction enzymes
has made DNA manipulation much easier.

In simple terms, Restriction Enzymes are enzymes that cut DNA at specific
Recognition Sequences called Sites. These enzymes are also called Restriction
Endonucleases. EcoRI is one such enzyme; it cuts DNA at the sequence “GAATTC”.
Because its recognition sequence is six bases long, EcoRI is a Six-Cutter
restriction enzyme.

There is a standard Nomenclature system for restriction enzymes. The first
letter is the initial letter of the Genus name of the organism from which the
enzyme is isolated. The second and third letters are the first two letters of
the organism’s Species name. A fourth letter, if any, indicates a particular
strain of the organism. Roman numerals indicate the order in which different
endonucleases were isolated from a particular organism and strain. For example,
EcoRI is found in Escherichia coli and HindIII in Haemophilus influenzae. So,
in EcoRI, E = genus Escherichia, co = species coli, R = strain RY13 and I =
first endonuclease isolated; in HindIII, H = genus Haemophilus, in = species
influenzae, d = strain Rd and III = third endonuclease isolated.

Restriction Sites: Restriction sites are the specific recognition sequences
where a specific restriction enzyme cuts. Restriction sites are usually 4-8
bases in length and are Palindromic, that is, reading the upper strand from 5’
to 3’ gives the same sequence as reading the lower strand from 5’ to 3’.

5’...G-A-A-T-T-C...3’
3’...C-T-T-A-A-G...5’

Figure 10.1: Cruciform structure of Restriction Site

As a result, each Strand of the DNA can Self-Anneal and the DNA forms a
small Cruciform Structure. This structure may help the enzyme to recognize
the sequence that it is designed to cut.
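
The palindrome property can be checked mechanically. The following is a minimal
Python sketch under simple assumptions (upper-case DNA letters only); the
function names are illustrative, and the HindIII site AAGCTT is added here as a
second well-known example that does not appear in the text above.

```python
# Watson-Crick pairing used to build the complementary strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindromic_site(seq: str) -> bool:
    """A site is palindromic when both strands read the same 5' to 3'."""
    return seq == reverse_complement(seq)


print(is_palindromic_site("GAATTC"))  # EcoRI site -> True
print(is_palindromic_site("AAGCTT"))  # HindIII site -> True
print(is_palindromic_site("GAATTG"))  # made-up non-palindrome -> False
```

Note that a restriction palindrome is not a string that equals its own reverse;
it equals its own reverse complement.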

10.3.3 Restriction Mapping Process


Constructing Restriction Maps has been a topic of interest for both Biologists
and Computer Scientists since the very beginning of Bioinformatics. There are
several processes for constructing restriction maps. Some are purely
biological, and these are expensive; some are purely computational; and most
methods combine Biological and Computational models, which is where the
strength of bioinformatics lies. First the DNA is broken into pieces, and then
the locations of the breakpoints are identified. The DNA is broken using
restriction enzymes. After a DNA segment has been digested with a restriction
enzyme, the resulting fragments can be examined using a laboratory method
called Gel Electrophoresis to approximate the lengths of the pieces of DNA.
Then the map construction is carried out. As running gel electrophoresis
repeatedly is very costly, Computational Models are used in combination with
the biological lab results to obtain the map.

The digestion can be done in several ways: Single Digestion, in which the DNA
is digested using a single restriction enzyme; Double Digestion, in which the
DNA is digested using two different restriction enzymes at the same time; and
Multiple Digestion, in which the digestion is done in the presence of several
restriction enzymes. Single digests are used to determine which fragments are
in the unknown DNA, and double digests to order and orient the fragments
correctly. Digestion also falls into two categories according to its
completeness. In Complete Digestion, the DNA is cut at every restriction site,
so gel electrophoresis measures the lengths of the fragments between
consecutive restriction sites. In Partial Digestion, each DNA molecule is cut
at only some of its sites, so the distances between all pairs of restriction
sites are measured. Because the length of each DNA fragment depends on the
positions of the restriction sites, different computational models have been
developed to reconstruct the DNA restriction map from these lengths.
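
To make the difference between these data sets concrete, the sketch below
computes the fragment lengths a gel would ideally report for a linear DNA whose
site positions are already known. It is an illustration only: the site
positions, the DNA length and the function names are hypothetical, and real gel
measurements are approximate.

```python
from itertools import combinations


def complete_digest(sites, length):
    """Lengths of the fragments between consecutive cut points."""
    points = [0] + sorted(sites) + [length]
    return [b - a for a, b in zip(points, points[1:])]


def partial_digest(sites, length):
    """Sorted distances between every pair of cut points (ends included)."""
    points = [0] + sorted(sites) + [length]
    return sorted(b - a for a, b in combinations(points, 2))


length = 10          # hypothetical DNA length
sites_a = [2, 7]     # hypothetical sites of enzyme A
sites_b = [4]        # hypothetical sites of enzyme B

print(complete_digest(sites_a, length))            # single digest A: [2, 5, 3]
print(complete_digest(sites_b, length))            # single digest B: [4, 6]
print(complete_digest(sites_a + sites_b, length))  # double digest:   [2, 2, 3, 3]
print(partial_digest(sites_a, length))             # [2, 3, 5, 7, 8, 10]
```

A gel reports only the lengths, not their order, so the mapping problem is to
recover the site positions from such unordered lists.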

Figure 10.2: Single and Double Digest

Step by Step: A Four-Step Go


1. DNA & Restriction Enzyme Selection: First, select the target DNA to be
mapped, and the restriction enzyme(s) to be used.
2. Incubation: The DNA is then incubated with the selected restriction
enzyme(s) so that it is digested at the restriction sites.
3. Gel Electrophoresis: The approximate lengths of the digested fragments
are then measured through a costly biological lab experiment named gel
electrophoresis.
4. Map Building: Combining several digestions yields a data set of fragment
lengths from gel electrophoresis. From this data set, a computational model
(e.g. the Partial Digest Problem, PDP, or the Double Digest Problem, DDP) is
solved to build a Restriction map that positions the Restriction sites.
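
As an illustration of the map building step, here is a small backtracking
sketch for the Partial Digest Problem. It assumes exact, error-free distances,
which real gel data do not provide; the function name is hypothetical and this
is one standard way to solve PDP, not necessarily the method used by any
particular tool.

```python
from collections import Counter


def partial_digest_maps(distances):
    """Return every set of cut points whose pairwise distances equal `distances`."""
    remaining = Counter(distances)
    width = max(remaining)          # the largest distance is the DNA length
    remaining[width] -= 1
    points = [0, width]
    solutions = []

    def place():
        if sum(remaining.values()) == 0:
            solutions.append(sorted(points))
            return
        largest = max(d for d, n in remaining.items() if n > 0)
        # The largest unexplained distance must be measured from one of the ends.
        for candidate in {largest, width - largest}:
            needed = Counter(abs(candidate - p) for p in points)
            if all(remaining[d] >= n for d, n in needed.items()):
                remaining.subtract(needed)
                points.append(candidate)
                place()
                points.pop()
                remaining.update(needed)   # undo and try the other placement

    place()
    return solutions


print(sorted(partial_digest_maps([2, 3, 5, 7, 8, 10])))
# [[0, 2, 7, 10], [0, 3, 8, 10]]
```

The two answers are mirror images of each other: partial digest data alone
cannot tell a map from its reversal.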

Restriction Mapping of a Known DNA Sequence: If the DNA sequence is known in
advance, a computer program can easily find the restriction sites and DNA
fragments for a particular restriction enzyme (or combination of enzymes) and
build a restriction map for further use without going to the biological lab.
There are many such computer programs; one widely used example is Mapper, part
of the Molecular Toolkit.
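
What such a program does can be sketched in a few lines of Python. This is an
illustrative sketch, not the implementation of Mapper or of any other tool: the
function names and the example sequence are made up, only the upper strand of a
linear DNA is modeled, and EcoRI is taken to cut after the first base of its
site (G^AATTC).

```python
def find_sites(dna, site):
    """0-based start positions of every occurrence of `site` in `dna`."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def digest(dna, site, cut_offset):
    """Cut a linear `dna` `cut_offset` bases into each occurrence of `site`."""
    cuts = [p + cut_offset for p in find_sites(dna, site)]
    bounds = [0] + cuts + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]


dna = "TTGAATTCAAGGGAATTCTT"                      # made-up sequence
fragments = digest(dna, "GAATTC", cut_offset=1)   # EcoRI: G^AATTC

print(find_sites(dna, "GAATTC"))      # [2, 12]
print(fragments)                      # ['TTG', 'AATTCAAGGG', 'AATTCTT']
print([len(f) for f in fragments])    # [3, 10, 7]
```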

10.3.4 Uses of Restriction Mapping


Restriction mapping is a useful way to characterize a particular DNA molecule.
It enables us to locate and isolate DNA fragments for further study and
manipulation. The map lets us know “where we are” in the DNA macromolecule.
High resolution restriction maps are a very important bioinformatics tool in
preparation for DNA sequencing, and even for exploring the DNA sequence that
encodes a particular trait. They are also essential in large scale sequencing
projects.

DNA Sequencing: Every non-viral organism contains a unique genome, which is
encoded by a set of one or more Deoxyribonucleic Acid (DNA) molecules. Although
these molecules have a complex, double-stranded structure, for most purposes
they can be modeled as a string over the Alphabet {A,T,C,G}. Each of these
letters is called a Base, and a series of bases derived from a DNA molecule is
referred to as a Sequence. This string, in essence, provides the Source Code of
the Organism, and hence there is great interest in sequencing, that is,
deriving a representative sequence for the genome of an organism, in order to
better understand it.

Current sequencing technology is capable of quickly and reliably finding the
sequence of DNA fragments of lengths up to approximately 500 bases. Such short
sequences, each produced in a single sequencing step, are called Reads.
Unfortunately, the length of a read is many orders of magnitude smaller than a
typical genome; for example, the human genome contains approximately 3 billion
bases. In these cases, the practice is to cut a large piece of DNA into smaller
fragments, which are then subcloned and sequenced. Restriction mapping is then
an easy way to compare the DNA fragments so that they can be aligned properly.
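
The size gap between a read and a genome can be made concrete with a short
calculation. The sketch below uses the approximate figures quoted above and
assumes, unrealistically, that reads tile the genome perfectly without overlap,
so the result is only a lower bound on the number of reads.

```python
import math

genome_length = 3_000_000_000   # approximate human genome size, in bases
read_length = 500               # approximate maximum read length, in bases

min_reads = math.ceil(genome_length / read_length)
orders_of_magnitude = math.log10(genome_length / read_length)

print(min_reads)                      # 6000000
print(round(orders_of_magnitude, 1))  # 6.8
```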

For example, we may isolate two clones for a gene that are 8 kb and 10 kb long.
We know that they overlap, because the procedure used to isolate them showed
that they have sequence in common. A restriction map tells us by how much they
overlap. From the restriction map information, we can tell which parts of the
two clones are identical and which parts are different.

DNA Cloning: In this kind of experiment, DNA or a DNA fragment is not used as a
single piece. In order to obtain enough DNA mass for mapping studies, the small
DNA fragments are first cloned. Cloning is a method for replicating a piece of
DNA many times, yielding large amounts of DNA. Cloning is frequently carried
out using small circular DNAs called Plasmids. Plasmids used for cloning are
also called Vectors, since they carry the DNA.

Cloning: A Three-Step Go
1. DNA is cleaved into smaller fragments using one of the restriction
enzymes.
2. The vector DNA is cut with the same restriction enzyme.
3. Cut vector DNA and cut target DNA are mixed together and DNA Ligase is
added to join the vector and target DNAs. The ligated DNAs are propagated
in E. coli for Replication.

Figure 10.3: Cloning the Plasmids

Exploring Phylogenetic Relationship: Restriction maps have guided the genomic
DNA sequence assemblies of many Eukaryotic organisms. Sometimes the
Phylogenetic Relationship of a particular organism to a species that already
has a map is so close that organizing the assembled sequences is relatively
straightforward. One example is the relationship of chimpanzee sequence
assemblies to the sequenced human genome. In general, a restriction map is
required as a scaffold on which to assemble the final consensus sequence, and
this map becomes very important when duplications or gaps must be resolved.

Target Region Sequencing: Very often, scientists want to explore a specific
region of DNA that is suspected to contain particular genes. In this case they
use Clone Libraries, which are collections of short segments of DNA that cover
the target region with high redundancy. These segments are called Clones.
Clones can vary greatly in size depending on their type, but all share the
essential property that they can be reliably selected from a library and
replicated. Finally, the clones are aligned with the help of a Restriction map
to give a complete picture of the region.

Figure 10.4: Aligning Overlapping Clones

Study of Fine Structure of a Gene: By using the map to identify the positions
of the clones, one can select a set of clones that completely covers the target
gene with minimal redundancy. This set is called a Tiling Path. By Shotgun
Sequencing only those clones within the tiling path, a researcher can save time
and money. The map can also be useful in identifying corrupted clones, which
are clones that have been damaged by the cloning process. Common sources of
clone damage are Deletion, in which part of the DNA supposedly spanned by the
clone is missing; Chimerism, in which the clone is actually composed of DNA
segments from non-contiguous areas of the target; and Coligation, in which DNA
from the organism used to grow the clone has been added to the clone itself.
Finally, the map can be used to verify the assembled sequence for the target.
Since the features of the physical map are based on features of the underlying
DNA sequence, it is possible and useful to compare the sequence and the map to
verify their consistency.
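
Choosing a tiling path from mapped clones can be treated as an interval covering
problem. The greedy sketch below is one simple way to do it, assuming each
clone's left and right map coordinates are known exactly; the clone names,
coordinates and function name are hypothetical.

```python
def tiling_path(clones, start, end):
    """Greedy minimal set of clones covering the map interval from start to end.

    `clones` maps a clone name to its (left, right) map coordinates.
    Returns the chosen names in order, or None if a gap prevents full coverage.
    """
    path = []
    covered = start
    while covered < end:
        # Among clones that begin at or before the covered edge, take the one
        # that reaches furthest to the right.
        best = max(
            (name for name, (left, _) in clones.items() if left <= covered),
            key=lambda name: clones[name][1],
            default=None,
        )
        if best is None or clones[best][1] <= covered:
            return None
        path.append(best)
        covered = clones[best][1]
    return path


clones = {           # hypothetical clone positions on the map, in kb
    "c1": (0, 40),
    "c2": (10, 60),
    "c3": (35, 90),
    "c4": (55, 100),
    "c5": (85, 130),
}
print(tiling_path(clones, 0, 130))  # ['c1', 'c3', 'c5']
```

Three of the five clones are enough here, so only those three would need to be
shotgun sequenced.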

Figure 10.5: A Generic Physical Map

Analyzing Recombinant DNA: When analyzing Recombinant DNA, restriction maps
make it possible to check the size and orientation of the Insert.
Chapter 11

Sequences Alignment

—Saddam Hossain

11.1 DNA & Protein Sequences Comparison and Alignment

Learning by comparison is a very natural human instinct, and biologists are no
exception: Comparative Analysis has traditionally led to many Legendary
Discoveries. Charles Darwin compared the Morphological Features of the
Galapagos Finches and other species, which led him to propose the Theory of
Natural Selection, often summarized as “survival of the fittest”. In his book
On the Origin of Species (1859), he wrote that the members of the same class,
independently of their habits of life, resemble each other in Unity of Type; in
other words, the several parts and organs in the different species of the same
class are Homologous. This too is a result of the comparative study of species.

Moving beyond the comparison of Morphological Features, biologists have for
about the last 30 years been comparing the genes, genomes and proteins of
different species, and even of the same species, to explore information in
three main streams: the Structural, Functional and Evolutionary Relationships
of genes, genomes and proteins, and eventually of the different species of
organisms. With the “Big Bang” of DNA and Protein Sequencing, the revolution in
the speed and power of computers, and the invention of sophisticated
bioinformatics models, scientists in the Era of Bioinformatics are eager to
compare the Similarities and Differences between the base sequences of DNA and
between the amino acid sequences of proteins to unveil many truths. Molecular
Biologists want to dig even deeper, into the molecular level.

Simply put, whenever we have a protein or DNA sequence and want to find other
sequences that look like it, the question of sequence comparison arises.
Nowadays “Alignment” is a more accepted word than “Comparison”, because the
comparison involves aligning two or more sequences through an explicit mapping
of bases or amino acids. The number of known DNA and protein sequences has
grown explosively since the 1970s. As a result, efficient computational models
and sequence databases for Sequence Alignment became necessary. Sequence
Comparison and Alignment is one of the central areas of study in
Bioinformatics. It is really too large a topic to cover fully in one section of
a chapter.

11.1.1 Sequence Alignment


Sequence Alignment is the procedure of comparing two (Pairwise Alignment) or
more (Multiple Sequence Alignment) sequences by searching for a series of
individual characters or character patterns that are in the same order in the
sequences. It is a way of arranging sequences of DNA, RNA or protein to
identify Regions of Similarity that may be a consequence of functional,
structural or evolutionary relationships between the sequences.

If DNA and protein sequences are represented as series of bases (A, T, C, G)
and amino acids (L, G, P, etc.) respectively, an alignment of sequences Sq1 and
Sq2 is a Two Row Matrix whose first row contains the characters of Sq1 and
whose second row contains the characters of Sq2, each in their original order
and interspersed with spaces (gaps) so that identical characters line up in the
same column. Clearly, a single pair of sequences may have many different
alignments.

Measurement of Alignment: The quality of an alignment is defined by a Score.
Treating the aligned sequences as a 2D Matrix, the score of the alignment is
usually defined as the sum of the scores of its columns. The column score is
usually positive for identical or similar characters and negative (a penalty)
for different characters and for gaps.
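
The column-sum definition translates directly into code. The sketch below
scores the sample alignment of the hypothetical sequences Sq1 and Sq2 from
Figure 11.1; the scoring values (+1 for a match, -1 for a mismatch, -1 for a
gap) and the function name are assumptions chosen for illustration.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap          # one character aligned against a gap
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


row1 = "AT-C-ATTCGTGAT"   # Sq1 with gaps
row2 = "-TGCAATTCGT-A-"   # Sq2 with gaps
print(alignment_score(row1, row2))  # 9 matches and 5 gap columns -> 4
```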

Hypothetical Sequences:
Sq1: ATCATTCGTGAT
Sq2: TGCAATTCGTA

Sample Alignment:
Sq1: AT-C-ATTCGTGAT
      | | |||||| |
Sq2: -TGCAATTCGT-A-

Figure 11.1: Sequence Alignment (Not Yet Drawn)

11.1.2 Motivation for Sequence Alignment


Early research on the Evolutionary Origin of species showed that human and
chimpanzee Globin Genes have a high degree of similarity, which led scientists
to hypothesize that the two species may share a Common Ancestor. It is believed
that there is a Natural Evolutionary Process among organisms. According to
modern science, this evolution is driven by Mutation in the DNA. DNA
Replication errors can cause Substitutions, Insertions and Deletions of
nucleotides that change the original DNA sequence. This leads to
differentiation and divergence among the descendants of a species. An
interesting question then comes to mind: how is one species or organism
evolutionarily linked to another? Do they have a common ancestor? Hypotheses
about this can be derived from alignments of DNA or Protein Sequences.

Another study, in 1983, on the similarities between the cancer-causing v-sis
oncogene and a growth-stimulating factor gave a surprising clue to their common
function. This research was the first success story of supporting a conjecture
through sequence comparison. The v-sis oncogene in the simian sarcoma virus
causes uncontrolled cell growth leading to cancer in monkeys. The seemingly
unrelated Platelet-Derived Growth Factor (PDGF) is a protein that stimulates
and regulates cell growth. When the two sequences were compared, significant
similarity was found, and scientists conjectured that cancer may be caused by a
normal growth gene being switched on at the wrong time. Sequence comparison
thus became established as a method for finding functional links between
proteins and DNA sequences.

11.1.3 Similarity and Homology of Sequences


Two DNA or Protein sequences that have the same ancestor, and usually similar
functions and similar structures, are called Homologues or Homologous
Sequences. In other words, these DNA or Protein sequences are very similar. As
a rule of thumb for what “very similar” means, protein sequences are regarded
as very similar, and as homologous, if at least 25% of their amino acids are
identical. For DNA, at least 70% identity among the bases of the sequences is
required. The range of DNA or protein identity below these thresholds is known
as the Twilight Zone, where nothing is certain about the interpretation of
similarity measures, so neither homology nor non-homology is guaranteed. The
identity measure alone cannot ensure true homology. Another statistical
measurement, the Expectation value (E-value), is needed for comparing
alignments with different similarities and different lengths. It indicates how
likely it is that a sequence matches a database sequence merely by chance, and
thus how far the conclusion can be trusted.

There is a clear distinction between similarity and homology. The similarity of
two sequences is the measure of identity between the bases or amino acids of
the sequences. Similarity is a quantitative measure expressed as a percentage,
whereas homology is a qualitative conclusion drawn from these data that two
genes share a common evolutionary history. Homology is binary: two sequences
are either homologous or non-homologous, with nothing in between. The E-value
is the number of times a match to the database would be expected to occur just
by chance. A match that is very unlikely to occur just by chance is a very good
match. So the lowest E-values are the best: they identify the most significant
matches, which can be trusted to infer homology.
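
Similarity as a percentage can be computed directly from an alignment. The
sketch below is one possible convention, assumed here for illustration: it
divides the number of identical columns by the full alignment length, gap
columns included. Other denominators are also in use, so percentages from
different tools are not always comparable. The function name is hypothetical.

```python
def percent_identity(row1, row2):
    """Percentage of alignment columns that hold two identical residues."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * identical / len(row1)


# Sample alignment of the hypothetical sequences Sq1 and Sq2 (Figure 11.1).
identity = percent_identity("AT-C-ATTCGTGAT", "-TGCAATTCGT-A-")
print(round(identity, 1))  # 9 identical columns out of 14 -> 64.3
```

At 64.3% this DNA alignment falls below the 70% rule of thumb, so identity
alone would not be enough to call the two sequences homologous.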

Homologous genes that share a common ancestry and function in the absence of
any evidence of gene duplication are called Orthologs. When there is evidence
for Gene Duplication, the genes in an evolutionary lineage derived from one of
the copies and with the same function are also referred to as orthologs. The two
copies of the duplicated gene and their progeny in the evolutionary lineage are
referred to as Paralogs. In other cases, similar regions in sequences may not
have a common ancestor but may have arisen independently by two evolution-
ary pathways converging on the same function, called Convergent Evolution.
Such sequences are referred to as Analogous (Fitch 1970).

11.1.4 Type of Sequence Alignment


According to the nature of the alignment, there are two categories of
alignment: Global Alignment and Local Alignment.

Global Alignment: The earliest sequence alignment methods were of a simple
type, finding straightforward relationships through similarities between entire
sequences; this established the concept of global alignment. A Global Alignment
is an alignment that extends throughout the entire sequences, using as many
characters as possible up to both ends of each sequence, and shows the
similarity between the entire sequences.

Figure 11.2: Global Alignment (Not Yet Drawn)

Global alignment is meaningful for comparing sequences from the same class or
family that are highly conserved and of similar length. An example is the
comparison of members of the globin protein family, which is present in
organisms ranging from fruit flies to humans, with members of much the same
length. The global strategy can be applied to any homologous sequences that
have not diverged substantially.

Local Alignment: Very frequently, the alignment score of a pair of
subsequences is larger than the score of the global alignment between the
entire sequences. This scenario is addressed by the local alignment problem,
and the alignment of subsequences instead of entire sequences is called Local
Alignment. In local alignment, the stretches of sequence with the highest
density of matches are aligned, generating one or more islands of matches, or
subalignments, in the aligned sequences.

Figure 11.3: Local Alignment (Not Yet Drawn)

Local alignment is used when the sequences differ in length and may be similar
in one specific area (subsequence) but dissimilar elsewhere, that is, for
sequences that differ in length but share a Conserved Region or Domain. For
example, homeobox genes, which regulate embryonic development, are present in a
large variety of species. Although homeobox genes are very different in
different species, one region of them, which encodes the Homeodomain, is highly
conserved. Local alignment can produce high quality alignments suitable for
Residue-by-Residue analysis. A local alignment consists of paired subsequences
that may be surrounded by residues that are completely unrelated. Another
obvious case where local alignments are desired is the alignment of the
nucleotide sequence of a Spliced mRNA to its genomic sequence, where each exon
would be a distinct local alignment. Proteins that have a significant
biological relationship to one another often share only isolated regions of
sequence similarity. For identifying relationships of this nature, the ability
to find local regions of optimal similarity is an advantage over global
alignment.
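
The best local score can be computed with the dynamic programming recurrence
of Smith and Waterman, in which a cell is never allowed to fall below zero so
that an alignment may start anywhere. The sketch below returns only the score;
the scoring values, the function name and the example sequences are assumptions
chosen for illustration.

```python
def local_alignment_score(s, t, match=2, mismatch=-1, gap=-2):
    """Best alignment score over all pairs of substrings of s and t."""
    previous = [0] * (len(t) + 1)   # scores for the previous character of s
    best = 0
    for a in s:
        current = [0]
        for j, b in enumerate(t, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            # Taking 0 restarts the alignment instead of carrying a bad prefix.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Two made-up sequences that share only the region GATTACA.
print(local_alignment_score("AAAAGATTACA", "GATTACATTTT"))  # 7 matches -> 14
```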

Considering the number of sequences involved in the alignment process,
sequence alignment methods are of two types: Pairwise Sequence Alignment and
Multiple Sequence Alignment.

Pairwise Sequence Alignment: An alignment between two sequences is called a
Pairwise Sequence Alignment. The rest of this section is discussed in terms of
pairwise sequence alignment.

Figure 11.4: Pairwise Sequence Alignment (Not Yet Drawn)

Multiple Sequence Alignment: A Multiple Sequence Alignment shows the alignment
of more than two sequences. Multiple sequence alignment will be discussed in
the next section of this chapter.

Figure 11.5: Multiple Sequence Alignment (Not Yet Drawn)

11.1.5 Computational Methods & Models for Sequence Alignment

Sequence comparison is one of the most difficult tasks for biologists. Most
computational models for sequence alignment use insertion, deletion and
substitution, or a slightly different set of operations, on nucleotides or
amino acids to incorporate the concept of mutation into the alignment model. In
general, finding differences is equivalent to finding similarities, because
these models look for alignments of sequences in terms of alignment distance or
edit distance.
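
Edit distance counts the fewest insertions, deletions and substitutions needed
to turn one sequence into the other. A minimal sketch follows, assuming every
operation costs 1 (the function name and the example sequences are
illustrative).

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    previous = list(range(len(t) + 1))   # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        current = [i]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "AGT"))   # delete C -> 1
print(edit_distance("ACGT", "ACCT"))  # substitute G by C -> 1
print(edit_distance("ACGT", "ACGT"))  # identical -> 0
```

A small distance corresponds to a high similarity, which is why the two views
lead to the same alignments.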

Though the computational models become more efficient every day, sequence
databases are growing even faster. As a result, methods that have been in use
for a long time, and that may have performed extraordinarily well 20 years ago,
are now too slow to search a sequence database of 10^9 entries. Nowadays
performance is therefore gained through parallel hardware implementations or
through fast heuristics that usually work well but are not guaranteed to find
the closest match in every case. There is a lot of scope for future work on
improving both performance and accuracy.

Many methods for sequence alignment have evolved. The most basic and widely
used are Dot Matrix and Dynamic Programming, which are discussed in the
following sections.

11.1.5.1 Dot Matrix


The Dot Matrix method was first described by Gibbs and McIntyre (1970). It is a
good strategy to start with a Dot Matrix, because it is a very powerful tool
for getting an instant general picture of a pairwise alignment, which can help
decide the next step of the analysis. A Dot Matrix is not enough for
fine-grained examination. It does not produce a real alignment; rather, it
simply provides general guidance. It also helps identify potential repeats
within each sequence. The real alignment comes from global or local alignment
methods. Let’s have a look at the Dot Matrix of two hypothetical sequences in
the next paragraph.

Preparation of Dot Matrix: A Dot Matrix is a 2D matrix or grid with one
sequence written along the top from left to right and the other sequence
written down the left side from top to bottom. For simplicity, it can be
thought of as a Tic-Tac-Toe board. After the sequences have been placed along
the top and the left of the matrix, a dot (or, for easier reading, a cross) is
put in every cell whose row character and column character are identical. Any
region of similar sequence is revealed by a diagonal row of dots. Isolated dots
not on a diagonal represent random matches that are probably not related to any
significant alignment.
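
The construction can be sketched in a few lines. In this illustration a cross
(x) marks a cell with identical characters and a period marks an empty cell;
the function name and the two short sequences are made up.

```python
def dot_matrix(top, left):
    """One text row per character of `left`; 'x' marks identical characters."""
    rows = ["  " + top]                      # header row: the top sequence
    for a in left:
        rows.append(a + " " + "".join("x" if a == b else "." for b in top))
    return rows


for row in dot_matrix("ATTCGT", "TTCG"):
    print(row)
#   ATTCGT
# T .xx..x
# T .xx..x
# C ...x..
# G ....x.
```

The run of crosses from the first T row down to the G row is the diagonal that
reveals the shared region TTCG; the remaining crosses are isolated random
matches.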

Figure 11.6: Dot Matrix (Not Yet Drawn)

Usage of Dot Matrix: Unless the sequences are known to be very much
alike, the dot matrix method should be used first, because this method displays
any possible sequence alignments as diagonals on the matrix which is needed
for general exploration of the sequences. Dot matrix can readily reveal the
presence of insertion/deletions and direct and inverted repeats that are more
difficult to find by the other, more automated methods. The major limitation of
the method is that most dot matrix computer programs do not show an actual
alignment.

Dot Matrix is also used for predicting regions in RNA that are
self-complementary and that, therefore, have the potential to form secondary
structure. The major advantage of the dot matrix method for finding sequence
alignments is that all possible matches of residues between two sequences are
found, leaving the investigator the choice of identifying the most significant
ones. The sequences of the actual regions that align can then be determined
afterwards by dynamic programming.

Dot Matrix can reveal complex relationships involving multiple regions of
local similarity. In a dot matrix representation, certain patterns of dots may
appear to sketch out a “path”, but it is up to the biologist to deduce the
alignment from this information. This graphical representation, known as a path
graph, provides an explicit representation of an alignment.

A dot matrix can also reveal the presence of repeats of the same sequence
characters many times. These repeats become apparent on the dot matrix of a
protein sequence against itself as horizontal or vertical rows of dots that some-
times merge into rectangular or square patterns.

11.1.5.2 Dynamic Programming


Dynamic programming is a computational method that is used to align two
protein or nucleic acid sequences. The method is very important for sequence
analysis because it provides the very best, or optimal, alignment between
sequences. Both global and local alignments can be made by simple changes to
the basic dynamic programming algorithm. The dynamic programming method was
first used for the global alignment of sequences by Needleman and Wunsch
(1970); for local alignment, Smith and Waterman (1981) proposed a clever
modification of the dynamic programming algorithm.

Preparation of Dynamic Programming Model: The method compares every pair of
characters in the two sequences and generates an alignment. This alignment
includes matched and mismatched characters and gaps in the two sequences,
positioned so that the number of matches between identical or related
characters is the maximum possible. The dynamic programming algorithm provides
a reliable computational method for aligning DNA and protein sequences. The
method has been proven mathematically to produce the best, or optimal,
alignment between two sequences under a given set of match conditions. The
procedure generates a matrix of numbers that represents all possible alignments
between the sequences. The highest-scoring path of consecutive scores through
the matrix defines an optimal alignment. A scoring scheme must be built
beforehand. For example, a very general scoring scheme may assign positive
scores for aligning identical residues and negative scores for substitutions
and gaps.
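
The procedure can be sketched as follows for global alignment, in the style of
Needleman and Wunsch. This is a minimal illustration, not a production
implementation: the scoring values (+1 match, -1 mismatch, -1 per gap
character) and the function name are assumptions, and when several alignments
are optimal only one of them is returned.

```python
def global_alignment(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    rows, cols = len(s) + 1, len(t) + 1
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                score[i - 1][j] + gap,       # s[i-1] against a gap
                score[i][j - 1] + gap,       # t[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

Changing the scoring values changes which alignment is optimal, so the scoring
scheme is as much a part of the model as the algorithm itself.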

Usage of Dynamic Programming Model: This method outputs optimal alignments,
which give biologists the best possible information about sequence
relationships: which characters in a sequence should be in the same column of
an alignment, and which are insertions in one of the sequences (or deletions in
the other). This information is important for making functional, structural and
evolutionary predictions on the basis of sequence alignments.

Fortunately, experience with the dynamic programming method has provided much
help for making the best choices, and dynamic programming has become widely
used. The dynamic programming method can, however, be slow because of the very
large number of computational steps, which increases approximately as the
square or cube of the sequence lengths. The computer memory requirement also
increases as the square of the sequence lengths. Thus it is difficult to use
this method for very long sequences. Fortunately, computer scientists have
greatly reduced these time and space requirements, bringing the memory
requirement in particular down to a linear relationship, without compromising
the reliability of the dynamic programming method.

Another feature of the dynamic programming algorithm is that the alignments
obtained depend on the choice of a scoring system for comparing character
pairs and on the gap penalty scores.

It is important to bear in mind that optimal methods always report the best
alignment that can be achieved, even if it has no biological meaning. On the
other hand, when searching for local alignments there may be several significant
alignments, so it is a mistake to look only at the optimal one.
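
For local alignment, the Smith-Waterman modification resets negative scores to
zero and reports the best cell anywhere in the matrix. The sketch below
computes only the best local score and keeps just two rows in memory; the
function name and the scoring values are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (score only)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the score matrix
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)       # current row
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may start anywhere, so scores never go below 0.
            cur[j] = max(0,
                         prev[j - 1] + s,
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Because an optimal score is always reported, even unrelated sequences usually
receive a small positive score; judging whether that score is meaningful is a
separate statistical question.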

11.1.6 Importance of Sequence Alignment


Similar sequences often derive from the same ancestral sequence. If some
sequences are similar they probably have the same ancestor, share the same
structure, and have a similar biological function. This principle, "if
something is true for a sequence, it is probably true for similar sequences",
works even when the sequences come from very different organisms. So, using
sequence alignments, information about a known DNA or protein sequence can be
extrapolated to all similar DNA and protein sequences.

Sequence alignment is useful for discovering functional, structural, and
evolutionary information in biological sequences. It is important to obtain
the best possible or so-called "optimal" alignment to discover this
information. Sequences that are very much alike, or "similar" in the parlance
of sequence analysis, probably have the same function, be it a regulatory role
in the case of similar DNA molecules, or a similar biochemical function and
three-dimensional structure in the case of proteins. Additionally, if two
sequences from different organisms are similar, there may have been a common
ancestor sequence, and the sequences are then defined as being homologous. The
alignment indicates the changes that could have occurred between the
homologous sequences and a common ancestor sequence during evolution.

With the advent of genome analysis and large-scale sequence comparisons, it
becomes important to recognize that sequence similarity may be an indicator
of several possible types of ancestor relationships, or there may be no
ancestor relationship at all. New gene evolution is often thought to occur by
gene duplication, creating two tandem copies of the gene, followed by mutation
in these copies. In rare cases, new mutations in one of the copies provide an
advantageous change in function. The two copies may then evolve along separate
pathways. Although the resulting separation of function will generate two
related sequence families, sequences among both families will still be similar
due to the single gene ancestor. In addition, genetic rearrangements can
reassort domains in proteins, leading to more complex proteins with an
evolutionary history that is difficult to reconstruct.

A further complication in tracing the origins of similar sequences is that
individual genes may not share the same evolutionary origin as the rest of the
genome in which they presently reside. Genetic events such as symbioses and
virus-induced transduction can cause horizontal transfer of genetic material
between unrelated organisms. In such cases, the evolutionary history of the
transferred sequences and that of the organisms will be different. Again, with
the capability of detecting such events in the genomes of organisms comes the
responsibility to describe these changes with the correct evolutionary
terminology. In this case, the sequences are xenologous.

11.1.7 Sequence Alignment Tools


FASTA: The first widely used program for database similarity searching was
FASTA (Lipman and Pearson, 1985; Pearson and Lipman, 1988; Pearson, 2000).
The program PLALIGN in the FASTA package may be used to display a dot matrix.

BLAST: BLAST is the most widely used tool for aligning a sequence against a
sequence database. A more sensitive variant of BLAST, named PSI-BLAST, can
answer some biological queries that BLAST cannot. The BLAST programs
introduced a number of refinements to database searching that improved
overall search speed and put database searching on a firm statistical
foundation (Altschul et al., 1990).

11.2 Multiple Sequence Alignment


DNA sequences and protein sequences of different organisms are often related.
This is because similar genes are present across widely divergent species,
either with similar or identical functions or with evolutionary changes,
arising through mutation or rearrangement under natural selection, that
reflect altered functions. As a result many genes or patterns remain highly
conserved in the genomes, and a pairwise comparison or alignment cannot always
reveal this. To explore this kind of conserved pattern it is necessary to
compare and align multiple sequences simultaneously. This is the need and
motivation behind the development of Multiple Sequence Alignment. "One amino
acid sequence plays coy; a pair of homologous sequences whisper; many aligned
sequences shout out loud."

Multiple Sequence Alignment is the process of aligning more than two sequences
simultaneously. As an illustration, take four hypothetical protein sequences:
SeqA, SeqB, SeqC and SeqD. Their multiple sequence alignment is shown below,
with a substitution (F/Y), a deletion (L) and an insertion (K). The tree in
Figure 11.7 is an evolutionary tree for this group of sequences.

Sequences
SeqA: NFLS
SeqB: NFS
SeqC: NKYLS
SeqD: NYLS

Multiple Sequence Alignment

SeqA: N * F L S
SeqB: N * F - S
SeqC: N K Y L S
SeqD: N * Y L S

Figure 11.7: Multiple Sequence Alignment - Evolutionary Tree
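
As a small sketch of how an alignment like the one above can be inspected
programmatically, the code below summarizes each column by its most common
residue and its number of gaps. The rows are copied from the alignment above;
treating both "*" and "-" as gap symbols, and the function name, are
assumptions made for this example.

```python
from collections import Counter

# Rows copied from the alignment above; "*" and "-" are both read as gaps.
alignment = {
    "SeqA": "N*FLS",
    "SeqB": "N*F-S",
    "SeqC": "NKYLS",
    "SeqD": "N*YLS",
}
GAPS = {"*", "-"}

def column_summary(rows):
    """Return (most common residue, its count, gap count) for every column."""
    summary = []
    for column in zip(*rows):
        residues = [c for c in column if c not in GAPS]
        gaps = len(column) - len(residues)
        if residues:
            residue, count = Counter(residues).most_common(1)[0]
        else:
            residue, count = "-", 0    # column made only of gaps
        summary.append((residue, count, gaps))
    return summary

for position, entry in enumerate(column_summary(alignment.values()), start=1):
    print(position, entry)
```

Columns where one residue appears in every row (here the first and last) are
the fully conserved positions; the second column records the insertion of K.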



11.2.1 Methods for Multiple Sequence Alignment


Multiple Sequence Alignment is essentially a computational problem, and it
poses several computational challenges. The traditional Dynamic Programming
model, which is well suited to Pairwise Sequence Alignment, can be extended to
align more sequences. But this is very difficult in practice, because for more
than three sequences only a small number of relatively short sequences can be
analyzed. As a result different approximation methods are used, some of which
are described below.

11.2.1.1 Dynamic Programming based Models


Progressive Global Alignment is a multiple sequence alignment method that
uses dynamic programming for its pairwise alignment steps. In this method,
pairwise alignments are done first among the most alike sequences. Then the
alignment is built up by adding more sequences. Another approach is the
Iterative Model, which also uses dynamic programming. In the iterative model,
alignments are first made for several groups or classes of sequences. These
group alignments are then aligned with each other to obtain a more reasonable
overall alignment.

The major problem with the progressive alignment method described above
is that errors in the initial alignments of the most closely related sequences are
propagated to the multiple sequence alignment. This problem is more acute
when the starting alignments are between more distantly related sequences.
Iterative models attempt to correct for this problem by repeatedly realigning
subgroups of the sequences and then by aligning these subgroups into a global
alignment.

But there is an inherent problem with the dynamic programming model: finding a
reasonable scoring matrix, a problem that gets more complex when more than two
sequences are involved simultaneously. The dynamic programming matrix also
grows exponentially in size (as the sequence length raised to the power of the
number of sequences). As a result the computational complexity and storage
requirements grow rapidly, and for larger numbers of sequences the method
becomes impractical. Dynamic programming is practical for about three
sequences of modest length. So the challenge for a multiple sequence alignment
method is to utilize an appropriate combination of sequence weighting, scoring
matrix, and gap penalties.
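
A short calculation makes the growth concrete. The sketch below counts the
cells of the dynamic programming lattice for a set of sequences; the function
name and the example length of 300 residues are assumptions chosen only for
illustration.

```python
def dp_cells(lengths):
    """Number of cells in the dynamic programming lattice for the given lengths.

    Each sequence of length n contributes one axis with n + 1 positions.
    """
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

# Hypothetical example: sequences of 300 residues each.
for k in (2, 3, 4, 5):
    print(k, "sequences:", dp_cells([300] * k), "cells")
```

Each added sequence multiplies the number of cells by roughly the sequence
length, which is why exact dynamic programming is limited to a few short
sequences.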

11.2.1.2 Statistical Methods and Probabilistic Models


Different statistical and probabilistic methods are used to approximate the
Multiple Sequence Alignment model. The most widely used statistical and
probabilistic model is the Hidden Markov Model (HMM), which considers all
possible combinations of matches, mismatches, and gaps to generate an
alignment of a set of sequences. HMMs often provide a multiple sequence
alignment as good as, if not better than, other methods. A model of a sequence
family is first produced and initialized with prior information about the
sequences. The model is then trained on a sufficiently large set of sequences.
The trained model is used to produce the most probable multiple sequence
alignment as posterior information. The advantages are that the model is based
entirely on probability theory, no sequence ordering is needed,
insertion/deletion penalties are not needed, and experimentally derived
information can be used.

11.2.2 Usage of Multiple Sequence Alignment


Alignment of a pair of DNA or protein sequences depicts the relationship
between the two sequences, whereas a multiple sequence alignment shows which
regions or classes of sequences a given sequence most resembles. In proteins,
such information may reveal conserved functional and structural domains. For
DNA sequences it reveals information about evolutionary relationships.

A Multiple Sequence Alignment reflects the evolutionary history of the
sequences. If the sequences align well in the Multiple Sequence Alignment,
they are likely to be derived from a common ancestor sequence; if the
alignment score is poor, they may share only a distant evolutionary
relationship. This leads to the discovery of evolutionary relationships among
the sequences.

The goal of protein sequence comparison is to discover structural or functional
similarities among proteins. Biologically similar proteins may not exhibit a
strong sequence similarity, but we would still like to recognize resemblance
even when the sequences share only weak similarities. If sequence similarity is
weak, pairwise alignment can fail to identify biologically related sequences
because weak pairwise similarities may fail statistical tests for significance.
However, simultaneous comparison of many sequences often allows one to find
similarities that are invisible in pairwise sequence comparison.

11.2.3 Tools for Multiple Sequence Alignment


• CLUSTALW

• HMMER

11.3 Regulatory Motif Finding


It does no harm to say that the complete program of an organism is written
into its DNA sequence in a DNA linguistics or genetic language, and the whole
world is searching for the Holy Grail of this language in order to read the
message inscribed in the organism. Scientists have completely sequenced the
DNA of many organisms, but its interpretation is still far away. Only the
first step has been done: sequencing the DNA (the encoded message). Still to
do: decoding the message (the source code of life).

Consider the case of fruit flies. Lacking a sophisticated immune system, fruit
flies are prone to bacterial infection. Luckily they have a small set of
immunity genes that usually remain dormant and are switched on when the fly is
infected. When these genes are turned on, proteins are produced that destroy
the pathogen and cure the infection. Using DNA arrays, a biologist can run a
laboratory experiment comparing infected and uninfected flies to determine
what triggers the activation of the immunity genes. The DNA sequence that
switches on an immunity gene, by encouraging RNA polymerase to transcribe the
gene so that its protein can be made, is called a Regulatory Motif.

11.3.1 Gene-Regulation & Regulatory Motif


Cells contain many Transcription Factors (TF), which are proteins that
control gene expression, the process of creating another protein from the
DNA sequence of a gene. Every gene contains a Regulatory Region (RR),
typically stretching 100-1000 bp upstream of the Transcriptional Start Site
(TSS). Transcription Factor Binding Sites (TFBS) lie within the Regulatory
Region; the corresponding Transcription Factor binds there to initiate the
gene expression process. Transcription Factor Binding Sites are also known as
Motifs, and each is specific for a given transcription factor. A TFBS can be
located anywhere within the Regulatory Region. A TFBS may vary slightly across
different regulatory regions, since non-essential bases can mutate.
Transcriptional regulation can either enhance or repress the expression of a
gene. So regulation of genes encompasses the process of deciding WHAT gene(s)
will turn on (producing protein) or off, and WHEN and WHERE. Regulatory
Motifs regulate gene expression by attracting RNA polymerase.

Motifs are usually short sequences (5-25 bp). Graphically, a motif is
represented by a special type of diagram called a Motif Logo, which shows the
conserved and variable positions of the motif by drawing the symbols at each
position in different sizes.
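
One common way to decide how tall the stack of letters at each logo position
should be is the information content of that column. The sketch below computes
it for a set of aligned binding sites. The function name and the example sites
are hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter

def column_information(sites):
    """Information content, in bits, of each column of equal-length DNA sites."""
    info = []
    for column in zip(*sites):
        total = len(column)
        counts = Counter(column)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        # 2 bits is the maximum for a four-letter alphabet.
        info.append(2.0 - entropy)
    return info

# Hypothetical aligned binding sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
print([round(x, 2) for x in column_information(sites)])
```

A fully conserved position scores 2 bits and is drawn tall; a position where
all four bases are equally frequent scores 0 bits and is drawn flat.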

11.3.2 Motif Discovery Methods


There is an analogy between Regulatory Motif Finding and the story "The
Gold-Bug" by Edgar Allan Poe. In the story, William Legrand finds a parchment
written by the pirate Captain Kidd, encrypted as shown below.

53++!305))6*;4826)4+.)4+);806*;48!860))85;]8*:+*8!83(88)5*!;46(;88*96*?;8)
*+(;485);5*!2:*+(;4956*2(5*-4)88*; 4069285);)6!8)4++;1(+9;48081;8:8+1;48!8
5;4)485!528806*81(+9;48;(88;4(+?34;48)4+;161;:188;+?;

Figure 11.8: Gene Regulation

Figure 11.9: Motif Logo

Figure 11.10: Motif - Schematic Diagram

He assumed that the message was written in English and that each letter had
been replaced by a symbol. He first tried to solve the puzzle by applying the
frequency distribution of letters in the English language, but this produced a
meaningless result. He then tried mapping three-letter words and found the way
to success by mapping ";48" to "the" (the most frequent three-letter word in
the English language). Based on this he deciphered the complete message, which
was:

A GOOD GLASS IN THE BISHOPS HOSTEL IN THE DEVILS SEAT, TWENTY
ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH,
MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT
EYE OF THE DEATHS HEAD A BEE LINE FROM THE TREE THROUGH
THE SHOT, FIFTY FEET OUT.
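
The counting step in this story, looking for the most frequent three-symbol
words, is easy to sketch in code, and the same idea applies to counting
frequent DNA words. The function name and the toy input below are assumptions
made for illustration; the function can equally be run on the ciphertext above
or on a DNA sequence.

```python
from collections import Counter

def most_common_ngrams(text, n=3, top=3):
    """Return the `top` most frequent substrings of length n with their counts."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(top)

# Toy example with the spaces removed, as in the encrypted message.
sample = "THECATANDTHEDOGANDTHEBIRD"
print(most_common_ngrams(sample, n=3, top=1))
```

Overlapping windows are counted, so a text of length L contributes L - n + 1
substrings of length n.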

Unfortunately, DNA texts are not that easy to decipher. Neither the DNA
linguistics nor the genetic grammar is known, and there is no dictionary of
motifs at hand. All we know is that frequent or rare DNA words (substrings)
may carry some signal about the genetic language of the organism.

Direct experimental determination of regulatory motifs is not practical or
efficient in many biological systems. Although regulatory motifs are usually
short, motifs of the same type or class may show some degree of sequence
variation, and this makes it harder and more complex to find regulatory motifs
computationally.

The complications arise because we do not know the motif sequence beforehand,
we do not know where it is located relative to the start of the gene, and the
motif can differ slightly from one gene to the next. Finding a pattern without
any solid prior knowledge of it is really tough for any computational model.

At the top level, the Regulatory Motif Finding problem first involves a
multiple sequence local alignment to create an alignment of l-mers (substrings
of length l). Then profiling is done to score the alignment and build a
consensus motif string, which is thought to be the ancestor motif of the
corresponding function.

Figure 11.11: Motif Finding

Two distinct approaches have been developed to solve this problem. One is the
combinatorial approach and the other is the probabilistic approach. The
combinatorial approach includes Brute Force Motif Finding, the Median String
Problem, Search Trees, Search Trees with Branch-and-Bound Techniques,
Consensus-based Greedy Motif Search and Exhaustive Motif Search. Probabilistic
approaches use Expectation Maximization, Profile Hidden Markov Models (HMM),
etc.
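
As an illustration of the combinatorial approach, the sketch below solves the
Median String Problem by brute force: it tries every possible l-mer and keeps
the one whose total distance to the sequences is smallest. The function names
and the toy sequences are assumptions; since all 4^l patterns are enumerated,
this is only feasible for short motifs.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(pattern, sequence):
    """Smallest Hamming distance between pattern and any l-mer of sequence."""
    l = len(pattern)
    return min(hamming(pattern, sequence[i:i + l])
               for i in range(len(sequence) - l + 1))

def median_string(sequences, l):
    """Return (pattern, total distance) minimizing the summed distance."""
    best_pattern, best_total = None, None
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        total = sum(min_distance(pattern, s) for s in sequences)
        if best_total is None or total < best_total:
            best_pattern, best_total = pattern, total
    return best_pattern, best_total

# Toy sequences with the 4-mer ACGT implanted in each (illustrative only).
print(median_string(["TTACGTTT", "GGGACGTG", "ACGTCCCC"], 4))
```

Branch-and-bound search trees, mentioned above, prune this enumeration by
abandoning a partial pattern as soon as its distance already exceeds the best
total found so far.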

11.3.3 Tools for Motif Finding


• MEME
• CONSENSUS

• IDENTIFY
• SCAN
• MOTIFS
Chapter 12

Gene Prediction

—Saddam Hossain

12.1 Introduction to Genome Annotation & Gene Prediction

The genomes of hundreds of organisms have already been entirely sequenced
and thousands more are in production, but this is not the end of genome
projects. Decoding the message written into the raw DNA sequence in its own
DNA linguistics is the next challenge. With this motivation, scientists
worldwide are exploring raw DNA sequences and depositing the resulting
knowledge about each genome in a standard form, which is called genome
annotation. Genome Annotation is the process of accumulating pertinent
information about a raw DNA sequence. In this process the coding regions,
called genes, are identified. Genome annotation encompasses the
identification and description of the different parts of a gene, such as its
regulatory regions and the parts that are transcribed into mRNA and eventually
translated into a particular protein. It also includes gene product names,
functional and physical characteristics of the gene or protein, and finally
the overall metabolic profile of the genome. Genome annotation combines manual
and automatic methods. Automated computational analysis does the preliminary
job of annotation, and high-quality annotation is achieved through manual
review. Examples of well-annotated genomes so far are those of yeast, fruit
fly, mouse and human.

Among the different elements of the annotation process, such as homology
search and functional assignment, gene finding or prediction is the most basic
and the most important. Conceptually, Gene Prediction is the process of
detecting meaningful signals in raw, uncharacterized DNA sequence in order to
extract the knowledge about the corresponding organism that is inscribed in
it, like separating the wheat from the chaff. From the perspective of
bioinformatics, gene prediction is the process of recognizing protein-coding
regions or genes in the genomic sequence. Another part of gene prediction is
promoter prediction, the identification of sequences that regulate the
activity of protein-coding genes.

Figure 12.1: Gene Prediction

12.1.1 Gene Finding Principles and Guidelines


Gene finding is one of the first and most important steps in understanding the
genome of a species once it has been sequenced. Gene prediction speeds up wet
lab work by providing estimates of gene locations that complement the
experiments. In prokaryotic organisms the genes are not interrupted by
introns; each gene is a contiguous stretch of DNA. In eukaryotic organisms
genes are sparse, and even a single gene is not a contiguous sequence of
nucleotides. It is notable that among 3 billion base pairs there may be
30,000-100,000 genes, so the coding fraction is <1%. The following picture
shows a hypothetical view of gene coding density in different organisms.

Figure 12.2: Gene Coding Density



From the central dogma of life, it follows that a part of the raw DNA
sequence is transcribed into mRNA and eventually used to synthesize a
particular protein, through the steps of pre-transcription, transcription,
splicing and translation. The raw DNA sequence that undergoes transcription is
called the Transcribed Region or Gene Coding Segment (CDS). Between two CDSs
there are regulatory segments, organized in the upstream area of a gene. The
upstream area comprises the Enhancer, Upstream Promoter, Motif, Core Promoter,
GC-box, CAAT-box, TATA-box, INR-box, Transcription Start Site (TSS), etc. This
gene-upstream region is also called the Flanking Region. In simple terms, the
whole genome is an N-fold repetition of Flanking Region and Coding Segment
pairs.

During gene prediction or identification, finding the gene coding segment is
the central focus. A coding segment consists of two types of segments, exons
(the coding sequence of the gene) and introns (sequence that is not translated
into protein), and four types of signals: start codons (ATG), donor splice
sites (usually GT), acceptor splice sites (usually AG) and stop codons (TAG,
TGA, TAA). There are in turn four types of exons: (i) initial exons, which
extend from a start codon to the first donor site; (ii) internal exons, which
extend from one acceptor site to the next donor site; (iii) final exons, which
extend from the last acceptor site to the stop codon; and (iv) single exons,
found in intronless genes, which are not interrupted by non-coding segments.

Except for extrinsic or evidence-based gene prediction, all methods are based
on locating and scoring the gene markers mentioned above (start and stop
codons, splice sites, promoters, etc.). This is a complex task, and the
complexity increases as the non-coding (intron) regions grow and the coding
(exon) regions shrink in length. Finding splice sites is also difficult
because the dinucleotides GT and AG appear very often. As a result, gene
prediction typically starts with ab-initio methods, and the predictions are
then repeatedly refined and evaluated using approximation methods; it is
really hard to obtain an exact or perfect prediction every time.

The simplest method for finding DNA sequences that encode proteins or
represent genes is to search for Open Reading Frames (ORF). An open reading
frame is a DNA sequence that contains a contiguous set of codons, each of
which specifies an amino acid, uninterrupted by a stop codon. There are six
possible reading frames.

For example, the sequence of DNA in Figure 12.4 can be read in six reading
frames, three in the forward and three in the reverse direction. The three
reading frames in the forward direction are shown with the translated amino
acids below each DNA sequence. Frame 1 starts with the "a", Frame 2 with the
"t" and Frame 3 with the "g". Stop codons are indicated by an "*" in the
protein sequence. The longest ORF is in Frame 1.
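
The six-frame search can be sketched as follows. This is a minimal example
that reports stretches running from an ATG start codon to the next in-frame
stop codon on both strands; the function names and the toy sequence are
assumptions, and real gene finders use many more signals than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def orfs_in_frame(dna, offset):
    """Yield ATG...stop stretches (stop codon included) read from `offset`."""
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens the ORF
        elif start is not None and codon in STOP_CODONS:
            yield dna[start:i + 3]         # in-frame stop codon closes it
            start = None

def six_frame_orfs(dna):
    """Return (strand, frame number, ORF) tuples for all six reading frames."""
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for orf in orfs_in_frame(seq, offset):
                found.append((strand, offset + 1, orf))
    return found

# Toy sequence for illustration: one short ORF in forward frame 1.
print(six_frame_orfs("atgaaatag"))
```

In practice a minimum length is usually imposed on the reported ORFs, since
short ATG-to-stop stretches occur by chance in any sequence.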

There are some issues regarding gene prediction. The first is the size of
the genome: the larger the genome, the more genes it contains and the more
complex they are to find. More complex genomes also tend to have a lower
coding density, that is, fewer genes per kbp. It is assumed that long ORFs
tend to be coding. As the ratio of coding to non-coding length decreases,
exon or gene prediction becomes more complex.

Figure 12.3: Gene Segments

Figure 12.4: Open Reading Frames

12.1.2 Gene Prediction Approaches


Gene prediction approaches start with extrinsic gene finding and extend it
with ab-initio prediction. After that the predicted model is tuned with
different computational and statistical methods, as depicted below.

Figure 12.5: Gene Prediction Methodologies

12.1.2.1 Extrinsic approaches


In extrinsic (or evidence-based) gene finding systems, the target genome is
searched for sequences that are similar to extrinsic evidence in the form of
the known sequence of a messenger RNA (mRNA) or protein product. Given
an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from
which it had to have been transcribed. Given a protein sequence, a family
of possible coding DNA sequences can be derived by reverse translation of the
genetic code. Once candidate DNA sequences have been determined, it is a
relatively straightforward algorithmic problem to efficiently search a target
genome for matches, complete or partial, and exact or inexact. BLAST is a
widely used system designed for this purpose.
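
To give a feel for the size of the family of coding sequences obtained by
reverse translation, the sketch below multiplies the number of codons that
encode each amino acid in the standard genetic code. The function name is an
assumption made for this example, and stop codons are ignored.

```python
# Number of codons for each amino acid in the standard genetic code
# (one-letter amino acid codes; the 61 sense codons in total).
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def count_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to the given protein."""
    total = 1
    for amino_acid in protein.upper():
        total *= CODONS_PER_AMINO_ACID[amino_acid]
    return total

# The four-residue peptide NFLS already has 2 * 2 * 6 * 6 candidates.
print(count_coding_sequences("NFLS"))
```

Because the count grows multiplicatively with protein length, search tools
generally compare at the protein level (translating the genome in all frames)
rather than enumerating every candidate DNA sequence.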

12.1.2.2 Ab-initio Gene Prediction


Because of the inherent expense and difficulty of obtaining extrinsic evidence
for many genes, it is also necessary to resort to ab-initio gene finding, in
which the genomic DNA sequence alone is systematically searched for certain
tell-tale signs of protein-coding genes. Ab-initio means "from the beginning":
ab-initio prediction is the prediction of genes from raw DNA sequence via a
rule-based and evidence-based gene model. An ORF has several features, such as
its size, exons, introns, coding segments, DNA composition in terms of codon
usage, Kozak sequence (CCGCCAUGG), ribosome binding sites, gene start signals
or start codons, termination signals or stop codons, transcription start site
(TSS), splice junction boundaries, etc. Prediction of a coding segment (CDS)
can be thought of as prediction of a linear series of sequence features. An
ab-initio predictor locates and scores all these sequence features. The
prediction model can be built using a Dynamic Programming Model, Markov Model,
Neural Network Model, Decision Tree, or an integration of various statistical
approaches. Among these, the Hidden Markov Model (HMM) is the most widely
used. The prediction model needs to be trained on a specific organism.

The ab-initio prediction method needs a good set of training data to
evaluate the statistical likelihood that a prediction is real. Ab-initio
prediction is never perfect; it has high false positive rates. Incorporating
a similarity test may reduce the false positive rate, but it will increase the
false negative rate. The output of this method is rarely used as a final
product, but rather as a starting point.

12.1.2.3 Comparative Gene Prediction


Every cell knows how to splice a gene together; if only humans knew as well!
We know some of the signals involved, but not yet all of them. Another method
of gene prediction uses the knowledge of known coding sequences to identify
regions of genomic DNA by similarity comparison with known examples from
genome (related genomic sequence), transcriptome (transcribed DNA sequence)
and proteome (peptide sequence) databases; this is called comparative gene
prediction. It is an extension of the ab-initio prediction method.
Statistical approaches are applied in this method. HMM models are widely used
to predict the closest CDS to a supplied peptide or nucleotide sequence. The
statistical methods also use EST alignment to predict intron/exon boundaries.

12.1.2.4 Homology-based Methods


The homology-based approach identifies genes with the aid of experimental
data. This approach exploits the alignment of gene sequences between genomic
data and known cDNA (or protein) databases. The homology-based methods in use
include Local Alignment Methods and Pattern-based Alignment Methods.

12.1.3 Gene Prediction Tools


• SNAP
• TwinScan
• Gnomon (NCBI)
• GeneWise

• Jigsaw
• GLEAN
• Grail

• BLAST
• FASTAX
• BLAT
• WABA

• MZEF
• MZEF-SPC
• FGENESH
Chapter 13

Genome Analysis

—Saddam Hossain


For Draft
The entire DNA content of the cell is known as the genome. A segment of the
genome that is transcribed into RNA is called a gene. Put simply, Genome
Analysis is the process of analysing the genome.

Segments of the genome called genes determine the sequence of amino acids in
proteins. The mechanism is simple for the prokaryotic cell, where each gene is
converted in full into the corresponding mRNA (messenger ribonucleic acid) and
then into protein. The process is more complex for eukaryotic cells, where
only some parts of a gene, called exons, are expressed in the form of mRNA;
the exons are interrupted at places by DNA sequences called introns. One of
the several questions posed here is how some parts of the genome come to be
expressed as proteins while other parts (introns as well as intergenic
regions) are not expressed.

The genome analysis problem entails the prediction of genes in
uncharacterized genomic sequences. The 21st century has seen the announcement
of the draft version of the human genome sequence. Model organisms have been
sequenced in both the plant and animal kingdoms. As we begin the new
millennium, the major goal of molecular biology is to obtain the complete
sequences of as many genomes as possible. A comparison of the genome sizes of
different organisms (Table 1) raises questions such as what types of genetic
modifications are responsible for the genome of the wheat plant being about
four times larger, and that of the rice plant about seven times smaller, than
that of humans. Mice and humans contain roughly the same number of genes:
about 28K protein coding regions. The chimp and human genomes vary by an
average of just 2%, i.e. just about 160 enzymes.

Table 1: Genome sizes of selected organisms (Mb = megabase)

Organism            Genome Size (Mb)
Escherichia coli    4.64
M. tuberculosis     4.4
H. influenzae       1.83
Homo sapiens        3300
Mouse               3000
Rice                430
Wheat               13500

Genome Sequencing:

Open Reading Frames (ORF):

The Genetic Code:

Comparative Genome Analysis:

Genome Annotation:

Genome Rearrangement:

Gene Prediction:

Genome Similarity:

Expressed Sequence Tags:

DNA Microarrays:
Chapter 14

Phylogenetic Analysis

—Saddam Hossain


14.1 Introduction of Phylogeny


Phylogenetics is the study of evolutionary or ancestral relationships among
organisms, groups of organisms, or genes. Molecular data such as DNA and
protein sequences can be used to reconstruct or infer these relationships.

Purpose of Phylogenetics: The principle of phylogenetics is to illustrate
or infer relationships among organisms. The purpose of phylogenetics can be
stated as

• Constructing or inferring evolutionary or ancestral ties between organisms.

• Estimating the time of divergence between organisms since they last shared
  a common ancestor.

Phylogeny: In the study of phylogenetics, the relationship is shown by a
branching structure which is called a phylogeny or tree of the history of life.

Phylogenetic Data Types There are two types of data that are available
and usually used in the phylogenetic analysis.


• Discrete Character Based Data (Categorical): Discrete data are
categorical data expressing qualitative information about morphological
or molecular status, for example type of beak, number of legs, or the states
of a column in aligned DNA or protein sequences. These data are neither
numerical nor continuous.
• Numerical Data: Some measurements are available or derived in the
process of phylogenetic analysis, for example the measured dissimilarity
between two sequences.

14.2 Concept of Evolution & Evolutionary Model


Species, populations and genes have been evolving since life first came into
existence. It is believed that in the course of evolution two populations,
species or genes sometimes become reproductively divergent through a
bifurcation process. The random mutation process is responsible for this
evolution and divergence.

Ancestor Sequence A(T/C)G

ATG ACG

Figure 14.1: DNA Sequence Evolution

Divergence consists of changes of characters; from the molecular perspective
this is a change of nucleotides in a DNA sequence or a change of amino acids in
a protein sequence. Over time, whether years or millions of years, this process
of bifurcation and divergence has happened repeatedly. Each population or
species may be related to the others through some bifurcation process, so
closely related populations may share a direct or indirect common ancestor
from which they bifurcated. "It should be possible to work backwards in time,
ascending the relationships (phylogenetic tree) of common ancestry, until a
common ancestor of all populations in the set is reached!" This is the concept
and motivation behind the exploration of evolutionary relationships
(phylogenetic analysis) in terms of a model called the Evolutionary Model. The
evolutionary model finds the ancestral relationships among populations through
phylogenetic analysis, or phylogenetics.

Some Necessary Concepts/Terminologies


Homology & Similarity: Similarity is the measure of resemblance or difference
between two or more sequences. It is a numerical figure, usually presented as
a percentage. Homology is a qualitative measure: two or more sequences are
homologous if they descend from a common ancestor in the history of evolution.
A similarity measure does not require any historical inference, but homology
involves a historical hypothesis. Similarity is quantifiable, whereas homology
is qualitative: sequences are either homologous or non-homologous.

A(T/C)G

ATG ACG

Sequence ATG and ACG are homologous as they share a common ancestor
A(T/C)G (hypothetical) and their similarity is 66% (percentage of nucleotide
similarity)

Figure 14.2: Homology & Similarity
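The percentage similarity quoted for ATG and ACG can be reproduced with a short function. This is a minimal sketch that assumes the two sequences are already aligned and of equal length; the name `percent_identity` is illustrative.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions at which two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Two of the three positions match, i.e. roughly 66%.
similarity = percent_identity("ATG", "ACG")
print(f"{similarity:.1f}%")
```

Note that the function only quantifies similarity; whether the two sequences are homologous remains a separate, historical hypothesis.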

Gene Duplication, Speciation & Gene Families: Gene duplication is the


process by which a chromosome or a portion of DNA is duplicated, resulting in
an additional copy of a gene. Gene duplication is also referred to as chromo-
somal duplication or gene amplification. Duplication, which means to double,
results in two identical genes. One or both of these genes may change over time
through mutations to create two new different genes.

Figure 14.3: Gene Duplication-Deletion & Speciation

Speciation is the evolutionary process of formation of new species by the
division or divergence of a single species into two or more species. As
speciation progresses, duplication and deletion may happen repeatedly and
independently in each species.

S0: CAGT

deletion(C) deletion(T)

AGT CAG

duplication(T) duplication(G)

S1:AGTT S2: CAGG

Species-0 (S0) diverged into Species-1 and Species-2 through speciation. In
the course of speciation, gene deletions and duplications occurred, finally
resulting in two different species, Species-1 and Species-2.

Figure 14.4: Gene Duplication-Deletion & Speciation Example

Gene families are composed of homologous genes that share a common an-
cestor. Each is the result of an evolution process involving gene duplication,
speciation and gene deletion.

14.3 Phylogenetic Tree


A phylogenetic tree is a graphical representation of evolutionary or ancestral
relationships among two or more genes or organisms in the form of a tree (a
graph in which there is exactly one path between any two nodes).

Description & Features of Phylogenetic Trees: Like other trees in graph
theory, a phylogenetic tree consists of nodes and branches. There are three
types of nodes. Leaf nodes are the outermost terminal nodes of the tree; they
typically represent the organisms or molecules that are being compared to
construct the phylogenetic tree, and are also known as Operational Taxonomic
Units (OTUs). Internal nodes usually represent an inferred common ancestor
that gave rise to two independent lineages at some point in the past. These
nodes are hypothetical and are usually termed inferred ancestors. The root
node is the base of the tree; it represents the last common ancestor of all
the organisms or molecules under comparison. It is not always possible to
determine the root correctly from molecular data alone. Doing so also needs a
good amount of related historical and physical data (e.g., fossil records) or
data about organisms outside the set represented in the tree.

Ancestor(Root)

Inf Anc1 Org3

branch

Org1 Org2 (Leaf)

Inf AncX: Inferred Ancestor, OrgX: Organism or Species or Sequence

Figure 14.5: Phylogenetic Tree Description

Connections between the nodes are called branches. Branches represent
evolutionary pathways and relationships among the ancestor, inferred ancestors
and OTUs. A historical time scale or other measures of evolutionary divergence
can also be represented by the branches, in which case the branch lengths are
significant and drawn to scale. Trees are usually binary, since it is
customary to present the evolution of species as a series of bifurcations,
although a tree may contain multifurcating ancestral nodes. A multifurcating
node can be interpreted in either of two ways: an ancestral population
simultaneously gave rise to three or more independent lineages, or two or more
bifurcations occurred at some point in the past but limitations in the
available data make it impossible to distinguish the order in which they
happened.

14.4 Types of Phylogenetic Trees


According to the structure and purpose of the phylogenetic tree, there are
several types of trees, discussed below.

Rooted & Unrooted Trees: A rooted phylogenetic tree presents an inference
about a common ancestor. It has a "base" node that is the ancestor of all the
organisms under consideration, so the direction of evolution and the pathways
from ancestor to organisms can be read from it. An unrooted tree, on the other
hand, only establishes relationships among OTUs and does not specify
evolutionary pathways. A root can be assigned to an unrooted tree by finding
an outgroup, a species that unambiguously separated much earlier from the
other species under consideration.

Primate

Gorilla

Chimpanzee Human

Rooted tree for Three Great Apes

Figure 14.6: Rooted Tree

Gorilla

Human Chimpanzee

Human, Chimpanzee and Gorilla have relations among them, but in the figure
no evolutionary pathways are defined

Figure 14.7: Unrooted Tree

The possible number of phylogenetic trees, both rooted and unrooted, grows
faster than exponentially with the number of OTUs, as the following equations
show. Although there is a staggering number of possible phylogenetic trees
even for a small set of data (species, organisms, sequences), only one of
these trees is the true phylogenetic tree. This is the real challenge.
Molecular data alone may never yield the correct phylogenetic tree; additional
morphological and historical data are needed to reach the most accurate tree
possible, which may be called the best-probable-inferred tree.

Number of possible rooted phylogenetic trees with n OTUs (terminal nodes):

\[ N_R = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!} \]    (14.1)

And number of possible unrooted phylogenetic trees with n OTUs (terminal
nodes):

\[ N_U = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!} \]    (14.2)

Number of    Number of Rooted Trees,          Number of Unrooted Trees,
OTUs, n      N_R                              N_U
2            1                                1
3            3                                1
4            15                               3
5            105                              15
10           34,459,425                       2,027,025
15           213,458,046,676,875              7,905,853,580,625
20           8,200,794,532,637,891,559,375    221,643,095,476,699,771,875

Table 14.1: Number of Possible Rooted & Unrooted Phylogenetic
Trees for Different Numbers of OTUs
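Equations (14.1) and (14.2) are easy to evaluate exactly with integer arithmetic, which is a convenient way to check the entries of Table 14.1. The sketch below assumes only the two formulas as given; the function names are illustrative.

```python
from math import factorial


def rooted_tree_count(n):
    """Equation (14.1): rooted binary trees on n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_tree_count(n):
    """Equation (14.2): unrooted binary trees on n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("equation (14.2) needs at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 10, 15, 20):
    print(f"{n:2d}  {rooted_tree_count(n):,}  {unrooted_tree_count(n):,}")
```

The two counts are linked: the number of unrooted trees on n OTUs equals the number of rooted trees on n − 1 OTUs, because a root can be placed on any branch of an unrooted tree.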

14.5 Approaches in Phylogenetic Analysis


The very first step of any kind of phylogenetic analysis is to construct the
optimal or suboptimal phylogenetic tree. In a broad sense, there are two main
categories of approaches for building phylogenetic trees: Phenetic (or
Clustering) and Cladistic. There are also other, more classical approaches,
the evolutionary systematic approaches.

14.5.1 Phenetic(or Clustering) Approach


This approach builds the phylogenetic tree solely on the basis of overall
resemblance (similarities or dissimilarities) among the species or taxa under
consideration. All characters (from the molecular data) may be considered, but
the approach makes no reference to any historical model of the relationships.
The computational methods used in this approach are mainly distance-based
methods, which proceed by measuring a set of distances between species and
then construct the phylogenetic tree by a hierarchical clustering procedure.
This is why the approach is also called Clustering.

14.5.2 Cladistic Approach


The cladistic approach considers the possible evolutionary pathways during
tree construction. It is based on conserved characters: species are grouped
together only with those that share the conserved characters, which places
them under a common ancestor. The approach infers the features of the ancestor
at each node and chooses an optimal tree according to some evolutionary model.
Maximum Parsimony and Maximum Likelihood based computational methods are used
in this approach. In a sentence, the cladistic approach is based on genealogy
whereas the phenetic approach is based on similarity. The basic assumption in
this approach is that changes in characteristics occur within lineages over
time, through bifurcation at the ancestor nodes.

14.5.3 Evolutionary Systematic Approaches


These are the earliest approaches, dating from when molecular data had not yet
been explored. Morphological, physiological and paleontological data are used
to build or predict the phylogenetic trees.

14.6 Methods for Phylogenetic Tree-Construction


14.6.1 Distance-based Methods
14.6.1.1 Unweighted Pair Group Method with Arithmetic Mean
(UPGMA)
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is arguably the most
popular and simplest distance-based hierarchical clustering algorithm for
building phylogenetic trees. UPGMA builds a tree under the assumption of a
molecular clock.

It starts with the most similar pair of OTUs and builds a composite OTU from
these two. From the new group of OTUs, the pair with the highest similarity is
again picked and combined into a single OTU. The tree is built alongside: the
initial OTUs are the leaf nodes, and every composite OTU becomes the
intermediate ancestral node of the chosen pair and is then treated as a new
OTU. The process continues until only two OTUs remain, because the final two
OTUs are the first descendants of the root ancestor.

Measure of Similarity & Distance Matrix: The simplest measure of molecular
similarity between a pair of sequences is "distance": the number of positions
at which the two sequences differ when they are aligned (pair-wise alignment).
The pair-wise distances between all pairs of sequences in a group (aligned by
multiple sequence alignment) are presented in a matrix format called the
Distance Matrix.

Seq1   G T A G G A T
         |   |
Seq2   G A A A G A T

The distance between Seq1 and Seq2 is 2

Figure 14.8: Distance between Two Sequences

Sequences (Multiple Sequence Alignment):

A - GCTTGTCCGTTACGAT
B - ACTTGTCTGTTACGAT
C - ACTTGTCCGAAACGAT

      A    B    C
A     −    2    3
B          −    3
C               −

The Distance Matrix for Sequences A, B, C

Figure 14.9: Distance Matrix
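The mismatch-count distance described above can be computed directly from an alignment. The following is a minimal sketch, assuming gap-free aligned sequences of equal length; the function names and the dictionary layout are illustrative choices. It counts mismatches for the three sequences A, B and C printed above.

```python
from itertools import combinations


def hamming_distance(seq_a, seq_b):
    """Number of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def distance_matrix(aligned):
    """Map every unordered pair of names to its pair-wise distance."""
    return {
        (p, q): hamming_distance(aligned[p], aligned[q])
        for p, q in combinations(aligned, 2)
    }


alignment = {
    "A": "GCTTGTCCGTTACGAT",
    "B": "ACTTGTCTGTTACGAT",
    "C": "ACTTGTCCGAAACGAT",
}
matrix = distance_matrix(alignment)
for (p, q), d in matrix.items():
    print(p, q, d)
```

Only the upper triangle is stored, since the distance is symmetric and the distance of a sequence to itself is zero.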

UPGMA Algorithm: This algorithm starts with a group of sequences (xi), known
as OTUs, and an initial distance matrix d. At first every OTU is assigned to
its own cluster, representing an unconnected leaf of the final tree. The
algorithm then finds the closest pair of clusters, builds a new cluster (an
internal ancestral node of the final tree) by combining this pair, and removes
the two original clusters. This iteration continues until only two clusters
are left; joining these final two at the root node connects all the clusters
into a single tree. This is the principle of the hierarchical clustering
algorithm. A simulation of the UPGMA algorithm given below is shown on the
following page.

The UPGMA clustering method is very sensitive to unequal evolutionary rates,
because it assumes that the evolutionary rate is the same on all branches.
UPGMA also requires the distances between all pairs of data points to be held
in memory. Because of this prohibitive memory requirement, UPGMA does not
scale to very large datasets.

Figure 14.10: Test



Algorithm 2 UPGMA
Initialization:
    Assign each xi to its own cluster Ci
    Define one leaf per sequence, at height 0
Iteration:
    Find the two clusters Ci, Cj such that dij is minimum
    Let Ck = Ci ∪ Cj
    Set dkl to the average distance between the members of Ck and Cl,
        for every other cluster Cl
    Define a node k connecting Ci and Cj, at height dij/2
    Delete Ci, Cj
Termination:
    When all sequences belong to one cluster
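The pseudocode can be turned into a small runnable sketch. The version below is an illustration under stated assumptions, not a reference implementation: clusters are represented as nested tuples, distances are stored in a dictionary keyed by `frozenset` pairs, and the distance from a merged cluster to any other cluster is the size-weighted average of the two old distances (which equals the mean over all leaf pairs). The four OTUs P, Q, R, S and their distances are hypothetical toy data.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` by UPGMA.

    `dist` maps frozenset({a, b}) to the distance between labels a and b.
    Returns (tree, root_height); the tree is a nested tuple of labels.
    """
    clusters = {label: (1, 0.0) for label in labels}  # tree -> (size, height)
    d = dict(dist)  # working copy, extended as clusters are merged
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        ci, cj = min(combinations(clusters, 2),
                     key=lambda pair: d[frozenset(pair)])
        dij = d[frozenset((ci, cj))]
        ni, nj = clusters[ci][0], clusters[cj][0]
        ck = (ci, cj)
        # Average-linkage update of the distances to every other cluster.
        for cl in clusters:
            if cl != ci and cl != cj:
                d[frozenset((ck, cl))] = (
                    ni * d[frozenset((ci, cl))] + nj * d[frozenset((cj, cl))]
                ) / (ni + nj)
        del clusters[ci], clusters[cj]
        clusters[ck] = (ni + nj, dij / 2)  # new node sits at height dij / 2
    tree, (_, height) = next(iter(clusters.items()))
    return tree, height


# Hypothetical toy distances between four OTUs.
toy = {
    frozenset(("P", "Q")): 2.0,
    frozenset(("P", "R")): 6.0,
    frozenset(("Q", "R")): 6.0,
    frozenset(("P", "S")): 10.0,
    frozenset(("Q", "S")): 10.0,
    frozenset(("R", "S")): 10.0,
}
tree, root_height = upgma(["P", "Q", "R", "S"], toy)
print(tree, root_height)
```

For this toy matrix P and Q are joined first, then R, and finally S. The quadratic dictionary of distances also makes the memory cost of the method visible.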

14.6.1.2 Neighbor Joining Algorithm(NJ)


14.6.1.3 Fitch-Margoliash (FM) Method
In 1967, Fitch and Margoliash proposed an algorithm for fitting trees to
distance matrices. The method seeks the least-squares fit of all observed
pair-wise distances to the distances expected from a tree. Put simply, the
goal of this method is to position neighbours correctly and to calculate
branch lengths that reflect the original data. The detailed algorithm and its
classification will be discussed in the "Computational Approaches" chapter;
the steps of the method are presented below at a high level. The FM method
performs best in the group of distance-based methods, but it works much more
slowly than the Neighbor Joining (NJ) algorithm, which generally yields a tree
very close to that of the FM method.

Steps for the Fitch-Margoliash (FM) Method: The Fitch-Margoliash method
starts by choosing the two closest OTUs as terminal nodes (taxa) and then
creating a third, hypothetical OTU that is essentially "all the remaining
nodes (taxa)". This process continues until the tree is built.

• Find the most closely related pair of sequences (taxa/OTUs) using the
Distance Matrix (call them A and B), and cluster the rest of the sequences as
a single node, X.

• Calculate the average distance from A to all other sequences (of cluster
X), and from B to all other sequences (of cluster X).

• Adjust the position of the common ancestor node for A and B, so that
the difference between the averages is equal to the difference between the
A and B branch lengths, while the sum of the branch lengths is distance
between A and B( d(A, B)).

• Repeat as necessary until X has only one OTU in it.
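One iteration of the steps above amounts to solving three equations for three branch lengths: a + b = d(A, B), a + x = d(A, X) and b + x = d(B, X), where d(A, X) and d(B, X) are the average distances to cluster X. The sketch below shows this single step; the function names and the five-taxon distances are hypothetical and serve only as an illustration.

```python
def mean_distance(taxon, others, dist):
    """Average distance from `taxon` to every member of `others`."""
    return sum(dist[frozenset((taxon, o))] for o in others) / len(others)


def fm_branch_lengths(d_ab, d_ax, d_bx):
    """Solve a + b = d_ab, a + x = d_ax, b + x = d_bx for (a, b, x)."""
    a = (d_ab + d_ax - d_bx) / 2
    b = (d_ab + d_bx - d_ax) / 2
    x = (d_ax + d_bx - d_ab) / 2
    return a, b, x


# Hypothetical distances; A and B are the closest pair, X = {C, D, E}.
dist = {
    frozenset(("A", "B")): 22.0,
    frozenset(("A", "C")): 39.0,
    frozenset(("A", "D")): 39.0,
    frozenset(("A", "E")): 41.0,
    frozenset(("B", "C")): 41.0,
    frozenset(("B", "D")): 41.0,
    frozenset(("B", "E")): 43.0,
}
rest = ["C", "D", "E"]
d_ax = mean_distance("A", rest, dist)
d_bx = mean_distance("B", rest, dist)
a, b, x = fm_branch_lengths(dist[frozenset(("A", "B"))], d_ax, d_bx)
print(a, b, x)
```

The solution satisfies both conditions stated in the third step: the branch lengths of A and B sum to d(A, B), and their difference equals the difference between the two averages.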



14.6.1.4 Minimum Evolution (ME) Method

14.6.2 Character-based Method


14.6.2.1 Maximum Parsimony (MP) Method
Maximum Parsimony is perhaps the most popular character-based cladistic
method. This method predicts the phylogenetic tree that requires the minimum
number of character changes (mutations) to explain the observed variation in
the sequences. The maximum parsimony method thus constructs the most
parsimonious tree, in terms of mutation (change) score, among all the possible
choices. The example below illustrates the Maximum Parsimony method in the
simplest way.

Consider four sequences:


ATCG
ACCG
ATCG
ACCG

A(T/C)CG A(T/C)CG

[T → C]

ATCG ACCG A(T/C)CG A(T/C)CG


[T → C] [T → C]
ATCG ATCG ACCG ACCG ATCG ACCG ATCG ACCG
Of the two trees above, the left one requires only one mutation [T → C]
whereas the right one requires two mutations. Maximum Parsimony will choose
the left one as the optimal tree.

Figure 14.11: Choice of Maximum Parsimony

The concept of maximum parsimony rests on two assumptions. One is that
mutations are exceedingly rare events in the evolutionary pathways. The other
is that the more unlikely events a model invokes, the less likely the model is
to be correct. As a result, the relationship that requires the fewest
mutations to explain the current state of the sequences under consideration is
the relationship most likely to be correct. The maximum parsimony model for
phylogenetic trees was postulated to embody this conservative principle of
minimum evolution.

The preliminary step of the maximum parsimony method is to align the sequences
using a multiple sequence alignment method. Every position of the alignment
(aligned column) is called a site. For each aligned site, the phylogenetic
trees that require the smallest number of evolutionary changes to produce the
observed sequence changes are identified. This analysis is continued for every
site of the alignment. Finally, the trees that produce the smallest number of
changes overall, across all sites, are identified as the maximum parsimony
trees.

—** Figure-

Fitch's Algorithm (W. Fitch, 1971) is a widely used method for scoring
phylogenetic trees in the maximum parsimony method. As the possible number of
trees is very large even for a small set of sequences, enumerating all the
trees, scoring them and picking the most parsimonious one is impractical. That
is why search strategies such as the Branch-and-Bound method and the
Nearest-Neighbour Interchange method are used to narrow down the search space
and find an optimal or suboptimal tree rather than searching exhaustively for
the exact solution.
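The parsimony score of a given tree can be computed site by site with Fitch's set-based rule: at each internal node take the intersection of the children's state sets if it is non-empty, otherwise take the union and count one change. The sketch below is a minimal illustration that assumes rooted binary trees written as nested tuples and unweighted changes; the leaf names s1 to s4 are hypothetical labels for the four sequences of the example in Figure 14.11.

```python
def fitch_site(tree, states):
    """Return (state_set, changes) for one alignment column.

    `tree` is a leaf name or a 2-tuple of subtrees; `states` maps each
    leaf name to its character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint state sets: one mutation is needed on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Total number of changes over all sites of the alignment."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(length)
    )


sequences = {"s1": "ATCG", "s2": "ACCG", "s3": "ATCG", "s4": "ACCG"}
left_tree = (("s1", "s3"), ("s2", "s4"))   # groups identical sequences
right_tree = (("s1", "s2"), ("s3", "s4"))  # groups different sequences
print(parsimony_score(left_tree, sequences),
      parsimony_score(right_tree, sequences))
```

The left tree needs one change and the right tree two, so the left tree is the more parsimonious. A full parsimony search would call such a scoring function for each candidate tree visited.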

Weighted & Unweighted Parsimony: The primary assumption of the parsimony
approach is that "mutations are rare". Assigning an equal probability of
mutation to all sequences and events is, however, a very simplistic way of
thinking. Parsimony is called Unweighted Parsimony if mutations are assumed to
be equally probable in all sequences and events. In reality, insertions and
deletions are less likely than the exchange of one nucleotide for another,
long insertions and deletions are less common than shorter ones, some
substitutions are more likely than others, and mutations with functional
consequences are less likely than inconsequential mutations. Taking these into
account, a relative likelihood can be assigned to each of these kinds of
mutation; this model is called weighted parsimony.
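One common way to score a tree under weighted parsimony is a dynamic programme over the tree (often attributed to Sankoff) that keeps, for every node and every possible state, the minimum cost of the subtree below it. The sketch below is only an illustration: the cost scheme (transitions cost 1, transversions cost 2) is an assumption chosen for the example, not a value taken from the text, and insertions and deletions are not modelled.

```python
INF = float("inf")
ALPHABET = "ACGT"
PURINES = {"A", "G"}


def substitution_cost(x, y, transition=1.0, transversion=2.0):
    """Illustrative weights: purine<->purine or pyrimidine<->pyrimidine
    changes (transitions) are cheaper than transversions."""
    if x == y:
        return 0.0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def sankoff_site(tree, states, cost=substitution_cost):
    """Return {state: minimum cost of the subtree if its root has `state`}."""
    if isinstance(tree, str):
        return {s: (0.0 if s == states[tree] else INF) for s in ALPHABET}
    tables = [sankoff_site(child, states, cost) for child in tree]
    return {
        s: sum(min(cost(s, t) + table[t] for t in ALPHABET) for table in tables)
        for s in ALPHABET
    }


def weighted_score(tree, states, cost=substitution_cost):
    """Minimum weighted cost of one site on the given tree."""
    return min(sankoff_site(tree, states, cost).values())


site = {"s1": "T", "s2": "C", "s3": "T", "s4": "C"}
print(weighted_score((("s1", "s3"), ("s2", "s4")), site))
print(weighted_score(("s1", "s2"), {"s1": "A", "s2": "C"}))
```

With all costs set to 1 the procedure reduces to unweighted parsimony; other cost schemes can be supplied through the `cost` parameter.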

14.6.2.2 Maximum Likelihood (ML) Method

Maximum Likelihood (ML) is the most established and generalized
character-based method for the inference of phylogeny. This method
reconstructs a phylogeny using an explicit model of evolution. The maximum
likelihood method assigns quantitative probabilities (from a substitution rate
matrix) to mutational events, rather than merely counting the mutations. Like
maximum parsimony, maximum likelihood builds the tree, but it assigns branch
lengths based on the probabilities of the postulated mutational events. The
method searches for the tree with the highest probability, or likelihood.
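The idea of assigning probabilities to mutational events can be illustrated on the smallest possible case: two aligned sequences and a single branch length. The sketch below assumes the Jukes-Cantor (JC69) substitution model, which is an assumption made here for illustration and is not named in the text. Under that model the likelihood depends only on the number of sites and the number of differing sites, and the maximizing branch length has a closed form.

```python
import math


def jc69_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of branch length t (> 0, substitutions per site)
    for two sequences under the Jukes-Cantor model (assumed here)."""
    if t <= 0:
        raise ValueError("t must be positive")
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e    # P(base is unchanged after t)
    p_other = 0.25 - 0.25 * e   # P(base became one specific other base)
    # Each site contributes pi_x * P(x -> y | t), with pi_x = 1/4.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_other))


def jc69_ml_distance(n_sites, n_diff):
    """Closed-form maximum likelihood estimate of t under JC69."""
    p = n_diff / n_sites
    if p >= 0.75:
        raise ValueError("sequences are too divergent for the JC69 estimate")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


t_hat = jc69_ml_distance(n_sites=16, n_diff=2)
print(t_hat, jc69_log_likelihood(t_hat, 16, 2))
```

A full maximum likelihood analysis applies the same principle to every branch of every candidate tree, which is why the method is computationally intensive.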

Maximum likelihood methods have some advantages over other methods. They show
lower variance than other methods, and they are robust and statistically well
founded. The method works well for distantly related sequences, even when the
molecular clock assumption does not hold, and can incorporate any desired
evolutionary model. Overall this method is the most flexible and gives good
results under a good evolutionary model. Its main disadvantages are that it
gives a bad approximation under a bad evolutionary model and that it is
computationally intensive.

14.7 Phylogenetic Analysis Tools


• PHYLIP
• FITCH
• NEIGHBOR
• FACTOR

• DRAWGRAM/DRAWTREE
• CONSENSE
• etc...
Chapter 15

Protein Folding

—Saddam Hossain


15.1 Proteins

Proteins are the basis of cellular and molecular life. Proteins play a crucial
role in virtually all biological processes, with a broad range of functions.
Amino acids (aa) are the building blocks of proteins. There are 20 natural
amino acids (ACDEFGHIKLMNPQRSTVWY). A protein is a linear chain of these amino
acids joined by peptide bonds.

The activity of an enzyme or the function of a protein is governed by the three-


dimensional structure. The amino acid side chains (R) determine the structure
of the protein.

Virtually all soluble proteins feature a hydrophobic core surrounded by a hy-


drophilic surface. But, peptide backbone is inherently polar.


15.2 Protein Classification


15.3 Protein Folding
A protein folds into a unique 3D structure under the corresponding
physiological conditions. Protein folding is the overall process by which a
protein attains its structure from its sequence. In reality, a protein is a
polypeptide folded into an active shape.

Different sequence ≠ Different structure

There are many protein folding algorithms. Brute-force algorithms are
impractical, since the underlying search problem is NP-complete. The practical
algorithms are polynomial-time approximation algorithms that come close to the
true result with high probability. And these are not stochastic.

15.4 Protein Structure


Though we think of protein structures as static entities, they are dynamic.
It is very important to know the structure of a protein because the functional
model of a protein depends on its 3D structure. In the modern era the most
important application of this structural information is drug design. Proteins
have very complex and compact 3D shapes or structures. There are four major
levels of protein structure: primary, secondary, tertiary and quaternary.

15.4.1 Primary Structure


The primary structure is the simple linear sequence of covalently bound amino
acids or simply the sequence of amino acids.

15.4.2 Secondary Structure


Secondary structure of a protein is the folding or coiling of its polypeptide
chains. The most commonly observed conformations in secondary structures are
α-Helix, β-Sheets/Strands, and Loops/Coils/Turns. The structural type is
usually given from the dihedral angles along 3 residues. Stable and well
defined secondary structure segments strongly influence the chain's folding.

15.4.2.1 α-helix
This structure repeats itself every 5.4 Angstroms along the helix axis. Every
main chain CO and NH group is hydrogen bonded to a peptide bond 4 residues
away. More precisely, there is one α-helix turn per 3.6 residues. This
structure is mainly found on protein surfaces.

Note: proline and glycine are generally not favoured in α-helices.
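The two figures quoted above (a repeat of 5.4 Angstroms and 3.6 residues per turn) fix the rise of the helix per residue, and hence the approximate length of an ideal helix. The sketch below only carries out this arithmetic; the function names are illustrative and real helices deviate from the ideal geometry.

```python
PITCH_ANGSTROM = 5.4      # repeat distance along the helix axis (from the text)
RESIDUES_PER_TURN = 3.6   # residues per helical turn (from the text)


def rise_per_residue():
    """Advance along the helix axis contributed by one residue."""
    return PITCH_ANGSTROM / RESIDUES_PER_TURN


def helix_length(n_residues):
    """Approximate axial length, in Angstroms, of an ideal alpha-helix."""
    return n_residues * rise_per_residue()


print(rise_per_residue(), helix_length(18))
```

An ideal 18-residue helix therefore makes five full turns along its axis.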

15.4.2.2 β-sheets

A β-sheet structure is formed when two or more polypeptide chains run
alongside each other and are linked by hydrogen bonds. The sheet can be
parallel, antiparallel or mixed, depending on the directions of the
participating peptide chains.

Alternating side-chains, No mixing, Loops often have polar amino acids.

15.4.3 Tertiary Structure


A folded, structured protein is composed of a combination of secondary
structures. A protein's tertiary structure describes how the protein molecule
folds in 3D space with all its secondary and primary structures.

15.4.4 Quaternary Structure


The quaternary structure of a protein describes the complex configuration of
a protein that is interacting with other molecules in 3D space.

15.5 Experimental Techniques for Structure Determination
There are a few traditional experimental methods; among them, the most widely
used methods for solving protein structures are X-ray Crystallography, Nuclear
Magnetic Resonance spectroscopy (NMR), and Electron Microscopy/Diffraction.
The major drawbacks of these methods are that they are expensive and slow.
They can generate only a few structures per day worldwide, and this throughput
obviously cannot keep pace with the new protein sequences that are being
explored and published every day.

15.5.1 X-ray Crystallography


• From small molecules to viruses

• Information about the positions of individual atoms

• Limited information about dynamics

• Requires crystals

15.5.2 Nuclear Magnetic Resonance spectroscopy (NMR)


• Limited to molecules up to 50kDa (good quality up to 30 kDa)
• Distances between pairs of hydrogen atoms
• Lots of information about dynamics
• Requires soluble, non-aggregating material
• Assignment problem

15.5.3 Electron Microscopy/Diffraction


• Low to medium resolution
• Limited information about dynamics
• Can use very small crystals (nm range)
• Can be used for very large molecules and complexes

15.5.4 Free electron lasers

15.6 Protein Structure Classification


Class, four types
• Mainly α
• α/β structures
• Mainly β
• No secondary structure

Architecture (fold)

Topology (superfamily)

Homology (family)

15.6.1 Two types of algorithms


• Inter-Molecular, 3D, Rigid Body ; structural alignment in a common co-
ordinate system (hard) e.g. VAST, LOCK.. alg.
• Intra-Molecular, 2D, Internal Geometry ; structural alignment using in-
ternal distances and angles e.g. DALI, STRUCTURAL, SSAP.. alg.
Based on this similarity score and some specified gap penalty, dynamic pro-
gramming is used to find the optimal structural alignment

15.7 Protein Structure Prediction


Certain level of function can be found without structure. But a structure is a
key to understand the detailed mechanism. A predicted structure is a powerful
tool for function inference.

A schematic view of how to proceed from the sequence to a model of the protein
is presented below. Prediction of structure relies heavily on different
alignment methods. Whether a reliable alignment with a known structure can be
obtained determines which methods are to be used.

—****Fig-Protein Structure Prediction Flow****—

The end-to-end process of protein structure prediction from the primary
structure (amino acid sequence) can be divided into several stages. Though the
initial attempts were to predict a 2D structure for a protein, the 3D
structure is now the final requirement.

Most of the prediction methods are based on primary sequence only with accu-
racy 64% -75%. The prediction accuracy is higher for α-helices than β-strands.
Accuracy is dependent on protein family and predictions of engineered proteins
are less accurate.

15.7.1 Stages of Protein Structure Prediction


Stage 1: Backbone Prediction Methods for backbone prediction
• Ab initio folding
• Homology modeling
• Protein threading

Stage 2: Loop Modeling

Stage 3: Side-Chain Packing Side-Chain Packing Problem can be de-


scribed as - given the backbone coordinates of a protein, predict the coordinates
of the side-chain atoms.

Method: decompose a protein structure into very small blocks

Bottom-to-Top:
Calculate the minimal energy function

Top-to-Bottom:
Extract the optimal assignment

Time complexity:
Exponential in tree width, linear in graph size

Stage 4: Structure Refinement Why is structure prediction and especially


ab initio calculations hard..?

• Many degrees of freedom / residue

• Remote noncovalent interactions

• Nature does not go through all conformations

• Folding assisted by enzymes & chaperones

Some Concepts:

• Class : Secondary structure content

• Fold : Major structural similarity.

• Superfamily : Probable common evolutionary origin.

• Family : Clear evolutionary relationship.

Search sequence data banks for homologs, Search methods e.g. BLAST,
PSIBLAST, FASTA, Homologue in PDB..

Multiple sequence / structure alignment

• Contains more information than a single sequence for applications like


homology modeling and secondary structure prediction

• Gives location of conserved parts and residues likely to be buried in the


protein core or exposed to solvent

15.8 Secondary & Tertiary Structure Prediction Methods
The entire information for forming secondary structure is contained in the
primary sequence of the corresponding protein, and the side groups of the
residues determine the structure. There are eight types of secondary
structure: α-helix (H), residue in isolated β-bridge (B), extended strand
participating in β-ladder (E), 3-helix (3/10 helix) (G), 5-helix (π-helix)
(I), hydrogen bonded turn (T), bend (S), and random coil (C). These types can
be grouped into three major categories: (i) α-helix (H), consisting of H and
G; (ii) β-structures (E), consisting of B and E; and (iii) coils (C),
encompassing I, T, S and C. The consequent tertiary structure is the final
conformational folding of the protein in 3D space. The primary step of
tertiary structure prediction is secondary structure prediction. Protein
secondary structure prediction methods predict these structures from the
primary structure (roughly, from the amino acid sequence). Nowadays, the
available prediction methods have reached an average accuracy of more than
70%.
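The reduction from eight states to three, and the per-residue accuracy used to compare predictions, can be expressed in a few lines. The mapping below follows the grouping stated in the text (other reductions are also in use); the function names, the example strings and the accuracy measure as written here are illustrative assumptions.

```python
# Grouping exactly as stated in the text above.
EIGHT_TO_THREE = {
    "H": "H", "G": "H",                      # helix
    "B": "E", "E": "E",                      # beta structures
    "I": "C", "T": "C", "S": "C", "C": "C",  # coil
}


def reduce_states(eight_state):
    """Convert a string of eight-state codes into three-state codes."""
    try:
        return "".join(EIGHT_TO_THREE[s] for s in eight_state)
    except KeyError as err:
        raise ValueError(f"unknown secondary-structure code: {err.args[0]!r}") from None


def per_residue_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)


reduced = reduce_states("CHHHHGGTTEEEEBSC")  # hypothetical assignment
print(reduced)
print(per_residue_accuracy("HHEC", "HHCC"))
```

An accuracy figure such as "more than 70%" refers to this kind of per-residue agreement between the predicted and the experimentally derived three-state strings.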

There are two main computational alternatives beyond the experimental meth-
ods to determine or predict the secondary structures of proteins. The first
alternative is ab-initio methods and second is approximate or heuristic meth-
ods. And again the widely used approximate or heuristic methods for secondary
protein structure prediction can be categorized as Statistical Methods, Nearest
Neighbor Approach, Neural Networks Approach, Hidden Markov Model, and Sup-
port Vector Machine based methods.

Usually examining windows of 13 - 17 residues is sufficient to predict
structure.
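Window-based predictors look at each residue together with its neighbours on both sides. The sketch below shows one way to extract such windows; the padding character for the chain ends, the default width of 15 and the example sequence are assumptions made for illustration.

```python
def residue_windows(sequence, width=15, pad="-"):
    """Yield (central_residue, window) for every position in `sequence`.

    Windows that overhang either end of the chain are filled with `pad`.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a central residue")
    half = width // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + width]


# Hypothetical 10-residue sequence, with a short window for readability.
windows = list(residue_windows("MKTAYIAKQR", width=5))
for residue, window in windows:
    print(residue, window)
```

Each window would then be passed to a statistical or machine-learning predictor that assigns a secondary structure state to the central residue.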

15.8.1 Ab-initio Method


The folded structure of a protein is governed by energy functions. These
energy functions describe the protein by calculating its free energy, to which
Bond Energy, Bond Angle Energy, Dihedral Angle Energy, Van der Waals Energy,
Electrostatic Energy, etc. are the main contributors. Correctly folded
proteins have only marginally less free energy than misfolded proteins. The
ab-initio method minimizes these energy functions to obtain the structure.
Simulated Annealing is one of the algorithms used in the ab-initio method. The
method is computationally very expensive and its accuracy is still poor,
although it performs well for smaller problems. Since the ab-initio method is
not yet practical on its own, it is usually used to refine models suggested by
other algorithms or methods.
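Simulated annealing can be sketched independently of any real force field. The example below minimizes a toy one-dimensional "energy" with several local minima, using a Metropolis-style acceptance rule and a geometric cooling schedule. Everything specific in it (the energy function, the move size, the temperatures, the seed) is an assumption chosen for illustration; a real ab-initio method would substitute a molecular energy function and conformational moves.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Always accept downhill moves; accept uphill moves with
    probability exp(-delta_e / temperature)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def simulated_annealing(energy, propose, start,
                        t_start=5.0, t_end=0.01, steps=2000, seed=0):
    """Return the lowest-energy state seen, and its energy."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(x):
    """Toy landscape with several local minima (not a protein energy)."""
    return 0.1 * x * x + math.sin(3.0 * x)


def toy_move(x, rng):
    """Small random perturbation of the current state."""
    return x + rng.uniform(-0.5, 0.5)


best_x, best_e = simulated_annealing(toy_energy, toy_move, start=4.0)
print(best_x, best_e)
```

The high initial temperature lets the search escape local minima, and the gradual cooling makes it settle into a low-energy state, which mirrors the search for a minimum free energy conformation.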

The ab-initio methods determine protein structure based on sequence data
(primary structure) and the physics of molecular dynamics. The physics of
molecular dynamics consists of Newtonian physics, atomic-level forces, bond
lengths, bond angles, torsion (dihedral) angles, and equations for calculating
the energy of the most stable (minimum free energy) conformation or structure.
These functions can also depend on the amino acid sequence, temperature,
pressure, pH and other local conditions. The angle functions also depend on
the types of atoms involved and the number of free electrons available for
bonding.

The ab-initio methods start with sequence data, the primary protein structure.
They then construct a reasonable secondary structure by using bond lengths,
bond angles and torsion angles. In the next step, a library of tertiary
structures is populated by generating all possible candidate tertiary
structures using molecular dynamics and Monte-Carlo methods. Monte-Carlo
methods identify conformational combinations with the lowest free energy. After
that, from this library of 3D structure candidates, the best possible
structures are filtered using the Metropolis algorithm, which identifies the
most stable molecular conformations. This method is based on the assumption
that the native conformation of a protein is the conformation with the lowest
free energy. When the top-most candidates are selected, they are visualized and
validated against the corresponding structures determined by experimental
methods like NMR or X-ray crystallography. The protein structures are compared
using the Root Mean Squared Deviation (RMSD) measure.
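The RMSD comparison can be sketched as follows. The sketch assumes the two structures list the same atoms in the same order and have already been superimposed; real comparisons usually perform an optimal superposition first (for example with the Kabsch algorithm), which is left out here. The coordinates are made-up values for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates in Angstrom: only the third atom differs, by 3 A.
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 3.0)]
print(rmsd(predicted, experimental))   # sqrt(9 / 3), about 1.73
```

A smaller RMSD means the predicted candidate lies closer to the experimentally determined structure.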

15.8.2 Statistical Method (old fashioned)


The Chou & Fasman Method: This method was developed by Chou & Fasman in the
mid 1970s (1974) and is based on the frequencies of residues in α-helices (H),
β-sheets (E) and turns. It has an accuracy level between 50 - 60% and was the
first widely used procedure.

Improved Chou-Fasman: It assigns to every residue the appropriate set of
parameters to identify α-helix and β-sheet regions. If structures overlap, it
compares the average values of P(H) and P(E) and assigns the secondary
structure with the best score. Turns are modeled as tetrapeptides using 2
different probability values.

GOR Method: Garnier, Osguthorpe & Robson proposed the GOR method. It assumes
that amino acids up to 8 residues on each side influence the secondary
structure (SS) of the central residue. It predicts correctly for up to 64% of
residues.

15.8.3 Nearest Neighbor Approach


15.8.4 Neural Network Approach
The Neural Network (NN) is one of the most efficient machine learning
techniques in the analysis of biological sequences and structure prediction.
The strength of an NN is that no rules about the problem being studied need to
be incorporated in the model. The network can extract the rule (the relation
between input and output) from a set of representative sequences. The network
is trained using sequence patterns/profiles whose structure is known. The query
sequence is then given as input and its output value is calculated by the NN.
For a pattern similar to the training set, the network recalls the correct
output. For a pattern not seen before, the network attempts to generalize and
provides an approximate result.

Normally there are three output nodes in the NN model for secondary structure
prediction, each representing one class of secondary structure. Recently,
hybrid neural network models have been used to predict all three types of
secondary structure together. Most NN models work on fragment libraries, which
consist of protein sequence fragments of known structures, and the model makes
its predictions using the knowledge from those fragment databases/libraries.

A Hybrid Fuzzy Neural Network can also be used for protein secondary structure
prediction.

Input: a number of protein sequences + secondary structure.

Output: a trained network that predicts secondary structure elements.

Selection of training sets is extremely important for the quality of prediction.


A well trained NN can predict secondary structure with 70-75% accuracy.
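The input and output sides of such a network can be sketched without committing to any particular NN library. The sketch below assumes the common one-hot encoding of a residue window (20 inputs per position) and three output nodes labelled H, E and C (helix, strand, coil); the label C and the function names are illustrative assumptions, not taken from a specific predictor.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = "HEC"  # one output node per secondary structure class


def one_hot_window(window):
    """Encode a residue window as a flat 0/1 vector (20 inputs per position)."""
    vector = []
    for residue in window:
        column = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:              # padding or unknown residues stay all-zero
            column[index] = 1
        vector.extend(column)
    return vector


def predicted_class(output_activations):
    """Pick the class whose output node has the largest activation."""
    best = max(range(len(CLASSES)), key=lambda i: output_activations[i])
    return CLASSES[best]


inputs = one_hot_window("-MKTA")            # 5 positions x 20 = 100 inputs
print(len(inputs), sum(inputs))
print(predicted_class([0.1, 0.7, 0.2]))     # hypothetical activations
```

The trained network itself would sit between these two functions, mapping the encoded window to the three output activations.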

15.8.5 Hidden Markov Model


15.8.6 Support Vector Machine based methods

15.9 Performance of Structure Prediction Approaches

The methods and algorithms used in protein secondary structure prediction fall
into four major classes. These are i) ab-initio methods, ii) Fold Recognition
based methods, iii) methods that recognise commonalities in folds, and iv)
Homology based methods. To measure and compare the performance of all these
methods on the same platform and scale, we need to establish some key
performance indicators. Prediction Accuracy is one of the indicators, and it
can be defined as below.

The Q3 test:

$$Q_3 = \frac{\text{correctly predicted residues}}{\text{number of residues}} \tag{15.1}$$

Some other indicators are the required Sequence Similarity and Template
Coverage for the input (Primary Structure) to the model. Resolution Accuracy
is the measure of the level of accurate structural detail, on the order of Å.
The models' Runtime and the Difficulty Level of implementing the model are also
very important indicators for performance evaluation.
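The Q3 measure of equation (15.1) is simple to compute once the predicted and observed states are written as strings of equal length. The following is a minimal sketch; the state letters H, E and C and the two example strings are made up for illustration.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


observed = "HHHHEEEECC"    # hypothetical observed states
predicted = "HHHEEEEECC"   # hypothetical prediction, one residue wrong
print(q3(predicted, observed))   # 9 of 10 residues correct: 0.9
```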

The ab-initio methods use energy functions to guide their search and explore
structure space to predict the secondary structures. These techniques are
useful for predicting novel structures for which comparisons against known
structures do not yield useful information. As a result, they start with little
structural (template coverage) and sequence similarity (homology) information.
These methods have a higher level of difficulty (computational model) and a
higher runtime, with a resolution accuracy of 5 - 20 Å. Fold Recognition (FR)
based models do not use homology based comparison; they use fold recognition
alone, with better running time and resolution accuracy and a lower level of
difficulty than the ab-initio methods. The models based on recognising fold
commonalities need higher similarity and template coverage, but produce higher
resolution accuracy. The most widely used methods, the Homology based methods,
need a higher level of similarity (>30%) and template coverage to predict
structure from the library of known structures; their running time is much
lower and their accuracy is higher.

Method Category    Sequence     Template   Prediction   Resolution   Difficulty   Computational
                   Similarity   Coverage   Accuracy     Accuracy     Level        Runtime

Homology           >30%         >90%       ?            1 - 3 Å      Trivial      Seconds
Fold Similarity    20-30%       >75%       ?            2 - 5 Å      Easy         Minutes
Fold Recognition   <20%         >50%       ?            3 - 10 Å     Moderate     Hours
ab initio          <10%         0          ?            5 - 20 Å     Hard         Days

15.10 Protein Databases


15.10.1 Structural Classification Databases
• SCOP - Structural Classification of Proteins
• CATH - Class Architecture Topology Homology (SSAP Algorithm)
• FSSP - Family of Structurally Similar Proteins (DALI Algorithm)
• PClass - Protein Classification (LOCK and 3Dsearch Algorithm)
Chapter 16

Structural Bioinformatics &
Drug Discovery

—Saddam Hossain

Drug Discovery and Drug Design


Drug Discovery (DD) is one of the primary and most promising fields of
application for Bioinformatics research worldwide. Drug Discovery and Design is
the process by which drugs are discovered and designed. The process of drug
discovery involves the identification of candidates, synthesis,
characterization, screening, and assays for therapeutic efficacy. It is thus
the end-to-end process of designing pharmaceutical agents to prevent and/or
cure human disease based on their biological targets. This is a very costly and
time-consuming process, often taking as many as 15-20 years, at a cost of
$700-$800 million. Functional Genomics, Structural Bioinformatics and
Proteomics can work in synergy to speed up this process and lower the total
cost of discovery and development.

Drug: A drug is a molecule of defined composition with a pharmacological
effect, in the sense that it interacts with a target biological molecule in the
body and a physiological effect occurs through that interaction. Drugs can be
beneficial or harmful depending on their effect, but usually a drug is
developed to be beneficial, to treat diseases, especially in humans. Drugs are
chemical compounds, mostly small compounds though some are large, with specific
characteristics: a drug should be safe, effective, deliverable, available,
stable and novel. Drugs must be regulated by an authority such as the Food and
Drug Administration (FDA).

16.1 Traditional Methods of Drug Discovery


The most primitive and most ancient method of drug discovery is finding drugs
in nature, especially in plants and other natural products. For example,
foxglove was used to treat congestive heart failure. Foxglove contains
digitalis, whose active components are cardiotonic glycosides. In ancient times
plant-derived drugs were used to treat specific illnesses/ailments. Time passed
and science evolved at its own pace. This ancient method of drug discovery
moved into the lab: identification and isolation of active compounds started,
synthesis of compounds in the lab became easy, and, together with the
development of chemistry and the knowledge of different compounds, it turned
into the traditional method of drug discovery. Manipulation of chemical
structures to get better drugs with greater efficacy and fewer side effects
also started. This traditional method is also called the empirical method of
drug discovery. The empirical method is a blind hit-or-miss method. It is based
on screening thousands of chemical compounds (from chemical libraries) on a
disease model, mostly in vivo, testing them against the disease without even
knowing the target on which the drug acts or the mechanism of action. This
method works as a "black box" method. Occasionally, serendipitous discoveries
such as the discovery of Penicillin take place using this method.
Identification of the active component has also sped up this process.

16.2 Modern Methods of Drug Discovery


The modern drug discovery process begins with a disease rather than with a
treatment, as the traditional method does. This approach is also called the
Rational Drug Design approach. The industry now has the research tools to
pursue rational drug design successfully, but a new hurdle is being raised:
finding a way to speed up the process in a more cost-effective manner.
Bioinformatics can play a great role in this regard, and structural
bioinformatics helps accelerate structure-based rational drug design.

The modern or rational drug discovery process can be divided into three major
parts: the Exploratory Phase, the Drug Discovery Phase and the Drug Development
Phase. During the exploratory phase, Target Identification & Selection and
Target Validation are done. Later, in the discovery phase, Assay Development,
Lead Identification, Lead Development, Screening and Hits to Leads, and Lead
Optimization are carried out. The development phase consists of Drug
Development, Drug Testing, Preclinical Development, Clinical Trials, Drug
Toxicology and finally NDA and New Drug to Market. The clinical trials may
themselves have different phases, such as clinical trials I, clinical trials
II, clinical trials III, etc. Rational drug discovery is a very long process,
from target identification to commercialization. It sometimes takes 15-20 years
to complete, sometimes even more, and it may cost $700-$800 million.

[Figure: Drug Discovery Phases with timelines]

16.3 Structural Bioinformatics


Structural Bioinformatics (SBI) is a subset of bioinformatics concerned with the
use of biological structures - proteins, DNA, RNA, ligands etc. and complexes
thereof to further our understanding of biological systems. Structural Bioin-
formatics is the first major effort to show the application of the principles and
basic knowledge of the larger field of bioinformatics to questions focusing on
macromolecular structure, such as the prediction of protein structure and how
proteins carry out cellular functions, and how the application of bioinformatics
to these life science issues can improve healthcare by accelerating drug discovery
and development.

Structural Bioinformatics in Drug Design and Discovery: Structural
Bioinformatics can be used to examine drug targets (which are usually proteins)
structurally. This in turn helps the study of the binding of ligands
(protein-ligand docking). These two are the very basics of rational drug
design. Structural Bioinformatics can help the whole process with the power of
computation, saving both money and time.

16.4 Bioinformatics and Drug Discovery Pipeline


Designing new drugs using bioinformatics tools has opened a new area of
research. Computational techniques assist greatly in searching for drug targets
and in designing drugs in silico. The steps of the modern drug discovery
process are discussed briefly below, along with the scope and contribution of
Bioinformatics in the process. Though the contribution of bioinformatics is
most revolutionary in target identification and target validation, it also
plays great roles in the other steps of drug discovery.

16.4.1 Target Identification and Selection


It is necessary to know all about the disease and its existing or traditional
remedies. It is also important to look at very similar afflictions and their
known treatments. Target identification alone is not sufficient to achieve a
successful treatment of a disease. A real drug needs to be developed. This drug
must influence the target protein in such a way that it does not interfere with
normal metabolism. Bioinformatics methods have been developed to virtually
screen the target for compounds that bind and inhibit the protein.

16.4.1.1 Types of Targets


Most targets are biomolecules. They can be enzymes, receptors or ion channels.
A target enzyme is a protein that is necessary for the survival of the
pathogen, or whose activity controls the causes and severity of the disease. A
receptor is another type of target, and it is usually a protein as well. The
drug molecule binds the receptor to cause biological effects; this is also
called the lock and key system. So determining the structure of the receptor is
very important for designing the drug molecule.

When the target is confirmed, modulators of the target can be identified. There
are two types of modulators for each kind of target, they are positive modulators
and negative modulators.

Target          Positive Modulator   Negative Modulator

Enzymes         Activators           Inhibitors
Receptors       Agonists             Antagonists
Ion Channels    Openers              Blockers

Table 16.1: List of Modulators

16.4.2 Target Validation


16.4.3 Assay Development
16.4.4 Lead Identification
16.4.5 Lead Development
16.4.6 Screening and Hits to Leads
16.4.7 Lead Optimization
When a promising lead candidate has been found in a drug discovery program,
the next step (a very long and expensive step) is to optimize the structure and
properties of the potential drug. This usually involves a series of modifications
to the primary structure (scaffold) and secondary structure (moieties) of the
compound. This process can be enhanced using software tools that explore
related compounds (bioisosteres) to the lead candidate. Lead optimization tools
such as WABE offer a rational approach to drug design that can reduce the time
and expense of searching for related compounds.

16.4.8 Drug Development


16.4.9 Drug Testing
Once a drug has been shown to be effective by an initial assay technique, much
more testing must be done before it can be given to human patients. Animal
testing is the primary type of testing at this stage. Eventually, the compounds,
which are deemed suitable at this stage, are sent on to clinical trials. In the
clinical trials, additional side effects may be found and human dosages are de-
termined.

16.4.10 Preclinical Development


Preclinical development is defined in many pharmaceutical companies as the
process of taking a new chemical lead through the stages necessary to allow
it to be tested in human clinical trials, although a broader definition would
encompass the entire process of drug discovery and clinical testing of novel drug
candidates.

16.4.11 Drug Toxicology


16.4.12 Clinical Trials
16.4.13 NDA and New Drug to Market

16.5 High-Throughput Screening (HTS)


Drug companies now have millions of samples of chemical compounds. High-
throughput screening can test 100,000 compounds a day for activity against a
protein target. Maybe tens of thousands of these compounds will show some
activity for the protein. The chemist needs to intelligently select the 2 - 3 classes
of compounds that show the most promise for being drugs to follow-up.

16.6 Ligand-based Drug Design


Search for a lead compound or active ligand. The structure of the ligand guides
the drug design process.

16.7 Computer Aided Drug Design (CADD)


Drug design is a three-dimensional puzzle where small drug molecules, ligands,
are adjusted to the binding site of a protein. The factors which affect the
protein-ligand interaction can be characterized by using molecular docking and
different quantitative structure-activity relationship (QSAR) methods. The most
commonly used tool to model biological systems is molecular dynamics. CADD
works on the model of a receptor refined with molecular dynamics simulations.

Virtual screening is a computational technique to find novel drug candidates.
Data from virtual screening can be used to develop predictive models in order
to optimize the ADMET properties of the candidate molecules. The ultimate goal
of this procedure is to find interesting lead molecules that are worth further
drug research and synthesis.

16.8 Quantitative Structure Activity Relationships (QSAR)

QSAR computes descriptors for the functional groups in a compound. It computes
every possible number and then applies extensive curve fitting to identify drug
activity. The results suggest chemical modifications for synthesis and testing.

16.9 Individual Drug Discovery


Part III

Introduction to
Bioinformatics
Computations

Chapter 17

Statistical and Probabilistic
Methods in Bioinformatics

—Saddam Hossain


17.1 Introduction

Today, many bioinformaticians routinely work with considerably large datasets.
Different statistical and probabilistic techniques are frequently needed to
gain insight into the data, and this practice is increasing as the challenges
in bioinformatics grow every day.


17.2 Concept of Randomness and Variability


17.3 Hypothesis Testing
17.4 Regression Analysis
17.5 Linear Discriminant Analysis
17.6 Naive Bayes Classification
Chapter 18

Computational Methods in
Bioinformatics

—Saddam Hossain

18.1 Exhaustive Search


18.2 Discrete-State Models
18.3 Evolutionary Computation
18.4 Greedy Algorithms
18.5 String Algorithms
18.6 Hybrid Computational Methods

Chapter 19

Bioinformatics Data Mining

—Md. Towhidul Islam


THIS FOLLOWING TEXT IS JUST A THOUGHT OF
MIND

19.1 What is Data Mining


There have been some efforts to define standards for data mining, for example
the 1999 European Cross Industry Standard Process for Data Mining (CRISP-
DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). These are evolving
standards; later versions of these standards are under development.

As data sets have grown in size and complexity, direct hands-on data analy-
sis has increasingly been augmented with indirect, automatic data processing.
This has been aided by other discoveries in computer science, such as

• Neural Networks (NN)
• Clustering
• Genetic Algorithms (1950s)
• Decision Trees (1960s)
• Support Vector Machines (1980s)

Figure 19.1: What is Data Mining

19.2 Data Mining Task


1. Classification

2. Segmentation/Clustering

3. Association

4. Summarization

Things to Ponder
Clustering and classification are sometimes confused. Both of them put data
examples into different groups. The difference is that in classification the
groups are predefined and the task is to decide which group a new data sample
should belong to.

In clustering the types of groups and even the number of groups are not known,
and the task is to find the best way to segment all the data.

The major problem of clustering is deciding the number of clusters. For some
methods, such as the K-Means algorithm, the user must specify the number of
clusters first. The centers of the clusters are then chosen arbitrarily, and
the data samples iteratively move between clusters until the assignment
converges. If the user is not satisfied with the results, another number of
clusters is tried. So this kind of method is a trial-and-error process.
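The K-Means procedure described above can be sketched in plain Python. As an illustrative assumption, the "arbitrary" initial centers are simply the first k data points, which keeps the example deterministic; the data points are made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))


def k_means(points, k, max_iterations=100):
    """Return (centers, assignment) after iterating until nothing moves."""
    centers = [tuple(p) for p in points[:k]]   # arbitrary choice: first k points
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [nearest_center(p, centers) for p in points]
        if new_assignment == assignment:       # converged: no sample changed cluster
            break
        assignment = new_assignment
        # Move each center to the mean of the samples assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment


# Hypothetical 2-D samples forming two obvious groups.
samples = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centers, assignment = k_means(samples, k=2)
print(assignment)   # [0, 1, 0, 1, 0, 1]
print(centers)
```

Running the function again with a different `k` is exactly the trial-and-error step mentioned in the text.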

Another solution to the problem of deciding the number of clusters is to use
hierarchical clustering. There are two approaches for this: the top-down
approach and the bottom-up approach. In the top-down approach, the data are
separated into different clusters based on different criteria of the similarity
measurement. At first, all the data belong to one big cluster. At last, each
data sample is a single cluster. There are different levels of clustering in
between the two. In the bottom-up approach, the data are grouped into
progressively larger clusters, starting from clusters of a single data sample,
until all the data are in one cluster.
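The bottom-up approach can be illustrated on one-dimensional data. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members), which is one of several possible similarity criteria; the values are made up.

```python
def agglomerative_levels(values):
    """Bottom-up clustering of numbers; returns the clusters at every level."""
    clusters = [[v] for v in sorted(values)]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # With single linkage on sorted 1-D data the closest pair of clusters
        # is always adjacent, so only neighbouring gaps need to be compared.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]   # merge the pair
        levels.append([list(c) for c in clusters])
    return levels


for level in agglomerative_levels([10, 1, 30, 3, 11]):
    print(level)
```

The first level has every sample in its own cluster and the last level has a single cluster; the user picks the level in between that gives a useful number of clusters.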

Association: Another task of data mining is to search for a set of data within
which a subset is dependent on the rest of the set. x -> y means: if sequence
x is present in a specific part of a genome, then so is sequence y.
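How strongly the data back a rule x -> y is usually measured by its support and confidence, two standard association-rule measures that the text does not define. The sketch below assumes each genome region is represented as a set of the sequences found in it; the regions and motif names are toy data.

```python
def rule_support_confidence(regions, x, y):
    """Support and confidence of the rule x -> y over a list of sets."""
    with_x = [r for r in regions if x in r]
    with_both = [r for r in with_x if y in r]
    support = len(with_both) / len(regions)          # how often x and y co-occur
    confidence = len(with_both) / len(with_x) if with_x else 0.0
    return support, confidence


# Toy data: which motifs were found in each of four hypothetical regions.
regions = [
    {"TATA", "CAAT"},
    {"TATA", "CAAT", "GC"},
    {"TATA"},
    {"GC"},
]
print(rule_support_confidence(regions, "TATA", "CAAT"))   # (0.5, 0.666...)
```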

Fields of Data Mining

1. Database and Data warehousing

2. Statistics

3. Machine Learning

Statistics: —Bayesian classifier.

Regression: Regression builds a mathematical model for some known temporal data
and uses the model to predict the upcoming values. This task can be very
difficult because the "trend" of the data is usually nonlinear and very
complicated. Many parameter estimates are involved if the model is nonlinear.
Therefore, to simplify the problem, a linear model is usually used. A linear
model uses a straight line to estimate the trend of the data. This is called
linear regression. For a set of data with a nonlinear nature, it can be assumed
that the data trend is piecewise linear. That is, over a small period, the data
trend is approximately a straight line.
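A straight-line fit by ordinary least squares can be sketched as follows. The function name `fit_line` and the measurements are illustrative assumptions; the formulas are the usual closed-form least-squares estimates of slope and intercept.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements taken at five time points.
times = [0, 1, 2, 3, 4]
values = [1.0, 3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(times, values)
print(slope, intercept)              # about 1.97 and 1.06
print(slope * 5 + intercept)         # predicted value at the next time point
```

For piecewise linear data the same fit would be applied separately to each short period.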

Machine Learning: Machine learning is a long-established field in artificial
intelligence (AI). It focuses on automatic learning from a data set. A suitable
model with many parameters is first built for a certain domain problem, and an
error measure is defined.

A learning (training) procedure is then used to adjust the parameters according
to the predefined error measure. The purpose is to fit the model to the data.
There are different theories for the learning procedure, including

• gradient descent,
• expectation maximization (EM) algorithms,
• simulated annealing, and
• evolutionary algorithms.

The learning procedure is repeated until the error measure reaches zero or
is minimized. After the learning procedure is completed with the training data,
the parameters are set and kept unchanged and the model can be used to pre-
dict or classify new data samples.
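The training loop described above can be sketched with gradient descent on the simplest possible model, a line with two parameters. The model, the mean squared error measure, the learning rate and the training data are all illustrative assumptions.

```python
def train(xs, ys, learning_rate=0.05, max_epochs=5000, tolerance=1e-10):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0                                   # model parameters
    n = len(xs)
    previous_error = float("inf")
    for _ in range(max_epochs):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        error = sum(r * r for r in residuals) / n     # the error measure
        if previous_error - error < tolerance:        # error no longer improving
            break
        previous_error = error
        # Adjust both parameters against the gradient of the error.
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Training data generated from y = 2x + 1.
w, b = train([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(w, b)                # close to 2 and 1
print(w * 10 + b)          # the fixed parameters predict a new sample
```

After training, `w` and `b` are kept unchanged and only used for prediction, as the text describes.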

Different learning schemes have been developed and discussed in the machine
learning literature. Important issues include the learning speed, the guarantee
of convergence, and how the data can be learned incrementally. There are two
categories of learning schemes:

1. supervised learning and

2. unsupervised learning.

Supervised learning learns from data that come with an answer. That is, the
parameters are modified according to the difference between the real output and
the desired output (the expected answer). The classification problem falls into
this category.

On the other hand, unsupervised learning learns without any knowledge of the
outcome. Clustering belongs to this category. It finds data with similar
attributes and puts them in the same cluster.

Various models like neural networks (NN), decision trees (DT), genetic
algorithms (GA), fuzzy systems, and support vector machines (SVM) have proved
very useful in classification and clustering problems. But machine learning
techniques usually handle relatively small data sets because the learning
procedure is normally very time-consuming. To apply the techniques to data
mining tasks, the problem of handling large data sets must be overcome.

19.3 Association Rules Mining


19.4 Decision Tree
19.5 Clustering
19.6 Classification
19.7 Fuzzy Classification
19.8 Nearest Neighbor Classification
19.9 Support Vector Machine
19.10 Pattern Recognition
19.11 Machine Learning Approaches
Chapter 20

Some Algorithms in
Bioinformatics

—Zohirul Alam Tiemoon

20.1 BLAST
20.2 FASTA
20.3 CLUSTALW
20.4 PHD
20.5 Predator
20.6 TRILOGY
20.7 Gibbs Sampler
20.8 DALI

Part IV

Some Widely Used
Methods & Models in
Bioinformatics

Chapter 21

Dynamic Programming
And Bioinformatics

—Saddam Hossain

21.1 Dynamic Programming


Dynamic Programming is an application of recursion-based calculation that
speeds up the solution of dynamic optimization problems. In Dynamic Programming
the original problem is divided into small subproblems, and the results of the
small subproblems are linked together to solve the original problem, in such a
way that the optimal solution of a subproblem is also optimal for the problem
as a whole. Richard Bellman proposed the concept and mathematical theory of
dynamic programming in 1953, to calculate a cost/score progressively from
smaller solutions. Dynamic programming is used to find the optimal solution of
multivariable problems for which exhaustive computation would be enormous.

21.1.1 Concept of Dynamic Programming


The concept of Dynamic Programming can be understood through the classical
example of "The Change Problem", where the objective is to change an amount of
money using the fewest coins/notes from the available denominations. For
example, let us assume we have notes of 1, 2 and 5 Tk and need to make change
for 7 Tk with the minimum number of notes from the three denominations. With
exhaustive search, all possible combinations of notes would be tried to find
the solution. But this is not feasible when there are many variables
(denominations) and large numbers (amounts to change). We can instead solve the
problem optimally without enumerating all combinatorial options. We already
have the best solution for 1, 2 and 5 Tk change: a single note each.


Initial Subproblem with optimal solution


Best Solution for 1 Tk Change = One(1) 1 Tk Note [=1*1Tk]
Best Solution for 2 Tk Change = One(1) 2 Tk Note [=1*2Tk]
Best Solution for 5 Tk Change = One(1) 5 Tk Note [=1*5Tk]

Next, the change for 3 Tk can be obtained by finding the optimal solution among
the options

a. One(1) 1 Tk Note + Best Solution for 2 Tk(3-1=2) Change [=1-Note]
b. One(1) 2 Tk Note + Best Solution for 1 Tk(3-2=1) Change [=1-Note]

Either solution (a or b) is optimal, and both use the same notes. So our
revised solution matrix is

Best Solution for 1 Tk Change = 1-Note One(1) 1 Tk Note [=1*1Tk]
Best Solution for 2 Tk Change = 1-Note One(1) 2 Tk Note [=1*2Tk]
Best Solution for 3 Tk Change = 2-Notes One(1) 1 Tk Note + One(1) 2 Tk Note [=1*1Tk + 1*2Tk]
Best Solution for 5 Tk Change = 1-Note One(1) 5 Tk Note [=1*5Tk]

In the same way, the best solution for 4 Tk is chosen from

a. One(1) 1 Tk Note + Best Solution for 3 Tk(4-1=3) Change [=2-Notes]
b. One(1) 2 Tk Note + Best Solution for 2 Tk(4-2=2) Change [=1-Note]

Solution b is best, so the progressively built-up solution to this point is

Best Solution for 1 Tk Change = 1-Note One(1) 1 Tk Note [=1*1Tk]
Best Solution for 2 Tk Change = 1-Note One(1) 2 Tk Note [=1*2Tk]
Best Solution for 3 Tk Change = 2-Notes One(1) 1 Tk Note + One(1) 2 Tk Note [=1*1Tk + 1*2Tk]
Best Solution for 4 Tk Change = 2-Notes One(1) 2 Tk Note + One(1) 2 Tk Note [=1*2Tk + 1*2Tk]
Best Solution for 5 Tk Change = 1-Note One(1) 5 Tk Note [=1*5Tk]

If we go on in the same fashion, solving each subproblem progressively (6 Tk,
then 7 Tk), we will have the solution for 7 Tk change with the following result.

Best Solution for 1 Tk Change = 1-Note One(1) 1 Tk Note [=1*1Tk]
Best Solution for 2 Tk Change = 1-Note One(1) 2 Tk Note [=1*2Tk]
Best Solution for 3 Tk Change = 2-Notes One(1) 1 Tk Note + One(1) 2 Tk Note [=1*1Tk + 1*2Tk]
Best Solution for 4 Tk Change = 2-Notes One(1) 2 Tk Note + One(1) 2 Tk Note [=1*2Tk + 1*2Tk]
Best Solution for 5 Tk Change = 1-Note One(1) 5 Tk Note [=1*5Tk]
Best Solution for 6 Tk Change = 2-Notes One(1) 1 Tk Note + One(1) 5 Tk Note [=1*1Tk + 1*5Tk]
Best Solution for 7 Tk Change = 2-Notes One(1) 2 Tk Note + One(1) 5 Tk Note [=1*2Tk + 1*5Tk]
...

A dynamic programming algorithm proceeds in this way: it solves small problems
first, then combines their solutions to solve larger problems. Dynamic
programming can be thought of as a bottom-up method of problem solving.
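
The bottom-up table built above can be sketched in a few lines of Python. This is only an illustrative sketch, not the handbook's own algorithm listing: the function name min_notes and the helper list last are assumptions introduced here, and the denominations (1, 2 and 5 Tk) are taken from the example.

```python
def min_notes(amount, denominations=(1, 2, 5)):
    """Bottom-up DP: best[a] is the fewest notes that add up to a Tk."""
    INF = float("inf")
    best = [0] + [INF] * amount       # 0 notes are needed for 0 Tk
    last = [None] * (amount + 1)      # last note used in the best solution for a
    for a in range(1, amount + 1):
        for d in denominations:
            # Option: one note of value d + best solution for (a - d) Tk
            if d <= a and best[a - d] + 1 < best[a]:
                best[a] = best[a - d] + 1
                last[a] = d
    # Walk back through `last` to list the notes that were used.
    notes = []
    a = amount
    while a > 0 and last[a] is not None:
        notes.append(last[a])
        a -= last[a]
    return best[amount], sorted(notes)


for amount in range(1, 8):
    print(amount, "Tk ->", min_notes(amount))
```

Each amount is solved once and reused by all larger amounts, which is why no enumeration of all note combinations is needed.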

Some Computational Tools in Brief

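Edit Distance: Though the term Edit Distance was first coined (Levenshtein,
1966) in the study of strings, the concept is now used in DNA sequence
alignment with or without some customizations. The Edit Distance of two strings
is the minimum number of edit operations (insertion or deletion, i.e. indel,
and substitution of a symbol) needed to transform one string into the other.
Every alignment of the two strings implies a number of edit operations, and the
Edit Distance is the smallest such number over all alignments. Two example
alignments and the number of operations each implies are shown below.

A-TGCA
AGTC-A

indel = 2
substitution = 1
Edit operations = 3 (=2+1)

-ATGT-C
CAG--GC

indel = 4
substitution = 1
Edit operations = 5 (=4+1)

Neither alignment is minimal: for example, the first pair can also be aligned
without gaps using two substitutions (T:G and G:T), so its Edit Distance is 2.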
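
As a hedged illustration, the minimum number of edit operations can be computed with the standard dynamic programming table for the Levenshtein distance. The function name edit_distance is an assumption made for this sketch, and every operation (insertion, deletion, substitution) is given unit cost.

```python
def edit_distance(u, v):
    """Levenshtein distance with unit cost per insertion, deletion, substitution."""
    m, n = len(u), len(v)
    # dist[i][j] = edit distance between the prefixes u[:i] and v[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                # delete all i symbols of u[:i]
    for j in range(n + 1):
        dist[0][j] = j                # insert all j symbols of v[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if u[i - 1] == v[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match / substitution
    return dist[m][n]


print(edit_distance("ATGCA", "AGTCA"))
print(edit_distance("ATGTC", "CAGGC"))
```

For the two string pairs used in the examples, the table gives distances smaller than the operation counts of the alignments drawn above, because those alignments are not the cheapest ones.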

Edit Graph: An Edit Graph is a two-dimensional grid that shows graphically
all possible alignment paths between two strings, considering the match and
indel operations of the alignment.

Figure 21.1: Edit Graph (This is a Temp-Pic, New To be Drawn Later)

21.1.2 Dynamic Programming Algorithm and Sequence Alignment

Let us take two DNA (or protein) sequences, U and V, of length m and n
respectively. An alignment of these two sequences can be expressed through a
special type of matrix named the Alignment Matrix, which has two rows and at
most (m+n) columns. In this matrix a column may have values in both rows,
reflecting a match or a substitution, or one of the rows may be blank,
reflecting an indel (insertion or deletion) operation. An example of an
alignment is shown below.

U=AAG
V=AGC

U=AAG-
V=-AGC

Two Matches (A:A, G:G)


Zero Substitution ()
Two Indels (A:-, -:C)

There may be many possible alignments, but the target is to find the best
(optimal) alignment. To grade alignments, a scoring mechanism is needed. The
scoring mechanism awards matches and penalizes substitutions and indels. The
highest-scoring alignment is chosen as the best, or optimal, alignment.

If a provision for substitution is incorporated in the Edit-Distance concept,
it turns into a sequence alignment problem. This problem can be solved by
progressively building an Alignment-Scoring-Matrix (an [m+1] x [n+1] matrix),
starting from the 0-length alignment up to an alignment of length N
(N <= m+n). Each column of the final Alignment Matrix represents a Match,
Substitution, Insert or Delete operation used to align the corresponding
sequences.

Figure 21.2: Alignment Operations (This is a Temp-Pic, New To be Drawn Later)

Each cell (ASM[i,j]; 0 <= i <= m, 0 <= j <= n) of the Alignment Scoring Matrix
(ASM) contains the best score for aligning the i-length subsequence (prefix) of
U with the j-length subsequence (prefix) of V, together with its direction of
alignment.

Let us start with the assumption that the 0-length subsequences are aligned and
set ASM[0,0] = 0 as the initial alignment score.

There are three different choices (paths) for filling every cell ASM[i,j] of
the matrix: a diagonal step (match or substitution), or a step along the row or
the column (insertion or deletion). The indel and substitution penalties and
the match award can be derived from a predefined indel/substitution/match
scoring matrix. (For our illustration let us assume that each indel costs 1,
each substitution costs 2, and each match awards 2 points.)

Then fill the first row and the first column with insertion and deletion
operations respectively. All other cells are then filled using the recurrence
function of the score.

After completion of the ASM, backtracking along the path of alignment from the
end (ASM[m,n]) to the start (ASM[0,0]) yields the optimal alignment.
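
The procedure can be sketched as follows. This is a minimal illustration under the scoring assumed above (match +2, substitution -2, indel -1); the function name global_align and the tie-breaking order during backtracking (diagonal first, then a gap in V, then a gap in U) are assumptions made for the sketch, not part of the text.

```python
def global_align(u, v, match=2, substitution=-2, indel=-1):
    """Fill the Alignment Scoring Matrix (ASM) and backtrack one optimal alignment."""
    m, n = len(u), len(v)

    def pair_score(a, b):
        return match if a == b else substitution

    asm = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):             # first column: symbols of U against gaps
        asm[i][0] = asm[i - 1][0] + indel
    for j in range(1, n + 1):             # first row: symbols of V against gaps
        asm[0][j] = asm[0][j - 1] + indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            asm[i][j] = max(asm[i - 1][j - 1] + pair_score(u[i - 1], v[j - 1]),
                            asm[i - 1][j] + indel,
                            asm[i][j - 1] + indel)

    # Backtrack from ASM[m][n] to ASM[0][0], rebuilding the two alignment rows.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and asm[i][j] == asm[i - 1][j - 1] + pair_score(u[i - 1], v[j - 1])):
            top.append(u[i - 1]); bottom.append(v[j - 1]); i -= 1; j -= 1
        elif i > 0 and asm[i][j] == asm[i - 1][j] + indel:
            top.append(u[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(v[j - 1]); j -= 1
    return asm[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_u, row_v = global_align("AAG", "AGC")
print(score)
print(row_u)
print(row_v)
```

For U = AAG and V = AGC this reproduces the alignment shown earlier (two matches and two indels).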

21.1.3 Algorithm in Pseudocode

The alignment procedure described above can be drafted as an algorithm in
two parts: (i) building the Alignment-Scoring-Matrix and (ii) building the
Alignment Matrix from the ASM. The pseudocode is as below.

Description for this Section will be written Later

Figure 21.3: ASM Matrix Building (This is a Temp-Pic, New To be Drawn Later)

Figure 21.4: ASM Matrix Cell Building (This is a Temp-Pic, New To be Drawn
Later)

21.1.4 Global Alignment & DP


The alignment described above aligns both sequences from end to end, resulting
in a global alignment of the two sequences.

21.1.5 Local Alignment & DP


In 1981 Temple Smith and Michael Waterman proposed a clever modification
of the global sequence alignment dynamic programming algorithm that solves
the Local Alignment problem.

21.1.6 Alignment with Gap Penalty


Mutations are usually caused by errors in DNA replication. Nature frequently
deletes or inserts entire substrings as a unit, as opposed to deleting or
inserting individual nucleotides. A gap in an alignment is defined as a
contiguous sequence of spaces in one of the rows. Since insertions and
deletions of substrings are common evolutionary events, penalizing a gap of
length x as -sx is cruel and unusual punishment. Many practical alignment
algorithms use a softer approach to gap penalties and penalize a gap of x
spaces by a function that grows slower than the sum of penalties for x indels.

Figure 21.5: Dynamic Scoring Function (This is a Temp-Pic, New To be Drawn
Later)

Figure 21.6: BLOSUM50 Matrix (This is a Temp-Pic, New To be Drawn Later)

To this end, we define affine gap penalties to be a linearly weighted score for
large gaps. We can set the score for a gap of length x to be -(ρ + sx), where
ρ > 0 is the penalty for the introduction of the gap and s > 0 is the penalty
for each symbol in the gap (ρ is typically large while s is typically small).
Though this may seem to complicate our alignment approach, it turns out that
the edit graph representation of the problem is robust enough to accommodate it.
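
A small sketch may make the effect of the gap-opening penalty concrete. The helper names and the parameter values (rho = 4 for opening a gap, s = 1 per gap symbol) are assumptions chosen only for illustration. The sketch scores the gaps in one row of an alignment under the affine scheme, so that one long gap is penalized less than several scattered gaps with the same total number of spaces.

```python
import re


def gap_lengths(row):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", row)]


def affine_gap_score(row, rho=4, s=1):
    """Each gap of length x scores -(rho + s*x)."""
    return -sum(rho + s * x for x in gap_lengths(row))


def per_symbol_gap_score(row, s=1):
    """Each gap of length x scores -s*x (every space penalized independently)."""
    return -sum(s * x for x in gap_lengths(row))


for row in ("AC---GT", "A-C-G-T"):
    print(row, affine_gap_score(row), per_symbol_gap_score(row))
```

Both rows contain three spaces, so the per-symbol score cannot tell them apart, while the affine score favours the single contiguous gap.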

Figure 21.7: DP Algo(This is a Temp-Pic, New To be Drawn Later)

21.1.7 Multiple Alignment & DP


21.1.8 Other Applications of Dynamic Programming
• Prediction of RNA Secondary Structure

Figure 21.8: Local Alignment 1(This is a Temp-Pic, New To be Drawn Later)



Figure 21.9: Local Alignment 2(This is a Temp-Pic, New To be Drawn Later)

Figure 21.10: Gap Penalty(This is a Temp-Pic, New To be Drawn Later)

Figure 21.11: Multiple Alignment(This is a Temp-Pic, New To be Drawn Later)


Chapter 22

Neural Network And Bioinformatics

—Md. Towhidul Islam


22.1 Machine Learning

A basic learning model typically consists of the following four components:

• Learning element, responsible for improving its performance.

• Performance element, which decides the choice of actions to be taken.

• Critical element, which tells the learning element how the algorithm
performs, and

• Problem generator, responsible for suggesting actions that could lead to
new or informative experiences.

Machine learning typically can be divided into three phases, as follows:

1. Analysis of a training set of examples and generation of a set of rules from
the training set,
2. Verification of the rules by human experts or automatic knowledge-based
components, and
3. Use of the validated rules in responding to some new testing datasets
(Finlay and Dix 1996).

22.1.1 Why Machine Learning in Bioinformatics


There are a number of reasons why machine learning approaches are widely used
in practice, especially in bioinformatics (Narayanan et al., 2002; Nilsson,
1996; Baldi and Brunak, 2001; and Westhead et al., 2002):

• Traditionally, a human being builds an expert system by collecting
knowledge from specific experts. The experts can always explain what
factors they use to assess a situation; however, it is often difficult for the
experts to say what rules they use, for example, for disease analysis and
control. This problem can be resolved by machine learning mechanisms.
Machine learning can extract the description of the hidden situation in
terms of those factors and then fire rules that match the experts' behavior.
• Systems often produce results different from the desired ones.
This may be caused by unknown properties or functions of inputs during
the design of systems. This situation always occurs in the biological world
because of the complexities and mysteries of life sciences. However,
with its capability of dynamic improvement, machine learning can cope
with this problem.
• In molecular biology research, new data and concepts are generated
every day, and those new data and concepts update or replace the old
ones. Machine learning can be easily adapted to a changing environment.
This benefits system designers, as they do not need to redesign systems
whenever the environment changes.
• Missing and noisy data are characteristic of biological data. Conventional
computer techniques often fail to handle this. Machine learning
techniques are able to deal with missing and noisy data.
• With advances in biotechnology, huge volumes of biological data are gen-
erated. In addition, it is possible that important hidden relationships and
correlations exist in the data. Machine learning methods are designed to
handle very large data sets, and can be used to extract such relationships.

22.2 Artificial Neural Network


The human brain has been studied since the late Middle Ages; however, its
detailed structure began to be unraveled only in the nineteenth century.
Specialists claim that the brain is a collection of about 10 billion densely
interconnected cellular units called neurons. The structure of a neuron and its
network is shown in Figure 22.1. Each neuron consists of a cell body called the
soma, a number of root-like extensions called dendrites that connect it to
about a thousand adjacent neurons, and a single transmission line extending out
from the soma called the axon. These two kinds of specialized extensions of the
soma are responsible for carrying information to and from the cell body:
dendrites bring information to the cell body and axons take information away
from it. The connection between two neurons, in particular between an axon
terminal and another neuron, is called a synapse.

Adapted from: http://ffden2.phys.uaf.edu/-212 fall2003


.web.dir/Keith Palchikoff/Intro page.html

Figure 22.1: Biological Neural Network

Each neuron uses biochemical reactions to receive, process and transmit
information. Neurons communicate with each other through an electrochemical
process. This means that chemicals create an electrical signal. When a neuron
does not send a signal, it is in a resting state, and the inside of the neuron
has a negative electric potential. When a neuron sends a signal, it causes a
change in the electrical potential of the cell body. The change occurs due to
the release of chemical substances, called neurotransmitters, from the synaptic
cell. When the potential exceeds a certain threshold, an action potential
occurs. Consequently, the neuron fires the electrical signal down the axon. The
occurrence of action potentials can be increased or decreased by changing the
constitution of various neurotransmitters.

An essential characteristic of biological neural networks is plasticity, the
ability of the brain to reorganize with learning, based on experience or
sensory stimulation. Scientists believe that there are two types of
modifications that form the basis of learning in the brain, namely,

1. a change in the internal structure of the synapses and

2. an increase in the number of synapses between neurons.

22.3 Neural Network Architecture


22.3.1 Feed-Forward Neural Networks
A perceptron is the most basic and the simplest feed-forward neural network
model. It consists of an input layer and a single output layer of processing
units called nodes. Input values presented to neurons in the input layer are
mapped directly to neurons in the output layer. There are no intermediate
processing steps. Each input is associated with a weight to reflect the
significance of the input to the output. Given a set of training patterns that
consist of exemplar "input" and "desired output" pairs, the perceptron is
trained by feeding the input patterns to it and minimizing the error between
its outputs and the desired outputs. Since the perceptron performs a direct
mapping of input to output, it is a linear classifier: its weights define a
hyperplane that divides the input space into regions of pattern classes. The
perceptron is, therefore, incapable of performing tasks that require nonlinear
mappings between input and output.
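
The following sketch illustrates this limitation. It uses the classical perceptron learning rule with a step activation, which is an assumption made here (the text does not fix a particular update rule); the function names, the learning rate and the number of epochs are likewise illustrative. The perceptron learns the linearly separable AND function but cannot reproduce XOR, whatever its weights.

```python
def predict(weights, bias, inputs):
    """Step activation on the weighted sum of the inputs."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Classical perceptron rule: move the weights by rate * error * input."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for name, samples in (("AND", AND), ("XOR", XOR)):
    weights, bias = train_perceptron(samples)
    outputs = [predict(weights, bias, x) for x, _ in samples]
    print(name, outputs, "targets", [t for _, t in samples])
```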

For more complicated problems, a linear hyperplane is not good enough as a
separator. A nonlinear surface that separates the classes is used instead. This
can be achieved by the multi-layer perceptron (MLP), a feed-forward network
that consists of at least three layers of nodes, or neurons. Besides having an
input layer and an output layer, the MLP has one (or several) hidden layer(s)
in the middle. Such networks have a similar structure or topology, as shown in
Figure 22.2.

Input data is a long continuous-valued vector that contains n elements,
$x = (x_1, x_2, \ldots, x_n)$. The number of elements n can be considered the
length of the input, and is determined by the problem specification. Each
hidden neuron ($i = 1, 2, \ldots, m$) stores an exemplar training sample
faithfully as its weight vector $w_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,n})$.
The value of a hidden neuron i is computed from the inputs as

$$h_i = F\Big(\sum_n W_{i,n} X_n\Big) \qquad (22.1)$$

The architecture of a multi-layer perceptron

Figure 22.2: Multi-Layer Perceptron

where $X_n$ denotes the nth input and $W_{i,n}$ denotes the weights between the
input and hidden layers.

The hidden neurons are then used as inputs for the output y

$$y_i = G\Big(\sum_n V_{i,n} h_n\Big) \qquad (22.2)$$

where $V_{i,n}$ denotes the weights between the hidden and output layers. The
activation function F or G is usually a sigmoid or logistic function, which is
differentiable and contributes to stability in neural network learning
(Narayanan et al., 2003a).

Despite the simplicity of a neural network, the summation function can be more
complex than just the simple sum of the products of inputs and their weights.
The specific algorithm used to combine neural inputs is determined by the
chosen network architecture and hypothesis.
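
Equations (22.1) and (22.2) amount to two weighted sums, each passed through an activation function. A minimal sketch of this forward pass is given below; the function names and the example weights are assumptions made for illustration, and both F and G are taken to be the logistic sigmoid.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, V, F=sigmoid, G=sigmoid):
    """Forward pass of a one-hidden-layer MLP.

    W[i][n] is the weight from input n to hidden neuron i   (Eq. 22.1)
    V[i][n] is the weight from hidden neuron n to output i  (Eq. 22.2)
    """
    h = [F(sum(w * x_n for w, x_n in zip(row, x))) for row in W]
    y = [G(sum(v * h_n for v, h_n in zip(row, h))) for row in V]
    return h, y


# Hypothetical weights: 3 inputs, 2 hidden neurons, 1 output neuron.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5]]
V = [[1.0, -1.0]]
hidden, output = forward([1.0, 0.0, 1.0], W, V)
print(hidden, output)
```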

22.3.2 Training of Feed-Forward Neural Networks


Once a network has been structured for a problem specification, training of
the network is the next step. Training the network means finding the weights
that minimize the error. The initial weights are allocated randomly. Then, the
training, or learning, begins. A commonly used error measure is defined by

$$E = \sqrt{\sum_i (t_i - O_i)^2} \qquad (22.3)$$

where $t_i$ is the target output and $O_i$ is the actual output. The steps used
to find the weights that minimize the error are:

• choose the initial weights randomly and present a sample of input values,

• compare the actual output value with the target output value,

• calculate the error, and

• modify the weights so that the actual output is closer to the target output
next time, with smaller error.

This process is repeated for all samples in the dataset, and then repeated
until the output error for all the samples reaches an acceptably low value,
which indicates the end point of the training. Once the training is finished,
the trained neural network can be tested using the rest of the data set, which
was not used during the training phase. If the testing is not satisfactory,
further modification of the weights has to be done. Otherwise, the output value
of the tested data is preserved for any decision making.
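
The training steps can be sketched for the simplest case of a single sigmoid neuron. The update used here is the standard delta rule for a sigmoid unit, which is an assumption, since the text does not specify how the weights are modified; the function names, learning rate, stopping tolerance, random seed and the toy OR dataset are also illustrative choices. For brevity the sketch trains on all four samples and omits the separate testing phase.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(weights, inputs):
    """weights[0] is a bias weight; the rest pair up with the inputs."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], inputs)))


def total_error(weights, samples):
    """Error measure of Eq. (22.3) over all samples."""
    return math.sqrt(sum((t - output(weights, x)) ** 2 for x, t in samples))


def train(samples, rate=0.5, tolerance=0.2, max_epochs=10000, seed=0):
    rng = random.Random(seed)
    # Step 1: choose the initial weights randomly.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(len(samples[0][0]) + 1)]
    for _ in range(max_epochs):
        if total_error(weights, samples) < tolerance:
            break                         # error is acceptably low: stop training
        for x, t in samples:
            o = output(weights, x)        # Step 2: compare actual and target output
            delta = (t - o) * o * (1 - o) # Step 3: error signal (delta rule)
            weights[0] += rate * delta    # Step 4: modify the weights
            for k, x_k in enumerate(x):
                weights[k + 1] += rate * delta * x_k
    return weights


samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]   # logical OR
weights = train(samples)
print([round(output(weights, x)) for x, _ in samples])
print(total_error(weights, samples))
```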

22.4 Neural Network Learning Algorithms


There are many different types of neural networks. Based on the type of learn-
ing, they can be categorized into supervised and unsupervised neural networks.

22.4.1 Supervised Learning Neural Networks


Most neural networks are trained with supervised training algorithms. This
means that the desired output must be provided for each input used in the
training. In other words, both the inputs and the outputs are already known.
In supervised training, a network processes the inputs and compares its actual
outputs against the expected outputs. Errors are then propagated back through
the network, and the weights that control the network are changed. This process
is repeated until the errors are minimized. This means that the same dataset
is processed many times while the weights between the layers of the network
are being refined during the training of the network. The architecture of a
supervised neural network typically includes three layers, namely, an input
layer, an output layer and a hidden layer in the middle.

Support vector machines (SVMs) are supervised machine learning methods. Since
the SVM is well known as a training algorithm for learning classification from
data, SVMs, as one of the major supervised learning methods, are widely used
for classification and pattern recognition problems in bioinformatics (Vapnik,
1995, and Cristianini and Shawe-Taylor, 2000).

The theory of SVMs can be applied to the classification of yeast microarray
expression data. When the misclassification rates of SVMs were compared with
those of other machine learning approaches, SVMs were found to be the best
performing method (Brown et al., 2000). In addition to their use for evaluating
microarray expression data, SVMs have been shown to perform well in multiple
areas of biological analysis, including detecting remote protein homologies
(Jaakkola, 1999) and recognizing translation initiation sites. SVMs can also be
used to analyze expression data (Furey et al., 2000). Gene expression data is
usually high-dimensional, which constitutes a serious problem for several
machine learning methods. Dimensionality reduction can be used, but it often
leads to information loss and performance degradation. Fortunately, SVMs can
overcome this problem, as they can generalize well on high-dimensional data
(Valentini, 2002).

22.4.2 Unsupervised Learning Neural Networks


The learning algorithm used in unsupervised neural networks is an unsupervised
learning algorithm. In unsupervised training, the network is provided only with
inputs, while the expected output is unknown. The neural network must itself
choose features to group the input data, without being given target outputs
(Agatonovic-Kustrin and Beresford, 2000). Once an unsupervised neural network
has been trained, it must be tested to show that the network really represents
the data; the data is expected to be well represented in clusters.

A self-organizing network known as the self-organizing map (SOM), or Kohonen
network, is the most common algorithm used in unsupervised neural networks
(Kohonen, 1982). It is different from the supervised learning described
earlier. The neighborhood of a neuron is used to find and group data that are
similar. The grouped neurons are arranged in a matrix pattern called a map.
Every input neuron is connected to the neurons in this map. Finally, these
neurons form the output of the neural network.

The SOM consists of an input layer and a competitive output layer. The
output layer is normally organized into a two-dimensional grid of fully
connected neurons, as illustrated in Fig. 22.3. The input vectors are fed into
the input layer and mapped to competitive neurons in the output layer. The
competitive learning algorithm in the output layer ensures that similar input
vectors are mapped to competitive neurons that are closer to each other in the
grid than those of dissimilar ones. In the SOM, input vectors in a
high-dimensional space are, therefore, projected onto a two-dimensional output
space based on their spatial similarities. Similar input patterns are clustered
into one small region in the grid of the output layer.
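
A very small sketch of competitive learning on a grid is given below. It is only one common way of writing a SOM update, under assumptions that are not in the text: a Gaussian neighborhood of fixed radius, a learning rate that decays linearly over the epochs, random initial weights, and hypothetical function names and toy data.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position (row, col) of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def train_som(data, rows=3, cols=3, epochs=50, rate=0.5, radius=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per neuron of the two-dimensional output grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # learning rate shrinks over time
        for x in data:
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                # Neurons close to the winner on the grid are pulled more strongly.
                grid_dist = math.dist(pos, bmu)
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                step = rate * decay * influence
                for k in range(dim):
                    w[k] += step * (x[k] - w[k])
    return weights


data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data)
for x in data:
    print(x, "->", best_matching_unit(som, x))
```

After training, each input vector is represented by the grid position of its best matching neuron, which is how the map projects the data onto two dimensions.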
The SOM is widely used as a data mining and visualization method in
bioinformatics. It is reported to be a more robust and accurate method for
clustering large amounts of noisy data than hierarchical clustering methods
are when analyzing gene expression data. In the analysis of the Stanford yeast
gene expression dataset using SOMs, the best performance of gene expression
analysis was the result of combining clustering and visualization methods
(Torkkola et al., 2001). SOMs can be used to reduce the amount of data through
clustering and, simultaneously, to construct a nonlinear projection of the data
onto a low-dimensional display. Therefore, SOMs can be used to combine two
aspects of gene analysis, namely, clustering and visualization.

Self-organizing map (adapted from Narayanan et al., 2003a)

Figure 22.3: Self-Organizing Map

Nevertheless, this approach presents several problems (Fritzke, 1994). They
are as follows:

• As the SOM is a topology-preserving neural network, the number of clusters
is arbitrarily fixed from the beginning. Therefore, the clustering obtained
is not proportional to the heterogeneity of the data.
• The lack of a tree structure makes it impossible to detect higher order
relationships between clusters.

Hierarchical clustering and the SOM can be combined in the Self-Organizing
Tree Algorithm (SOTA) to surmount the problems faced by these methods in
analyzing gene expression profiles and gene expression data from DNA array
experiments (Herrero et al., 2001, and Dopazo and Carazo, 1997). The
advantages of SOTA are as follows:

• the clustering obtained is proportional to the heterogeneity of the data, and
• the binary topology produces a nested structure in which nodes at each
level are averages of the items below them.

An alternative way to avoid these problems is to use Fuzzy Kohonen Neural
Networks, which combine a Kohonen network and a fuzzy c-means algorithm to
keep the advantages and overcome the shortcomings of both techniques
(Granzow et al., 2001).

The advantages of the SOM can be attributed to its ability to map
high-dimensional data onto a more comprehensible lower-dimensional space and to
its fast execution. It is potentially very useful for dealing with high
dimensionality and large-scale databases to extract information from gene
expression data. However, the effectiveness of combining it with database
queries warrants further investigation. The SOM also has limitations, namely,

1. no convergence guarantee and
2. nondeterministic results that depend on learning rates.

Related Ref:
(Adeli, 1995; Finlay and Dix 1996; Kuonen, 2003; Narayanan et al., 2002;
Negnevitsky, 2002; Nilsson, 1996; Baldi and Brunak, 2001; and Westhead et al.,
2002).
Chapter 23

Hidden Markov Model (HMM) And Bioinformatics

—Saddam Hossain
&
—Fokhruzzaman


23.1 Introduction
A very intuitive and natural question for any biologist or bioinformatician
who has a DNA or amino acid (protein) sequence in hand is what the sequence
represents. For example, is a particular DNA sequence a gene or not? Another
example would be to identify which family of proteins a given protein belongs
to. In both cases, we have a sequence of symbols from some alphabet and we are
required to say something about the structure of that sequence. There are
several techniques that can be used to model this kind of sequence problem;
among these, Markov Chains and Hidden Markov Models serve as probabilistic
models for sequences. We will concentrate on some famous biological problems to
explain Markov Chain Models and Hidden Markov Models, and their applications.
The first of these is identifying CpG islands in a DNA sequence. Let us define
the CpG islands problem first.


23.2 CpG islands


In a single strand of a DNA sequence, two nucleotides placed sequentially side
by side are called a dinucleotide. The CG pair of nucleotides is called the CG
dinucleotide. The CG dinucleotide is the least frequent dinucleotide in the
human genome. We will denote this dinucleotide by CpG to distinguish it from a
C-G pair across the two strands of the DNA, in which C bonds with G; in CpG,
the 'p' indicates that C and G are connected by a phosphodiester bond. The
reason why the CpG dinucleotide is infrequent is that the C in CpG has a
tendency to become methyl-C through a process called methylation (that is, an
H atom is replaced by a CH3 group). Methyl-C in turn has a high chance of
mutating to a T.

3' ...A C TA G ... 5'
5' ...T G AT C ... 3'

Figure 23.1: C − G pair

3' ...A CG TTA... 5'

Figure 23.2: CG dinucleotide

3' ...A CG T CG AG CG TACTGTTACTCAGTCTTAG... 5'

Figure 23.3: CpG islands

Due to the methylation process, the CpG dinucleotide is rarer than would
be expected from the independent probabilities of C and G. In the human genome,
the CG dinucleotide occurs with frequency < 1%, which makes it the least
frequent dinucleotide. However, for biologically important reasons, the
methylation process is suppressed in upstream areas around genes, and hence
these areas contain a relatively high concentration of the CpG dinucleotide.
Such regions are called CpG islands; their length varies from a few hundred to
a few thousand bases in the promoter regions of genes. CpG islands in the
promoter regions of genes play an important role in the deactivation of a copy
of the X chromosome in females, in genetic imprinting and in the deactivation
of intra-genomic parasites. According to a recent study, human chromosomes 21
and 22 contain about 1100 CpG islands. About 56% of human genes are associated
with a region of CpG islands. The presence of a CpG island can be an indication
of the start of a gene. Therefore, identifying CpG islands helps to determine
the location of genes across the DNA. Two questions arise very frequently:
(i) Given a short sequence, is it from a CpG island or not? and (ii) Given a
long sequence, does it contain a CpG island or not? We will continue our
discussion with Markov chains in an attempt to answer these questions.
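
The statement that CpG is rarer than expected from the independent probabilities of C and G can be checked on any sequence by comparing the observed frequency of the CG dinucleotide with the product of the single-nucleotide frequencies. The sketch below is a simple illustration of that comparison; the function name and the way the frequencies are normalized are assumptions made here, and the ratio is not by itself a criterion for calling a CpG island.

```python
def cpg_observed_expected(seq):
    """Ratio of the observed CG frequency to the frequency expected if C and G
    occurred independently: (count(CG) / (n - 1)) / (p(C) * p(G))."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n < 2 or c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG") / (n - 1)   # n - 1 dinucleotide positions
    expected = (c / n) * (g / n)
    return observed / expected


for s in ("CGCGCG", "CCGG", "GGCC"):
    print(s, cpg_observed_expected(s))
```

A ratio well below 1 means CG is depleted relative to what the base composition alone would predict, while a ratio near or above 1 is what one would expect inside a CpG island.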

[Chemical structures: cytosine (NH2) → Methylation → methyl-cytosine (NH2, CH3)
→ Mutation → thymine (O, CH3)]

Figure 23.4: DNA Methylation

GCCTACACAC CG CCAGTTGTGTTCCTGCTATGTCTCTAGTGATCCCTGAA
AAGTTCCAG CG TATTTTG CG AATACTCAACAGCAACATCAA CG GGCAG
CAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGGTTGACAGTA
CACTCATAGTGTTGAGGAAAGCTGA CG TTGACCTCACCAAGTGGGCAGGA
GAACTCACTGAGGATGAGATGGAA CG TGTGATGACCATTATGCAGAATCC
ATGCCAGTACAAGATCCCAGACTGGTTCTTG

Figure 23.5: A Sequence from Human Genome with CG-dinucleotides

23.3 Markov Chain


To answer the above CpG-island questions with a model, it is clear that we are
looking for the CpG dinucleotide, rather than just individual nucleotide
symbols. Modeling the dinucleotides of a sequence is therefore necessary. A
pair of sequences can be modeled using the joint probability distribution of
alignments, which is estimated from statistical data pertaining to alignments
of many sequences. Will it work if we build the dinucleotide model with a joint
probability distribution for dinucleotides? Not directly, because we need to
model pairs of symbols in the same sequence, and not a pair across two strands.
The second symbol of one pair is the first symbol of the next, and there is
clearly a dependence that needs to be captured by our dinucleotide model.

Our goal is to come up with a probabilistic model for CpG islands. Therefore,
we need to build a model that generates sequences in which the probability of
a symbol depends on its predecessor (the previous symbol). This way we will be
able to capture the notion of consecutive dinucleotides on a strand. A suitable
model that can capture this kind of dependency is a Markov Chain. A Markov
Chain is a collection of States with Transition Probabilities between the
states. In a Markov Chain, the probability of the next state depends on the
current state we are in. Let us have a formal definition of a Markov Chain
first.

Definition. A Markov Chain is a set of States, $Q = \{q_1, q_2, q_3, \ldots,
q_n\}$, with a Transition Probability Matrix $A = [a_{i,j}]$ defined for each
pair of states i and j and holding the property $\sum_j a_{i,j} = 1$ for every
state i. A Markov Chain Model is defined by (Q, A).

Let us examine the smallest possible Markov Chain with two states (drawn as
circles), Q = {1, 2}. The transition probability (drawn as a transition edge, a
directed edge) of moving from state 1 to state 2 is p, that is,
$a_{1,2} = p$, and the probability of a transition within the state itself, a
loop transition, is $a_{1,1} = 1 - p$. In the same manner, the other transition
probabilities are $a_{2,2} = 1 - p$ and $a_{2,1} = p$. The complete transition
matrix is given below, and this Markov Chain Model can be defined as
M = (Q, A).

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{pmatrix}
    = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$$

[Diagram: states 1 and 2, each with a loop labelled (1 − p), joined by directed
edges labelled p in both directions]

Figure 23.6: Markov Chain for Two States

Markov Property: In a Markov Chain the transition from one state to another
happens in discrete time steps, n = 1, 2, 3, .... If we are in state i at time
step n, we go to state j at time step n + 1 with probability $a_{i,j}$ (the
transition probability from state i to state j). In general, the state at time
n, $q_n$, could depend on all the earlier states $q_1, q_2, q_3, \ldots$. A
Markov Chain of Order 1, however, is a model in which the transition
probabilities can only "remember" one state of the history. Beyond this it is
memoryless. This "memorylessness" condition is very important; it is called the
Markov Property. It is expressed mathematically below.

$$p(q_n = j \mid q_1, q_2, q_3, \ldots, q_{m-1}, q_m = i)
  = p(q_n = j \mid q_m = i) = a_{i,j} \qquad [\text{if } m = n - 1]$$

A Markov Chain can produce a sequence through transitions from state to
state. For example, a sequence like 1221 can be obtained from the Markov Chain
described above. The corresponding transition path starts in state 1, goes to
state 2, takes a loop transition back to state 2, and then returns to state 1.
As each transition from one state to another is associated with a transition
probability, the whole path that generates a sequence of states like 1221 is
generated with some probability.
For the rest of our study, we assume that the states are represented by the
set $Q = \{q_1, q_2, q_3, \ldots, q_n\}$, and that a sequence generated from the
Markov Chain with these states and transition matrix is represented
as $X = x_1, x_2, x_3, \ldots, x_L$ for a sequence of length $L$, where each sequence element
$x_i$ ($1 \le i \le L$) is generated by some corresponding state $q_j$, $1 \le j \le n$.

Let us now look at the probability of obtaining a sequence
$x = x_1, x_2, x_3, \ldots, x_L$ from the given sequence of states $q = q_1, q_2, q_3, \ldots, q_L$ of
our Markov Chain Model. This is basically the probability of a path $x_1 \ldots x_L$
in the chain. It can be expressed as the probability of starting in state
$q_1$ (sequence symbol $x_i$ is emitted from state $q_i$) and making successive transitions
to $q_2, q_3, \ldots, q_L$. Using the Markov Property, the probability of the sequence
$x$, $p(x)$, can be calculated in the following manner.

\begin{align*}
p(x) &= p(x_1, x_2, x_3, \ldots, x_L) \\
&= p(x_L, x_{L-1}, x_{L-2}, \ldots, x_3, x_2, x_1) \\
&= p(x_L = q_L, x_{L-1} = q_{L-1}, x_{L-2} = q_{L-2}, \ldots, x_3 = q_3, x_2 = q_2, x_1 = q_1) \\
&= p(x_L = q_L \mid x_{L-1} = q_{L-1}, \ldots, x_2 = q_2, x_1 = q_1)\, p(x_{L-1} = q_{L-1}, \ldots, x_2 = q_2, x_1 = q_1) \\
&= p(x_L = q_L \mid x_{L-1} = q_{L-1})\, p(x_{L-1} = q_{L-1}, \ldots, x_2 = q_2, x_1 = q_1) && \text{(Markov Property)} \\
&= a_{q_{L-1},q_L}\, p(x_{L-1} = q_{L-1}, \ldots, x_2 = q_2, x_1 = q_1) \\
&= \ldots && \text{(applying the same rule repeatedly)} \\
&= p(x_1 = q_1)\, a_{q_1,q_2} \cdots a_{q_{L-1},q_L} \\
&= p(x_1 = q_1) \prod_{i=1}^{L-1} a_{q_i,q_{i+1}}
\end{align*}
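The product formula above can be checked numerically. The following is only a minimal sketch for the two-state chain of Figure 23.6: the value p = 0.3, the uniform start distribution and the name path_probability are assumptions made for illustration and are not part of the original text.

```python
from math import prod

# Two-state chain from the text; p = 0.3 is an assumed example value.
p = 0.3
A = {
    (1, 1): 1 - p, (1, 2): p,
    (2, 1): p,     (2, 2): 1 - p,
}

# Assumed start distribution p(x1 = q1); the text only says it must be given.
start = {1: 0.5, 2: 0.5}

def path_probability(path, start_prob, trans):
    """Return p(x1 = q1) times the product of a[q_i, q_(i+1)] along the path."""
    steps = zip(path, path[1:])
    return start_prob[path[0]] * prod(trans[(i, j)] for i, j in steps)

# Path 1 -> 2 -> 2 -> 1, i.e. the sequence 1221 from the text.
prob_1221 = path_probability([1, 2, 2, 1], start, A)
```

With these assumed numbers the path 1221 has probability 0.5 · p · (1 − p) · p.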

But what is the probability of starting in state $q_1$, $p(x_1 = q_1)$? This has
to be given, so we must have a probability distribution for the starting state.
Alternatively, we can model this by explicitly adding a start state with transition
probabilities to all other states, and always start in that special start
state. Let the start state be denoted by $q_0$ and its sequence symbol by
$x_0$; then $p(x_1 = q_1) = p(x_1 = q_1 \mid x_0 = q_0) = a_{q_0,q_1}$. So the previous equation
becomes:

\begin{align*}
p(x) &= p(x_0, x_1, x_2, x_3, \ldots, x_L) \\
&= a_{q_0,q_1}\, a_{q_1,q_2} \cdots a_{q_{L-1},q_L} \\
&= \prod_{i=0}^{L-1} a_{q_i,q_{i+1}}
\end{align*}

Similarly, we can explicitly add an end state. Although it is not required, an
end state helps us model the length of the sequences too, because it
induces a probability distribution over the length of a path in the Markov Chain.
Each state (including the start state, which allows for the empty sequence)
therefore has a transition probability to that special end state. The probability
of ending a sequence in state $q_L$ is $a_{q_L,q_{L+1}}$. The probability of the path is then:

\begin{align*}
p(x) &= p(x_0, x_1, x_2, x_3, \ldots, x_L, x_{L+1}) \\
&= a_{q_0,q_1}\, a_{q_1,q_2} \cdots a_{q_{L-1},q_L}\, a_{q_L,q_{L+1}} \\
&= \prod_{i=0}^{L} a_{q_i,q_{i+1}}
\end{align*}

Hence the probability of a path is the product of the transition probabilities
along its edges. In the following sections, we will model the Markov Chain with
these special start (begin) and end states as $q_0 = \beta$ and $q_{L+1} = \varepsilon$.

[Figure: chain of states β, q_1, ..., q_L, ε]

Figure 23.7: Complete Markov Chain Template

To answer the questions posed earlier, we first need a dinucleotide
model. We can model a sequence of dinucleotides as the Markov Chain
shown below. Two special states beyond the nucleotide states {A, T, C, G} have
been added to complete the model (as discussed above): the Begin state (β)
and the End state (ε).

[Figure: Begin state β, nucleotide states A, T, C and G, End state ε]

Figure 23.8: Markov Chain for Dinucleotides

We still need to determine the parameters of the Markov Chain, namely
the transition probabilities between the states, $A = [a_{i,j}]$. These transition
probabilities must differ between CpG-islands and non-CpG-islands.
For instance, in CpG-islands we expect to see high transition probabilities to
the nucleotides (states) C and G. We will estimate the transition probabilities from
statistical data about CpG-islands and non-CpG-islands. We will therefore
build two Markov Models, one for CpG-islands ($M^+ = \{Q^+, A^+\}$) and another
for non-CpG-islands ($M^- = \{Q^-, A^-\}$). The Markov Chains for these two models
are shown below.

[Figure: Begin state β, states A+, T+, C+ and G+, End state ε]

Figure 23.9: Markov Chain for Dinucleotides from CpG-islands

The states in these models are $Q^+ = \{A^+, T^+, C^+, G^+, \beta, \varepsilon\}$ and
$Q^- = \{A^-, T^-, C^-, G^-, \beta, \varepsilon\}$. Every state in these two models, except β and ε,
emits its corresponding nucleotide (A for $A^+$ and $A^-$, and so on) with probability 1.
Begin (β) and End (ε) emit nothing. The corresponding transition
matrices ($A^+ = [a^+_{i,j}]$ and $A^- = [a^-_{i,j}]$) are given below; this example has been
derived from a case study in which 60,000 nucleotides were considered for
the statistics.

[Figure: Begin state β, states A−, T−, C− and G−, End state ε]

Figure 23.10: Markov Chain for Dinucleotides from non − CpG-islands

   
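\[
A^{+} =
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 0.180 & 0.274 & 0.426 & 0.120 \\
C & 0.171 & 0.368 & 0.274 & 0.188 \\
G & 0.161 & 0.339 & 0.375 & 0.125 \\
T & 0.079 & 0.355 & 0.384 & 0.182
\end{array}
\qquad
A^{-} =
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 0.300 & 0.205 & 0.285 & 0.210 \\
C & 0.322 & 0.298 & 0.078 & 0.302 \\
G & 0.248 & 0.246 & 0.298 & 0.208 \\
T & 0.177 & 0.239 & 0.292 & 0.292
\end{array}
\]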

Then, given a sequence x, we compute the probability $p^+(x)$ of obtaining the
sequence from the CpG-islands Markov Chain, and the probability $p^-(x)$ of
obtaining it from the non-CpG-islands Markov Chain. These probabilities
can be determined using the following equations.
\[
p^{+}(x) = p(x \mid M^{+}) = \prod_{i=0}^{L} a^{+}_{q_i,q_{i+1}} \tag{23.1}
\]
\[
p^{-}(x) = p(x \mid M^{-}) = \prod_{i=0}^{L} a^{-}_{q_i,q_{i+1}} \tag{23.2}
\]

For illustration, let us examine the probability of obtaining the sequence
x = CGCG from both models. (Let us ignore the start and end transition
probabilities for simplicity.)

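\begin{align*}
p^{+}(CGCG) &= p(CGCG \mid M^{+}) \\
&= a^{+}_{q_0,q_1}\, a^{+}_{q_1,q_2}\, a^{+}_{q_2,q_3}\, a^{+}_{q_3,q_4}\, a^{+}_{q_4,q_5} \\
&= a^{+}_{q_1,q_2}\, a^{+}_{q_2,q_3}\, a^{+}_{q_3,q_4} && \text{(ignoring the start and end transitions)} \\
&= a^{+}_{CG}\, a^{+}_{GC}\, a^{+}_{CG} \\
&= 0.274 \times 0.339 \times 0.274 \\
&= 0.025450764
\end{align*}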

Similarly, for the non-CpG-islands Markov Chain:

\begin{align*}
p^{-}(CGCG) &= p(CGCG \mid M^{-}) \\
&= 0.078 \times 0.246 \times 0.078 \\
&= 0.001496664
\end{align*}
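The two products above can be reproduced with a few lines of Python. This is only a sketch: the tables are copied from the matrices $A^+$ and $A^-$ given earlier, the start and end transitions are ignored as in the worked example, and the names NUCS, A_PLUS, A_MINUS and chain_probability are chosen here for illustration.

```python
NUCS = "ACGT"  # column order of the transition tables

# Rows: current nucleotide; columns: next nucleotide in the order A, C, G, T.
A_PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
A_MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

def chain_probability(seq, table):
    """Multiply the transition probabilities between consecutive nucleotides."""
    prob = 1.0
    for cur, nxt in zip(seq, seq[1:]):
        prob *= table[cur][NUCS.index(nxt)]
    return prob

p_plus = chain_probability("CGCG", A_PLUS)
p_minus = chain_probability("CGCG", A_MINUS)
```

For longer sequences the product of many small numbers quickly becomes tiny, so in practice one would probably sum logarithms instead of multiplying probabilities.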

Clearly, the probability of the sequence CGCG is higher under the
CpG-islands model than under the non-CpG-islands model. We can also use the
log-odds ratio $\log \frac{p(x \mid M^{+})}{p(x \mid M^{-})}$ to determine whether x comes from a CpG-island or not.

If $\log \frac{p(x \mid M^{+})}{p(x \mid M^{-})} > 0$, then x comes from a CpG-island.

The above strategy answers our first question, Question i): given a short
sequence, is it from a CpG-island or not? But what about the second question,
Question ii): given a long sequence, does it contain a CpG-island or not? The
dual Markov Chain model that we have developed above can also be used to find
CpG-islands in a long sequence of nucleotides. How can this be done? It can be
done by taking windows of small size, say 100 nucleotides, along the long
sequence. For each window (which is a short sequence), the log-odds ratio is
calculated as described above. We can then identify the windows with a positive
log-odds ratio and merge intersecting windows to determine which parts of the long
sequence are CpG-islands.
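The window-based strategy can be sketched as follows. This is a minimal illustration, not a tuned implementation: the tables are the $A^+$ and $A^-$ matrices given earlier, start and end transitions are ignored, and the function names log_odds and cpg_regions, the step parameter and the half-open (start, end) interval convention are assumptions of this sketch.

```python
import math

NUCS = "ACGT"  # column order of the transition tables
A_PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
A_MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

def log_odds(seq):
    """log(p(x|M+) / p(x|M-)), computed as a sum of per-transition log ratios."""
    score = 0.0
    for cur, nxt in zip(seq, seq[1:]):
        j = NUCS.index(nxt)
        score += math.log(A_PLUS[cur][j] / A_MINUS[cur][j])
    return score

def cpg_regions(seq, window=100, step=1):
    """Merge windows with positive log-odds into half-open (start, end) regions."""
    regions = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        end = min(start + window, len(seq))
        if log_odds(seq[start:end]) > 0:
            if regions and start <= regions[-1][1]:
                # This window touches or overlaps the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions

# Toy sequence: an artificial CG-rich stretch flanked by AT repeats.
toy = "AT" * 20 + "CG" * 20 + "AT" * 20
regions = cpg_regions(toy, window=10)
```

On the toy sequence the reported region covers the CG-rich stretch plus some flanking positions, which already hints at the sensitivity to window size.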

The disadvantage of this sliding-window approach is that CpG-islands
tend to have variable length, so a window of size 100 might not be appropriate
for judging them. If the window size is too small, we tend to treat every
occurrence of a CG dinucleotide as an island by itself. If the window size is
too large, we cannot achieve enough discrimination.

A better way is to incorporate both the CpG-island and non-CpG-island models
into one model. This acts as a single Markov Model consisting of both the
(+) and (−) chains, with new transition probabilities between the two
chains. This combined model is shown below. The advantage of this model
is its trainability: the transition probabilities between the two sub-chains can
be estimated from known annotated sequences with all their transitions
between CpG-islands and non-CpG-islands. In this way we remove the dependency
on a particular window size.

But there is still a problem with this model: it does not
establish a one-to-one correspondence between the states and the symbols of the
sequence. For instance, the symbol C can be generated by both states $C^+$ and
$C^-$. Hence, a sequence no longer corresponds to a single path in the model,
but to multiple paths. In other words, a sequence $X = \{x_1, x_2, x_3, \ldots, x_L\}$ does
not uniquely determine the path in the model. The states are Hidden in the
sense that the sequence itself does not reveal how it was generated. Therefore,
we need to develop a slightly different theory for this new model, called the
Hidden Markov Model.

[Figure: Begin state β, states A+, T+, C+, G+ and A−, T−, C−, G−, End state ε]

Figure 23.11: Combined Model for CpG & non − CpG islands

23.4 Hidden Markov Models (HMM)


HMMs are just a little more complicated than the usual statistical models.
Statistical models deal with random processes, and a random process can be
represented by a stochastic finite state machine with emitting states. An HMM is
a very useful way of representing information; it was originally developed for
speech processing. The HMM is based on the Markov Chain. In an HMM it is assumed
that the observations are ordered in a sequence. In bioinformatics, HMMs are used in
various sequence problems, such as representing MSAs (given an MSA, how
do we represent it with an HMM?).

HMM Parameters:
• Transition probabilities
• Emission probabilities

HMM Estimation: HMM estimation can be called training of the HMM; it
falls under machine learning. An HMM architecture (given in advance) is
fed a set of observation sequences for training. The training process
iteratively alters the model's parameters to fit the training set.

HMM Usage:
• Evaluate the probability of an observation sequence given the model (Forward)
• Find the most likely path through the model for a given observation sequence (Viterbi)
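The two usages listed above can be illustrated on a very small model. The sketch below assumes a hypothetical two-state HMM, with a "+" state that favours C and G and a "-" background state; all numbers in START, TRANS and EMIT are made up for illustration and are not estimated from data. The functions forward and viterbi follow the standard dynamic-programming recurrences without begin or end states.

```python
# Hypothetical two-state HMM; every probability below is an assumed toy value.
STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1},
         "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(obs):
    """Probability of the observations, summed over all hidden state paths."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for sym in obs[1:]:
        f = {s: EMIT[s][sym] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())

def viterbi(obs):
    """Most likely hidden state path and its joint probability with obs."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    back = []  # one back-pointer table per step after the first
    for sym in obs[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            ptr[s] = best
            nxt[s] = v[best] * TRANS[best][s] * EMIT[s][sym]
        back.append(ptr)
        v = nxt
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]

path, p_best = viterbi("CGCG")
```

Both functions multiply raw probabilities, which is fine for short toy sequences; for long sequences a log-space or scaled version would be needed to avoid numerical underflow.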

23.5 HMM and Pair wise Sequence Alignment


23.6 Profile HMM
Protein Family Characterization
Profile HMM for Protein Family Characterization
Profile HMM and Homology Search
23.7 HMM and Multiple Sequence Alignment
23.8 Advantages and Disadvantages using HMM
HMM Advantages:

• Statisticians are comfortable with the theory behind hidden Markov models

• Freedom to manipulate the training and verification processes

• Mathematical / theoretical analysis of the results and processes

• HMMs are still very powerful modeling tools - far more powerful than
many statistical methods

• HMMs can be combined into larger HMMs

• Assuming an architecture with a good design

• People can read the model and make sense of it

• The model itself can help increase understanding

• Incorporate prior knowledge into the architecture

• Initialize the model close to something believed to be correct

• Use prior knowledge to constrain training process

HMM Disadvantages:

• States are supposed to be independent

P(x) → ... → P(y)

• P(y) must be independent of P(x), and vice versa; this usually isn't true
• Can get around it when relationships are local
• Not good for RNA folding problems
• Model may not converge to a truly optimal parameter set for a given
training set
• HMM is only as good as the training set
• More training is not always good; it can cause over-fitting
• Still slow in comparison to other methods

23.9 Other Applications of HMM in Bioinformatics
23.10 Gene Finding using HMM
The objective of Gene Finding is to find the coding and non-coding regions of
an unlabeled string of DNA nucleotides. The motivation behind gene finding is
to assist in the annotation of genomic data produced by genome sequencing
methods and to gain insight into the mechanisms involved in transcription,
splicing and other processes.

Gene Finding Terminologies: A string of DNA nucleotides containing a
gene has separate regions (lines): Introns, the non-coding regions within a
gene, and Exons, the coding regions. They are separated by functional sites (boxes).
There are also start and stop codons, and splice sites (acceptors and donors).

Gene Finding Challenges:


• The correct reading frame needs to be identified
• Introns can interrupt an exon in mid-codon
• There is no hard and fast rule for identifying donor and acceptor splice
sites
• Signals are very weak
Chapter 24

Genetic Programming And Bioinformatics

—Saddam Hossain

24.1 Genetic Programming


Genetic Programming (GP) is a well-known and robust search technique for
optimization and approximation problems. The method is inspired by the
Darwinian paradigm (1859) of natural evolution and its principle of selection:
a population passes through a selection process, a competition in which the
fittest survive ("Survival of the Fittest"), and the surviving population
carries on to the next generations through reproduction. This cycle goes on.
Analogously, simply said, in genetic programming
the solution to a problem is evolved. In this chapter the name Genetic
Programming (GP) is used interchangeably with Genetic Algorithm (GA).

[Figure: Cycle of Selection]

Genetic Algorithms are a part of Evolutionary Computing. The idea of
evolutionary computing was introduced in the 1960s by I. Rechenberg, and it
was then developed further by other researchers. The genetic algorithm is one
result of this research; it was developed by John Holland and his students and
colleagues in 1975.

24.2 Analogy of Genetic Programming to Biology
Genetic algorithms are search procedures based on the mechanics of genetics and
natural selection. There is an inherent analogy between the natural evolutionary
process of individuals and the genetic algorithm. An Individual in genetic
programming is characterized by a set of parameters, which can be thought
of as Genes. The genes are joined into a string to form a Chromosome. The
chromosome forms the Genotype. The genotype contains all the information needed to
construct an organism, the Phenotype. Reproduction is a “dumb” process acting on
the chromosome of the genotype, whereas fitness is measured in the real world
(“struggle for life”) of the phenotype.

24.3 Steps of Genetic Programming


The genetic programming flow starts with the creation of an Initial Population
and computes the best approximate solution by repeating Evaluation of Fitness,
Survivors Selection and Reproduction until Convergence is reached.

Population Initialization: A genetic algorithm performs a heuristic search over
the total search space, which contains every feasible candidate solution of a
particular problem. Each candidate solution is called an Individual, and a
collection of individuals forms a Population. Each individual is represented
using some Coding or Encoding Mechanism. Initially, a random population is
generated with some probabilistic function, which could be as simple as the toss
of a coin, computer generated, or produced by some other more complex
probabilistic means. This population is called the initial population.

Evaluation of Fitness: Each individual of a population is scored by
a Fitness Function, which evaluates its quality as a candidate solution.
Higher fitness means a better solution. If the score of an individual satisfies
the optimality criterion, it is the optimal or approximate solution to the
problem, and the genetic programming algorithm ends with that individual as its
result. If not, reproduction is carried out to generate more individuals, in the
expectation of obtaining better individuals in the population.

[Figure: Steps of GP]

Survivors Selection: Usually two individuals are selected from the population
based on their fitness; they are the parents that reproduce offspring for a new
generation. It is assumed that fitter individuals have a better chance of
producing fitter offspring. The new generation has the same size as the old
generation; the old generation dies and the new generation takes its place.
Each iteration of the loop is called a Generation.

Reproduction: An offspring has a combination of the properties of its two
parents. Reproduction is done through the cross-over operation, and
mutation is then applied to diversify the population further. Reproduction
mechanisms have no knowledge of the problem to be solved. The only links between
the genetic algorithm and the problem are the Coding of individuals and the
Fitness Function.

Convergence: If the Fitness Function is well designed, the population will
converge towards an optimal solution and yield an individual with the best or
optimal score. This individual is the output result for the corresponding problem.

24.4 Basic Genetic Algorithm


The steps of genetic programming can be structured into a systematic algorithm,
as illustrated below. A genetic algorithm can often converge to good solutions
after examining only a small fraction of the search space.

Algorithm 3 GeneticAlgorithm
[Start] Generate a random initial population, Generation G0.
[Fitness] Evaluate the fitness of each individual in the population.
[New Population] Create a new population by repeating the following steps
until the new population is complete.
    [Selection] Select two parents according to their fitness.
    [Crossover] With a crossover probability, cross over the parents to form
    a new offspring.
    [Mutation] With a mutation probability, mutate the new offspring.
    [Accepting] Place the new offspring in the new population.
[New Generation] Use the newly generated population as Generation Gn for a
further run of the algorithm.
[Solution Test]
if a solution (individual) is found to be optimal then
    Go to [End]
end if
Loop back to [Fitness]
[End]
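The outline of Algorithm 3 can be turned into a short Python sketch. The version below is one possible reading of the algorithm, not a reference implementation: it assumes binary chromosomes, fitness-proportionate selection with non-negative fitness values, single point crossover, and one offspring per selected pair. For convenience the solution test is done right after the fitness evaluation. The function name genetic_algorithm and all default parameter values are assumptions of this sketch.

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, p_cross=0.7,
                      p_mut=0.01, max_generations=200, target=None, rng=None):
    """Evolve bit-list chromosomes of the given length (length >= 2)."""
    rng = rng or random.Random()
    # [Start] random initial population, Generation G0
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(max_generations):
        # [Fitness] evaluate every individual
        scores = [fitness(ind) for ind in population]
        best = population[scores.index(max(scores))]
        # [Solution Test] stop when an individual reaches the target score
        if target is not None and max(scores) >= target:
            return best
        # [New Population]
        new_population = []
        while len(new_population) < pop_size:
            # [Selection] fitness-proportionate; +1 avoids all-zero weights
            parent_a, parent_b = rng.choices(
                population, weights=[s + 1 for s in scores], k=2)
            child = parent_a[:]
            # [Crossover] single point crossover with probability p_cross
            if rng.random() < p_cross:
                cut = rng.randrange(1, length)
                child = parent_a[:cut] + parent_b[cut:]
            # [Mutation] flip each bit with probability p_mut
            child = [1 - g if rng.random() < p_mut else g for g in child]
            # [Accepting]
            new_population.append(child)
        # [New Generation]
        population = new_population
    return max(population, key=fitness)

# Toy problem (assumed for illustration): maximise the number of 1 bits.
result = genetic_algorithm(sum, length=12, target=12, rng=random.Random(0))
```

Since the search is randomised, the returned individual is only the best one found within max_generations; it is not guaranteed to be optimal.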

This is the most basic algorithmic outline for genetic programming, but
many of its steps can be implemented differently for different problems.
A few points in the algorithm are important enough to discuss
separately. The first is how the chromosome or individual is
created, in other words what type of encoding is used to represent individuals.
In addition, there are two basic operators in genetic programming. These and some
other points are discussed in the next section.

24.5 Some Points on Genetic Programming


When formulating a genetic programming solution, several major points are to
be noted, following the steps of genetic programming discussed
earlier. These are Encoding, Fitness Evaluation, Reproduction, Crossover,
Mutation and Survivor Selection.

24.5.1 Encoding
The chromosome should in some way contain information about the solution
it represents. Encoding is the process or method of representing a solution
or decision variable (chromosome or individual) containing that information for
genetic programming. The decision variables of a problem can be encoded in
various ways, all of which are finite-length strings. The usual encoding
mechanisms are Binary Encoding, Permutation Encoding, Value Encoding, Tree
Encoding, etc. Although the choice of encoding depends heavily on the problem,
Binary Encoding is the most widely used.

Binary Encoding: Binary encoding is the most common, mainly because the
first works on genetic programming used this type of encoding. In binary
encoding every chromosome is a string of bits, 0 or 1. Binary encoding gives
many possible chromosomes. On the other hand, this encoding is often not
natural for many problems, and sometimes corrections must be made after crossover
and/or mutation.
ChromosomeA: 10100101001
ChromosomeB: 01110010100

Example of Binary Encoding two chromosomes or individuals

Figure 24.1: Binary Encoding

Permutation Encoding: In permutation encoding, every chromosome is a
string of numbers, each of which represents a position in a sequence.
Permutation encoding can be used in ordering problems, such as the travelling
salesman problem or the task ordering problem.

ChromosomeA: 1452376
ChromosomeB: 6253417

Example of Permutation Encoding two chromosomes or individuals

Figure 24.2: Permutation Encoding

Value Encoding: In value encoding, every chromosome is a string of values.
Values can be anything connected to the corresponding problem. They can
be real or integer numbers, characters (strings of characters), or some other
complicated/compound objects.

ChromosomeA: 1.2 2.4 3.6


ChromosomeB: ABDCUGYH
ChromosomeC: ATCGTGCA
ChromosomeD: (black)(white)(black)(red)

Example of Value Encoding for different chromosomes or individuals:
ChromosomeA is encoded as a string of real numbers, ChromosomeB as a string of
English letters, ChromosomeC as a string of nucleotides and ChromosomeD as a
string of colors.

Figure 24.3: Value Encoding

Tree Encoding: Tree encoding represents the chromosome as a tree
of the objects related to the problem. Tree encoding is good for
evolving programs.

ATG

ATC ACG
Example of Tree encoding.

Figure 24.4: Tree encoding

24.5.2 Crossover
A percentage of the population is selected for breeding and assigned random
mates. This random mating and the generation of new offspring are done through
crossover. Crossover is one of the two basic operators of genetic programming.
After the encoding has been decided, a crossover method can be chosen. Crossover
selects genes from the parent chromosomes and creates new offspring. The simplest
way of doing this is to choose one or more crossover points at random. Depending
on this choice, there are several types of crossover, discussed below.

Single Point Crossover: Single Point Crossover selects one crossover point.
String from beginning of chromosome to the crossover point is copied from one
parent, the rest is copied from the second parent.

ATCG|CG + CTAG|TT = ATCGTT, CTAGCG

Single Point Crossover

Figure 24.5: Single Point Crossover

Multiple Point Crossover: In this case, two or more crossover points are
selected. Genes before and after the corresponding crossover points
are exchanged to produce new offspring.

AT|CG|CG + CT|AG|TT = ATAGCG, CTCGTT

Multiple Point (Two Point) Crossover

Figure 24.6: Multiple Point Crossover
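The single point and two point examples of Figures 24.5 and 24.6 can be reproduced with string slicing. This is a small sketch; the function names and the convention that a cut position counts the genes to its left are assumptions made here.

```python
def single_point_crossover(a, b, cut):
    """Swap the tails of two equal-length chromosomes after position cut."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point_crossover(a, b, cut1, cut2):
    """Swap the middle segment between positions cut1 and cut2."""
    return (a[:cut1] + b[cut1:cut2] + a[cut2:],
            b[:cut1] + a[cut1:cut2] + b[cut2:])

# Parents from the figures: ATCG|CG + CTAG|TT and AT|CG|CG + CT|AG|TT.
single = single_point_crossover("ATCGCG", "CTAGTT", 4)
double = two_point_crossover("ATCGCG", "CTAGTT", 2, 4)
```

In a genetic algorithm the cut positions would normally be drawn at random rather than fixed as they are here.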

Uniform Crossover: In uniform crossover, genes are copied randomly
(according to some probability function) from the first or from the second
parent to produce the offspring.
ATCGCG + CTAGTT = ATAGCT, CTCGTG

Uniform Crossover

Figure 24.7: Uniform Crossover

Arithmetic Crossover: If some arithmetic or logical operations are done
to compute the offspring, this process is called arithmetic crossover.

10001 + 11011 = 10001 (AND-Operation)

Arithmetic Crossover using logical operation AND

Figure 24.8: Arithmetic Crossover
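Uniform and arithmetic crossover can be sketched in the same style. The swap probability p_swap = 0.5 and the function names are assumptions of this sketch; and_crossover implements only the bitwise AND case shown in Figure 24.8.

```python
import random

def uniform_crossover(a, b, rng=None, p_swap=0.5):
    """At each position, copy the genes straight across or swapped."""
    rng = rng or random.Random()
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < p_swap:
            x, y = y, x  # take this gene from the other parent
        child1.append(x)
        child2.append(y)
    return "".join(child1), "".join(child2)

def and_crossover(a, b):
    """Arithmetic crossover using the logical AND of two bit strings."""
    return "".join("1" if x == y == "1" else "0" for x, y in zip(a, b))

c1, c2 = uniform_crossover("ATCGCG", "CTAGTT", random.Random(1))
anded = and_crossover("10001", "11011")
```

Because uniform crossover is random, the two children differ from run to run unless a seeded random generator is passed in, as is done above.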



24.5.3 Mutation
Mutation is the second of the two basic operators of genetic programming.
After a crossover is performed, mutation takes place. Its purpose
is to prevent all solutions in the population from falling into a local optimum
of the problem being solved. Mutation randomly changes the new offspring. The
form of mutation depends on the encoding. For binary encoding we can flip a few
randomly chosen bits from 1 to 0 or from 0 to 1. In a string of nucleotides, it
may change a randomly chosen A into C.
ACTGGTCA → CCTGGTCC

Mutation

Figure 24.9: Mutation
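Mutation for the two encodings mentioned above might look as follows. This is a sketch under the assumption that each gene mutates independently with probability p_mut; the function names are made up for illustration.

```python
import random

def mutate_bits(chrom, p_mut, rng=None):
    """Flip each bit of a binary string independently with probability p_mut."""
    rng = rng or random.Random()
    flip = {"0": "1", "1": "0"}
    return "".join(flip[g] if rng.random() < p_mut else g for g in chrom)

def mutate_nucleotides(chrom, p_mut, rng=None):
    """Replace each nucleotide by a different one with probability p_mut."""
    rng = rng or random.Random()
    out = []
    for g in chrom:
        if rng.random() < p_mut:
            g = rng.choice([n for n in "ACGT" if n != g])
        out.append(g)
    return "".join(out)

mutated = mutate_nucleotides("ACTGGTCA", 0.25, random.Random(0))
```

A small p_mut keeps most of the offspring intact, while a large value makes the search behave more like random sampling.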

24.6 Parameters of Genetic Programming


24.7 Genetic Algorithm Performance
There are a number of factors which affect the performance of a genetic algo-
rithm.

• The size of the population

• The cross-over probability

• The mutation probability

• Defining convergence

• Local optimisation

24.8 Genetic Algorithm for Sequence Alignment


Preliminaries
Let us assume we have a hypothetical DNA sequence V in hand, and we
want to find one or more DNA sequences homologous to it. We also have
a database of a vast number of known DNA sequences, among which we would
like to find the homologous ones. The primary operation is to
find the sequence that aligns best with V. So, simply, the query sequence V is
aligned with each given sequence U from the database, and the
best pairwise alignment is taken. A pairwise sequence alignment between
the given sequence U and the query sequence V therefore needs to be computed
first. There are several methods and algorithms for sequence alignment. In the
following section we will see how a genetic algorithm can do sequence alignment.

The search for alignment (similarity) in (DNA/RNA/protein) sequence
analysis is central to bioinformatics. In the sequence alignment problem a pair
of sequences is given: one is the known or given sequence and the other is the
query sequence. A method for scoring a candidate alignment is also given. The
task is to determine the correspondences between the entire sequences, or
between subsequences of them, such that the alignment/similarity score is maximized.

Let us denote one of the known sequences from the database by U, of length m,
and the input query sequence by V, of length n. The pairwise alignment
problem attempts to answer the question: ”How similar are the two sequences
U and V?”. A candidate solution (an alignment) can be represented by
a matrix, called the pairwise alignment matrix. A pairwise alignment matrix for
sequences U and V is a 2-row matrix constructed with (m + n) columns. Each
column contains one character each from U and V, or one character and one gap (−);
that is, a column cannot be composed entirely of gaps, except in the rightmost
positions. In the pairwise alignment matrix the original sequences may be
augmented by gaps inserted into them; let us denote these augmented sequences by
$U^a$ and $V^a$, of equal length (m + n). The alignment matrix is thus the matrix P
of dimension $2 \times (m + n)$ with rows $U^a$ and $V^a$.

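\[
U = u_1, u_2, u_3, \ldots, u_m \quad [\text{where } u_i \in \{A, T, C, G\} \text{ for a DNA sequence}]
\]
\[
V = v_1, v_2, v_3, \ldots, v_n \quad [\text{where } v_i \in \{A, T, C, G\}]
\]
\[
U^a = u^a_1, u^a_2, u^a_3, \ldots, u^a_l \quad [\text{where } u^a_i \in \{A, T, C, G, -\}]
\]
\[
V^a = v^a_1, v^a_2, v^a_3, \ldots, v^a_l \quad [\text{where } v^a_i \in \{A, T, C, G, -\}]
\]
\[
P = \begin{pmatrix} U^a \\ V^a \end{pmatrix}
  = \begin{pmatrix} u^a_1 & u^a_2 & u^a_3 & \ldots & u^a_l \\ v^a_1 & v^a_2 & v^a_3 & \ldots & v^a_l \end{pmatrix}
  \quad [\text{where } 1 \le l \le (m + n)] \tag{24.1}
\]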

Each pairwise alignment matrix is a candidate solution for the alignment
problem. Let us assume Ω is the set of all possible pairwise alignment matrices for
a particular pair of augmented sequences ($U^a$, $V^a$). There may be around $5^l$
possible alignment matrices, so the cardinality of Ω is of the order of
$5^l$. This is obviously a very large number even for short sequences.

Ω = all possible pairwise alignments between $U^a$ and $V^a$

The cardinality of Ω = |Ω|

\[
|\Omega| = \sum_{k=0}^{l} \binom{l+k}{k} \binom{l}{k}
\approx (3 + 2\sqrt{2})^{l} \ [\text{for very large } l]
\cong 5^{l} \tag{24.2}
\]
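To get a feeling for how quickly this number grows, the sum in equation 24.2 can be evaluated directly for small l. The helper name num_alignments below is an assumption of this sketch; math.comb requires Python 3.8 or later.

```python
from math import comb, sqrt

def num_alignments(l):
    """Evaluate the sum over k of C(l + k, k) * C(l, k) from equation 24.2."""
    return sum(comb(l + k, k) * comb(l, k) for k in range(l + 1))

# First few values of the sum.
small_values = [num_alignments(l) for l in range(1, 6)]

# Growth factor between successive values for a fairly large l,
# to be compared with 3 + 2 * sqrt(2).
ratio = num_alignments(201) / num_alignments(200)
limit = 3 + 2 * sqrt(2)
```

Even for l = 5 the sum already exceeds a thousand, and each additional column multiplies it by a factor of between 5 and 6, which is why exhaustive search over Ω is not practical.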

It is necessary to evaluate the quality of an alignment matrix. This is
done by a scoring function, which measures the quality of the alignment in
each column and then sums the column scores over the whole matrix. The scoring
function for column i of alignment matrix P is described below.


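\[
f(P_i) = \max
\begin{cases}
w_1(U_i^a, V_i^a), & \text{if } U_i^a = V_i^a \text{ and } U_i^a \neq \text{”−”},\ V_i^a \neq \text{”−”}; \\
w_2(U_i^a, V_i^a), & \text{if } U_i^a \neq V_i^a \text{ and } U_i^a \neq \text{”−”},\ V_i^a \neq \text{”−”}; \\
w_3(U_i^a, -), & \text{if } V_i^a = \text{”−”} \text{ and } U_i^a \neq \text{”−”}; \\
w_4(-, V_i^a), & \text{if } U_i^a = \text{”−”} \text{ and } V_i^a \neq \text{”−”}; \\
f_2, & \text{if } U_i^a = V_i^a = \text{”−”};
\end{cases}
\tag{24.3}
\]
[$w_1$, $w_2$, $w_3$, $w_4$ and $f_2$ are alignment score matrices or functions]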

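For simplicity, let us assume the scoring function below.

\[
f(P_i) = \max
\begin{cases}
1, & \text{if } U_i^a = V_i^a \text{ and } U_i^a \neq \text{”−”},\ V_i^a \neq \text{”−”}; \\
-1, & \text{if } U_i^a \neq V_i^a \text{ and } U_i^a \neq \text{”−”},\ V_i^a \neq \text{”−”}; \\
-2, & \text{if } V_i^a = \text{”−”} \text{ and } U_i^a \neq \text{”−”}; \\
-2, & \text{if } U_i^a = \text{”−”} \text{ and } V_i^a \neq \text{”−”}; \\
-3, & \text{if } U_i^a = V_i^a = \text{”−”};
\end{cases}
\tag{24.4}
\]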

So the fitness function for the pairwise alignment matrix will be

\[
f(P) = \sum_{i=1}^{l} f(P_i) \tag{24.5}
\]
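The simplified scoring function of equation 24.4 and the fitness of equation 24.5 translate almost directly into code. In this sketch each row of the alignment matrix is a string and the gap is written as the ASCII character "-"; the names GAP, column_score and alignment_fitness are assumptions made here.

```python
GAP = "-"

def column_score(u, v):
    """Score one column according to the simplified scheme of equation 24.4."""
    if u == GAP and v == GAP:
        return -3          # column made of two gaps
    if u == GAP or v == GAP:
        return -2          # one character aligned with a gap
    return 1 if u == v else -1   # match or mismatch

def alignment_fitness(row_u, row_v):
    """Equation 24.5: sum of the column scores of a 2-row alignment matrix."""
    if len(row_u) != len(row_v):
        raise ValueError("both rows must have the same length")
    return sum(column_score(u, v) for u, v in zip(row_u, row_v))

# U = AATTCCGG and V = ATCG, left-aligned and padded with gaps to m + n = 12 columns.
score_padded = alignment_fitness("AATTCCGG----", "ATCG--------")
```

For this left-aligned pair there is 1 match, 3 mismatches, 4 single-gap columns and 4 double-gap columns, so the fitness is 1 − 3 − 8 − 12 = −22.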

Pairwise Sequence Alignment


In the context of genetic programming, solutions for the pairwise alignment are
represented as chromosomes or individuals, and the fitness function is
given by equation 24.5. The genetic algorithm starts with a set of randomly
selected chromosomes as the initial population, representing a set of possible
pairwise alignment solutions. In a genetic algorithm, the variables of a problem
are represented as genes in a chromosome, and chromosomes are evaluated according
to their fitness function values. Two genetic operations, crossover and
mutation, create new chromosomes called offspring by altering the composition
of genes. The selection operation creates the populations from generation to
generation; chromosomes with better fitness values have higher probabilities
of being selected for the next generation. After several generations, the
algorithm will hopefully converge to the best solution.

Initial Population: The random creation of the initial population of pairwise
alignment matrices starts from the individual $P_0^0$, a $2 \times (m + n)$ matrix
filled with the coding alphabet {A, T, C, G} in the leftmost columns and with the
gap symbol ”−” in the rightmost columns. As an example, take the given sequence
U = AATTCCGG, m = len(U) = 8, and the query sequence V = ATCG, n = len(V) = 4.

\[
P_0^0 = \begin{pmatrix}
A & A & T & T & C & C & G & G & - & - & - & - \\
A & T & C & G & - & - & - & - & - & - & - & -
\end{pmatrix}
\quad [\text{initial individual}]
\]

This is the first individual of generation 0, $P_0^0$. The rest of the initial
population can be created by using some probability function to distribute the
gaps over each row. The result is the initial population of generation 0, $P_i^0$.
 
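\[
P_1^0 = \begin{pmatrix}
A & A & T & - & C & C & - & G & - & T & - & G \\
A & - & - & G & - & - & T & - & - & C & - & -
\end{pmatrix}
\]
\[
P_2^0 = \begin{pmatrix}
A & - & T & - & C & - & A & G & - & T & C & G \\
A & - & - & - & - & - & T & G & - & C & - & -
\end{pmatrix}
\]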
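One way to generate such an initial population is sketched below. It assumes that the gaps are spread uniformly at random over each row while the letters of U and V keep their original order, so that removing the gaps gives back the sequences; the text itself does not prescribe a particular probability function. The names random_gapped_row and initial_population are chosen for illustration, and columns made of two gaps are not specially handled here.

```python
import random

GAP = "-"

def random_gapped_row(seq, width, rng):
    """Spread the letters of seq over width columns, keeping their order."""
    positions = sorted(rng.sample(range(width), len(seq)))
    row = [GAP] * width
    for pos, letter in zip(positions, seq):
        row[pos] = letter
    return "".join(row)

def initial_population(u, v, size, rng=None):
    """Return a list of (row_u, row_v) pairs, each of width len(u) + len(v)."""
    rng = rng or random.Random()
    width = len(u) + len(v)
    # First individual: both sequences left-aligned, gaps on the right.
    population = [(u.ljust(width, GAP), v.ljust(width, GAP))]
    while len(population) < size:
        population.append((random_gapped_row(u, width, rng),
                           random_gapped_row(v, width, rng)))
    return population

population = initial_population("AATTCCGG", "ATCG", 5, random.Random(0))
```

Each individual can then be scored with the fitness function of equation 24.5.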

Fitness Evaluation: The fitness function was chosen earlier; it is
equation 24.5. The fitness scores of the initial population are measured. The best
individuals are those with the highest scores.

Selection: At any time during the genetic programming process, the solutions
that give the best scores are selected and kept in the population, while the
poor ones are rejected so as to keep the population size within some specific
range. This constitutes the current generation of the population (solutions).

Recombination or Crossing Over: Two individuals (solutions) are selected
from the current generation. There are different possible crossing over procedures;
let us discuss one simple crossing over mechanism, in which an exchange
between corresponding rows is made. In this process, two individuals ($P_1^g$, $P_2^g$)
are selected as parents from generation g for crossing over. Let $P_1^g$ consist of
rows $U_1^g$ and $V_1^g$, and $P_2^g$ of rows $U_2^g$ and $V_2^g$. Crossing over is done
through a random exchange between $U_1^g$ and $U_2^g$ (respectively $V_1^g$ and $V_2^g$).
The new offspring are then added to the population.
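The row exchange described above can be sketched as follows. Each individual is taken to be a pair (row_u, row_v) of gapped strings, and each pair of corresponding rows is swapped with probability 0.5; both the representation and the probability p_swap are assumptions of this sketch, since the text only speaks of a random exchange.

```python
import random

def row_exchange_crossover(parent1, parent2, rng=None, p_swap=0.5):
    """Randomly exchange the U rows and, independently, the V rows."""
    rng = rng or random.Random()
    (u1, v1), (u2, v2) = parent1, parent2
    if rng.random() < p_swap:
        u1, u2 = u2, u1   # exchange the rows of the given sequence U
    if rng.random() < p_swap:
        v1, v2 = v2, v1   # exchange the rows of the query sequence V
    return (u1, v1), (u2, v2)

p1 = ("AAT-TCCGG---", "A-TC-G------")
p2 = ("A-ATTCC-GG--", "AT--CG------")
child1, child2 = row_exchange_crossover(p1, p2, random.Random(0))
```

Because whole rows are exchanged, every offspring still contains U and V in their original order once the gaps are removed.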

[Figure to be drawn]

Figure 24.10: Crossing over

[Figure to be drawn]

Figure 24.11: Mutation

Mutation: Mutation operators introduce more diversity into the population by changing (mutating) one or more genes of an individual chromosome. Because the query sequence V must be kept as it is, mutation operators can act only on the gaps of the augmented sequence. As a result, at any time during the process, removing all gaps yields the query sequence directly and completely. The possible operations are: opening a new gap, closing an existing gap, extending a gap in size, and reducing a gap in size. These operations can be grouped into three mutation operators: i) the block mutation operator, which opens a new gap, ii) gap extension, which introduces extra gaps randomly, and iii) gap reduction, which randomly removes gaps.
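The three operators can be sketched in Python as follows. This is a minimal sketch under assumptions of my own: all function names are hypothetical, only the query row V is mutated, and the row width is kept fixed at m + n by trading each inserted gap against one trailing padding gap (and returning a removed gap to the end of the row). Residues are never touched, so stripping the gaps always recovers V.

```python
import random

GAP = "-"

def insert_gap(row, pos):
    """Insert a gap before column `pos`, dropping one trailing gap to keep the width."""
    if not row.endswith(GAP):
        return row  # no padding left to absorb a new gap
    return row[:pos] + GAP + row[pos:-1]

def remove_gap(row, pos):
    """Remove the gap at column `pos` and append it to the end as padding."""
    if row[pos] != GAP:
        return row
    return row[:pos] + row[pos + 1:] + GAP

def internal_gaps(row):
    """Columns holding a gap that lies before the last residue."""
    last_residue = max(i for i, c in enumerate(row) if c != GAP)
    return [i for i in range(last_residue) if row[i] == GAP]

def open_gap(row, rng):
    """Block mutation: open a new gap between two adjacent residues."""
    spots = [i for i in range(1, len(row)) if row[i - 1] != GAP and row[i] != GAP]
    return insert_gap(row, rng.choice(spots)) if spots else row

def extend_gap(row, rng):
    """Gap extension: lengthen a randomly chosen existing gap by one column."""
    spots = internal_gaps(row)
    return insert_gap(row, rng.choice(spots)) if spots else row

def reduce_gap(row, rng):
    """Gap reduction: shorten a randomly chosen existing gap by one column."""
    spots = internal_gaps(row)
    return remove_gap(row, rng.choice(spots)) if spots else row

def mutate(individual, rng):
    """Apply one randomly chosen operator to the query row only."""
    u, v = individual
    operator = rng.choice([open_gap, extend_gap, reduce_gap])
    return [u, operator(v, rng)]

rng = random.Random(1)
individual = ["AATTCCGG----", "A-T-C-G-----"]
for _ in range(5):
    individual = mutate(individual, rng)
    print(individual[1])
```

Closing a gap corresponds to applying `reduce_gap` until the gap disappears, which is why the four operations in the text collapse into three operators here.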

24.9 Applications of Genetic Algorithm


Genetic algorithms have a very wide range of applications.

24.10 Simulated Annealing


Another optimization technique is Simulated Annealing.

Start → Initial Alignment, $P_0^0$ → Generation of Initial Population, $P_r^0$ → Selection → Crossing Over → Mutation → Fitness Evaluation → New Population → Solution → End

Figure 24.12: Sequence Alignment Using Genetic Algorithm (under construction)
Part V

Bioinformatics Tools

Chapter 25

Python - Primer Programming Language for Bioinformatics

—Zohirul Alam Tiemoon

Chapter 26

Python And Bioinformatics

—Zohirul Alam Tiemoon

Chapter 27

Tools and Libraries for Bioinformatics

—Zohirul Alam Tiemoon


1. FASTA
2. BLAST, PSI-BLAST

Part VI

Bioinformatics : Current & Future

Chapter 28

Prominent Research Areas in Bioinformatics

—Fokhruzzaman

Chapter 29

Endless Horizon of Bioinformatics: Future Directions

—Fokhruzzaman

Chapter 30

The Crazy Corner with ALL WILD Imaginations

—Fokhruzzaman

Thought of Mind
Snippets of Thought: NOT sure whether this will become another chapter or not, but I want to propose one area to put ALL our imaginations, like "the Crazy Corner with ALL Wild Imaginations". Say, while working on the Intro chapter, I was reading: "Protein is also a necessary component in our diet because animals cannot synthesize all the amino acids and must obtain essential amino acids from food. Through the process of digestion, animals break down ingested protein into free amino acids that can be used for protein synthesis..."

I assume the Plant kingdom (am I correct here...?) can synthesize all the amino acids themselves, so they don't need external food (as such) like animals. So maybe a wild imagination could be: we discover some Gene Mutation Technique (or Protein Synthesis Technique using our dynamic DNA parts...) so that we, humans, can live without external food. I imagine our Great Saints "knew" this Gene Mutation (??) or "Self-Amino-Acids-Synthesis-Techniques" like Plants long back ... ;-) We just need to re-discover that for mere mortals like us ... ;-) The Global Food Industry will NOT love my Crazy Idea here ... ;-)

Study random walks in streets and random driving / traffic behaviors using the Human Genome ...??

Like Labony's idea of discovering Gold & Diamond Mountains in outer space through Nano Technology ...?? Labony, please elaborate on this; frankly, I do not remember the full idea now ...

Appendix A

Bioinformatics Terminologies

Bioinformatics: An interdisciplinary study of Life Science and Computer Science.

Appendix B

Amino Acid Lists

Amino Acid 1 Amino Acid 1 Amino Acid 1 Amino Acid 1

Appendix C

Book Layout

• Section I: Introduction...

– Chapter 01: Introduction to Bioinformatics: —Fokhruzzaman
– Chapter 02: Introduction to Cell Biology: —Farjana Khatun
– Chapter 03: Introduction to Genetics and Genomics: —Farjana Khatun
– Chapter 04: Introduction to Proteomics: —Farjana Khatun
– Chapter 05: Some Bioinformatics Model Organisms: —Farjana Khatun
– Chapter 06: Computing Fundamentals for Bioinformatics: —Zohirul Alam Tiemoon
– Chapter 07: Math Primer for Bioinformatics: —Zohirul Alam Tiemoon
– Chapter 08: Biological Processes, Experimental Methods & Machinery: —Farjana Khatun

• Section II: Introduction to Bioinformatics Problems

– Chapter 09: DNA and Protein Sequencing: —Saddam Hossain
– Chapter 10: Genome Mapping: —Saddam Hossain
– Chapter 11: Sequences Alignment: —Saddam Hossain
– Chapter 12: Gene Prediction: —Saddam Hossain
– Chapter 13: Genome Analysis: —Saddam Hossain
– Chapter 14: Phylogenetic Analysis: —Saddam Hossain
– Chapter 15: Protein Folding: —Saddam Hossain
– Chapter 16: Structural Bioinformatics and Drug Discovery: —Saddam Hossain

• Section III: Introduction to Bioinformatics Computations


– Chapter 17: Statistical and Probabilistic Methods in Bioinformatics: —Saddam Hossain
– Chapter 18: Computational Methods in Bioinformatics: —Saddam Hossain
– Chapter 19: Bioinformatics Data Mining: —Md. Towhidul Islam
– Chapter 20: Some Algorithms in Bioinformatics: —Zohirul Alam Tiemoon

• Section IV: Part-Some Widely Used Methods and Models in Bioinformatics

– Chapter 21: Dynamic Programming And Bioinformatics: —Saddam Hossain
– Chapter 22: Neural Network And Bioinformatics: —Md. Towhidul Islam
– Chapter 23: Hidden Markov Model (HMM) And Bioinformatics: —Saddam Hossain, Fokhruzzaman
– Chapter 24: Genetic Programming And Bioinformatics: —Saddam Hossain

• Section V: Bioinformatics Tools

– Chapter 25: Python - Primer Programming Language for Bioinformatics: —Zohirul Alam Tiemoon
– Chapter 26: Python And Bioinformatics: —Zohirul Alam Tiemoon
– Chapter 27: Tools and Libraries for Bioinformatics: —Zohirul Alam Tiemoon

• Section VI: Bioinformatics : Current & Future

– Chapter 28: Prominent Research Areas in Bioinformatics: —Fokhruzzaman
– Chapter 29: Endless Horizon of Bioinformatics: Future Directions: —Fokhruzzaman
– Chapter 30: The Crazy Corner with ALL WILD Imaginations: —Fokhruzzaman
Bibliography

[1] Bryan Bergeron, M.D. Bioinformatics Computing. Prentice Hall of India Private Limited, New Delhi - 110 001, 2003.

Index

DNA Landmarks, 70
DNA Map, 70

Restriction Mapping, 70
