
Computational Methods for Bioinformatics

in Python 3.4

Jason M. Kinser, D.Sc.


George Mason University
© January 19, 2017
Copyright

Text and images (excepting those attributed to other sources) copyright © 2016 (1st
edition), 2017 (2nd edition) by Jason M. Kinser
Front cover copyright ©2017 by Jason M. Kinser

This document is intended for educational use and may not be freely distributed in
written or electronic form without the express, written consent of the author.
Python scripts are provided as an educational tool. They are offered without guarantee
of effectiveness or accuracy. Python scripts composed by the author may not be used
for commercial purposes without the author's explicit written permission.
Feedback
This is an active document in that it will be updated as the sciences, algorithms, and
Python scripting methods change. The author appreciates kind notes that inform him
of needed alterations and detected errors. Please send comments, suggestions, and
error reports to: [email protected]

Versions
Version 1.0 September 1, 2016
Version 2.0 January 20, 2017

Dedication
This book is dedicated to
Dr. Wallace A. Hilton
and
Dr. Charles D. Geilker
both of whom encouraged young scientists to write.

Contents

Contents i

Preface 1

I Computing in Office Software 3

1 Mathematics Review 5
1.1 Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Power Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Calculator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Quadratic Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Trigonometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2 Triangles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1 Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.2 Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.3 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.4 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Scientific Writing 19

2.1 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.2 Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.3 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.4 Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Word Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 MS Word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2.1 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2.2 Title . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2.3 Headings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2.4 Cross References . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2.5 Figures and Captions . . . . . . . . . . . . . . . . . . . . . 26
2.2.2.6 Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2.7 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2.8 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2.9 Final Comments . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 LibreOffice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4.1 Google Docs . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4.2 AbiWord . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.4.3 Zoho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.4.4 WPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 Computing with a Spreadsheet 33


3.1 Creating Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Cell Referencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Copying Formulas with References . . . . . . . . . . . . . . . . . . . 35
3.2.2 Absolute Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Cell Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Introduction to Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3.1 The Sum Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Statistical Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.3 Comparison Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Creating Basic Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Function Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Trendline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.2 Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4 Gene Expression Arrays: Excel 53


4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Comparing Multiple Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

II Python Scripting Language 67

5 Python Installation 69
5.1 Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.2 Mac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.3 UNIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Setting up a Directory Structure . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Online Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6 Python Data and Computations 73


6.1 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Numerical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2.1 Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2.2 Simple Computations . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2.3 Algebraic Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.4 The Math Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.3 Python Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3.1 Tuple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3.2 List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.3 Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.4 Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.5 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4.1 String Definition and Slicing . . . . . . . . . . . . . . . . . . . . . . 86
6.4.1.1 Special Characters . . . . . . . . . . . . . . . . . . . . . . . 87
6.4.1.2 Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4.1.3 Repeating Characters . . . . . . . . . . . . . . . . . . . . . 87
6.4.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.2.1 Replacing Characters . . . . . . . . . . . . . . . . . . . . . 89
6.4.2.2 Replacing Characters with a Table . . . . . . . . . . . . . . 90
6.5 Converting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.6 Example: Romeo and Juliet . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

7 Python Logic Control 95


7.1 The if Command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.1.1 The else Command . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.1.2 Complex Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.1.3 The elif Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2 Iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2.1 The while Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2.2 The for Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2.3 break and continue . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.2.4 The enumerate Function . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3.1 Example: The Average of Random Numbers . . . . . . . . . . . . . 102
7.3.2 Example: Text Search . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.3.3 Example: Sliding Block . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.3.4 Example: Compute π . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.3.5 Example: Summation Equations . . . . . . . . . . . . . . . . . . . . 108
7.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

8 Input and Output 111


8.1 Reading a File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2 Storing Data in a File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.3 Moving the Position in the File . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.4 Pickle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.5.1 Sliding Window in DNA . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.5.2 Example: Reading a Spreadsheet . . . . . . . . . . . . . . . . . . . . 116

9 Python and Excel 121


9.1 Text Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
9.2 The csv Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.3 xlrd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.4 Openpyxl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

10 Reading a Binary File 129


10.1 A Brief Overview of a Sequencer . . . . . . . . . . . . . . . . . . . . . . . . 129
10.2 Hexadecimal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
10.3 The ABI File Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
10.3.1 ABI Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
10.3.2 Extracting the Records . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.3.2.1 The Base Calls . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.3.2.2 The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.3.3 Cohesive Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

11 Python Arrays 141


11.1 Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.2 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

11.3 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.4 Mathematics and Some Functions . . . . . . . . . . . . . . . . . . . . . . . . 144
11.5 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.6 Example: Extract Random Numbers Above a Threshold . . . . . . . . . . . 150
11.7 Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.8 Example: Simultaneous Equations . . . . . . . . . . . . . . . . . . . . . . . 155

12 Python Functions and Modules 159


12.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.1.1 Basic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.1.2 Local and Global Variables . . . . . . . . . . . . . . . . . . . . . . . 160
12.1.3 Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.1.4 Default Argument . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.1.5 Help Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.1.6 Return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.1.7 Designing a Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
12.2 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
12.3 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

13 Object Oriented Programming 173


13.1 Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
13.2 Basic Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
13.2.1 Class with a Function . . . . . . . . . . . . . . . . . . . . . . . . . . 174
13.2.2 Self . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.2.3 Global and Local Variables . . . . . . . . . . . . . . . . . . . . . . . 176
13.2.4 Operator Overloading . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.2.5 Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
13.2.6 Actively Adding a Variable . . . . . . . . . . . . . . . . . . . . . . . 180

14 Random Numbers 183


14.1 Simple Random Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
14.2 Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

14.3 Gaussian Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.3.1 Gaussian Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.3.2 Gaussian Distributions in Excel . . . . . . . . . . . . . . . . . . . . . 187
14.3.3 Histogram in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.3.4 Random Gaussian Numbers . . . . . . . . . . . . . . . . . . . . . . . 190
14.4 Multivariate Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
14.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
14.5.1 Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
14.5.2 Cards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
14.5.3 Random DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

15 Gene Expression Arrays: Python 199


15.1 Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
15.2 A Single File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
15.3 Multiple Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

III Computational Applications 207

16 DNA as Data 209


16.1 DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
16.2 Application: Checking Genes . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.2.1 Reading the DNA File . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.2.2 Reading the Bounds File . . . . . . . . . . . . . . . . . . . . . . . . . 214
16.2.3 Examining the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

17 Application in GC Content 219


17.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
17.2 Python Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
17.3 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
17.3.1 Non-Coding Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
17.3.2 Coding Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
17.3.3 Preceding Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

17.3.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

18 DNA File Formats 227


18.1 FASTA Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
18.2 Genbank Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
18.2.1 File Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
18.2.2 Parsing the DNA String . . . . . . . . . . . . . . . . . . . . . . . . . 230
18.2.3 Keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
18.2.4 Extracting Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
18.2.5 Coding DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
18.2.6 Extracting Translations . . . . . . . . . . . . . . . . . . . . . . . . . 234
18.3 ASN.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
18.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

19 Principal Component Analysis 241


19.1 The Purpose of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
19.2 Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
19.2.1 Introduction to the Covariance Matrix . . . . . . . . . . . . . . . . . 242
19.2.2 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
19.3 Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
19.4 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 247
19.4.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
19.4.2 Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
19.4.3 Python Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
19.4.4 Distance Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
19.4.5 Organization in PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
19.4.6 RGB Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
19.5 Describing Systems with Eigenvectors . . . . . . . . . . . . . . . . . . . . . 260
19.6 First Order Nature of PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
19.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

20 Codon Frequencies in Genomes 265

20.1 Codon Frequency Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
20.1.1 Codon Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
20.1.2 Codon Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
20.1.3 Codon Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
20.1.4 Frequencies of a Genome . . . . . . . . . . . . . . . . . . . . . . . . 267
20.2 Genome Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
20.2.1 Single Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
20.2.2 Two Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
20.3 Comparing Multiple Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . 271

21 Sequence Alignment 273


21.1 Simple Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
21.1.1 An Alphabet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
21.1.2 Considerations of Matching Sequences . . . . . . . . . . . . . . . . . 274
21.1.3 Insertions and Deletions . . . . . . . . . . . . . . . . . . . . . . . . . 274
21.1.3.1 Rearrangements . . . . . . . . . . . . . . . . . . . . . . . . 275
21.1.3.2 Sequence Length . . . . . . . . . . . . . . . . . . . . . . . . 275
21.1.4 Simple Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
21.1.4.1 Direct Alignment . . . . . . . . . . . . . . . . . . . . . . . 276
21.2 Statistical Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
21.2.1 Substitution Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
21.2.2 Accessing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
21.3 Brute Force Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
21.4 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
21.4.1 The Scoring Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
21.4.2 The Arrow Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
21.4.3 The Initial Program . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
21.4.4 The Backtrace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
21.4.5 Speed Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 288
21.5 Global and Local Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . 293
21.6 Gap Penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

21.7 Optimality in Dynamic Programming . . . . . . . . . . . . . . . . . . . . . 296
21.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

22 Simulated Annealing 301


22.1 Input to Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
22.2 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
22.3 A Perpendicular Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
22.4 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
22.5 Meaningful Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
22.6 Energy Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
22.7 Text Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
22.7.1 Swapping Letters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
22.7.2 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
22.7.3 Consensus String . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

23 Genetic Algorithms 317


23.1 Energy Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
23.2 The Genetic Algorithm Approach . . . . . . . . . . . . . . . . . . . . . . . . 318
23.3 A Numerical GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
23.3.1 Initializing the Genes . . . . . . . . . . . . . . . . . . . . . . . . . . 319
23.3.2 The Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
23.3.3 The Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
23.3.4 Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
23.3.5 Running the GA Algorithm . . . . . . . . . . . . . . . . . . . . . . . 322
23.4 Non-Numerical GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
23.4.1 Manipulating the Strings . . . . . . . . . . . . . . . . . . . . . . . . 324
23.4.2 The Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
23.4.3 The Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
23.4.4 Mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
23.4.5 Running the GA for Text Data . . . . . . . . . . . . . . . . . . . . . 329
23.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

24 Multiple Sequence Alignment 331
24.1 Multiple Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
24.2 The Greedy Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
24.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
24.2.2 Theory of the Assembly . . . . . . . . . . . . . . . . . . . . . . . . . 334
24.2.3 An Intricate Example . . . . . . . . . . . . . . . . . . . . . . . . . . 334
24.2.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
24.2.3.2 Pairwise Alignments . . . . . . . . . . . . . . . . . . . . . . 335
24.2.3.3 Initial Contigs . . . . . . . . . . . . . . . . . . . . . . . . . 337
24.2.3.4 Adding to a Contig . . . . . . . . . . . . . . . . . . . . . . 339
24.2.3.5 Joining Contigs . . . . . . . . . . . . . . . . . . . . . . . . 341
24.2.3.6 The Assembly . . . . . . . . . . . . . . . . . . . . . . . . . 343
24.3 The Non-Greedy Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
24.3.1 Creating Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
24.3.2 Steps in the Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . 348
24.3.3 The Test Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
24.3.4 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
24.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

25 Trees 355
25.1 Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
25.2 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
25.3 Linked Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
25.4 Binary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
25.5 UPGMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
25.6 Non-Binary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
25.7 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
25.7.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
25.7.2 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
25.7.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
25.7.2.2 Scoring a Parameter . . . . . . . . . . . . . . . . . . . . . . 375

25.7.2.3 A Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
25.7.2.4 The Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
25.7.2.5 A Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380

26 Clustering 383
26.1 Purpose of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
26.2 k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
26.3 More Difficult Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
26.3.1 New Coordinate System . . . . . . . . . . . . . . . . . . . . . . . . . 393
26.3.2 Modification of k-means . . . . . . . . . . . . . . . . . . . . . . . . . 394
26.4 Dynamic k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
26.5 Comments on k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
26.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403

27 Text Mining 405


27.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
27.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
27.3 Creating Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
27.4 Methods of Finding Root Words . . . . . . . . . . . . . . . . . . . . . . . . 408
27.4.1 Porter Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
27.4.2 Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
27.5 Document Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
27.5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
27.5.2 Word Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
27.5.3 Indicative Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
27.5.4 Document Classification . . . . . . . . . . . . . . . . . . . . . . . . . 417
27.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417

IV Database 419

28 Spreadsheet and Databases 421


28.1 The Movie Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422

28.2 The Query List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
28.3 Answering the Queries in a Spreadsheet . . . . . . . . . . . . . . . . . . . . 426

29 Common Database Interfaces 437


29.1 Differences to a Spreadsheet . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
29.2 Tables Required . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
29.3 Common Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
29.3.1 Microsoft Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
29.3.2 LibreOffice Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
29.3.3 MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
29.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451

30 Fundamental Commands 453


30.1 Loading Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
30.1.1 Establishing a Database . . . . . . . . . . . . . . . . . . . . . . . . . 453
30.1.2 Creating a Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
30.1.3 Loading Data into a Table . . . . . . . . . . . . . . . . . . . . . . . . 455
30.2 Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.3 Privileges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.4 The Simple Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
30.4.1 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
30.4.1.1 Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
30.4.1.2 Decimals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
30.4.1.3 Floating Point . . . . . . . . . . . . . . . . . . . . . . . . . 459
30.4.1.4 Bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
30.4.2 Default Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
30.4.3 Dates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
30.4.4 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
30.4.5 Enumeration and Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 461
30.4.6 Spatial Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
30.5 Conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
30.6 Mathematics in MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462

30.6.1 Math Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
30.6.2 Math Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
30.6.3 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
30.6.4 Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
30.6.5 Aggregate Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
30.6.6 Sample Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
30.7 String Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
30.8 Limits and Sorts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
30.9 Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
30.10 Time and Date . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
30.11 Casting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.12 Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.12.1 CASE-WHEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
30.12.2 The IF Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
30.12.3 The IFNULL Statement . . . . . . . . . . . . . . . . . . . . . . . . . 480
30.12.4 Natural Language Comparisons . . . . . . . . . . . . . . . . . . . . . 480
30.13 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483

31 Queries with Multiple Tables 485


31.1 Schema and Linking Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
31.1.1 Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
31.1.2 Linking Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
31.1.3 Combined with Functions . . . . . . . . . . . . . . . . . . . . . . . . 487
31.1.4 Using a Table Multiple Times . . . . . . . . . . . . . . . . . . . . . . 487
31.2 Joining Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
31.2.1 Left Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
31.2.2 Right Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
31.2.3 Other Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
31.2.4 Functional Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . 500
31.3 Subqueries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
31.4 Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501

31.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504

32 Connecting Python with MySQL 505


32.1 Connecting Python with MySQL . . . . . . . . . . . . . . . . . . . . . . . . 505
32.1.1 Making the Connection . . . . . . . . . . . . . . . . . . . . . . . . . 505
32.1.2 Queries from Python . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
32.1.3 Altering the Database . . . . . . . . . . . . . . . . . . . . . . . . . . 507
32.1.4 Multiple Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
32.2 The Kevin Bacon Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
32.3 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513

List of Figures

1.1 MS-Windows calculator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


1.2 The graph of a second order polynomial. . . . . . . . . . . . . . . . . . . . . 7
1.3 The graph of a second order polynomial with two inputs. . . . . . . . . . . 8
1.4 Linear dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 A triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 A non-right triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7 Equal triangles within the enclosing rectangle. . . . . . . . . . . . . . . . . . 10
1.8 A circle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.9 A cube. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.10 A cylinder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.11 Coordinates of a data point. . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.12 A right angle triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.13 A triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.14 A vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.15 Adding two vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.16 Subtracting vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1 A delightful experiment with soda and a mint. . . . . . . . . . . . . . . . . 20

3.1 A simple calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.2 Referencing the contents of a cell. . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Referencing the contents of multiple cells. . . . . . . . . . . . . . . . . . . . 35
3.4 Cell references change as a formula is copied. . . . . . . . . . . . . . . . . . 35
3.5 A poor way of creating several similar computations. . . . . . . . . . . . . . 36

3.6 A better way of creating several similar computations. . . . . . . . . . . . . 37
3.7 All formulas in column C reference cell B1. . . . . . . . . . . . . . . . . . . 37
3.8 Changing the name of a cell. . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.9 Using the named cell in referenced computations. . . . . . . . . . . . . . . . 39
3.10 Computing the sum of a set of values. . . . . . . . . . . . . . . . . . . . . . 40
3.11 Computing the average and standard deviation. . . . . . . . . . . . . . . . . 40
3.12 Constructing an IF statement. . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.13 Copying the formula to cells in column C. . . . . . . . . . . . . . . . . . . . 42
3.14 Using the COUNTIF function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.15 Creating a line graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.16 Creating a scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.17 Altering the x axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.18 Accessing the Trendline tool. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.19 Trendline interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.20 Perfect fit trendline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.21 Trendline shown with noisy data. . . . . . . . . . . . . . . . . . . . . . . . . 47
3.22 Raw data which is a noisy bell curve. . . . . . . . . . . . . . . . . . . . . . . 48
3.23 The spreadsheet architecture for Solver. . . . . . . . . . . . . . . . . . . . . 49
3.24 The Solver interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.25 Plots of the original data and the Solver estimate. . . . . . . . . . . . . . . 50

4.1 A small portion of the detected data. . . . . . . . . . . . . . . . . . . . . . . 54


4.2 The pertinent data is copied to a new sheet. . . . . . . . . . . . . . . . . . . 55
4.3 The subtraction of the background. . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 The R vs G plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 The R/G vs I data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 The R/G vs I plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.7 The M vs A data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8 The M vs A plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.9 Sorted data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.10 Sorted data with the average removed. . . . . . . . . . . . . . . . . . . . . . 60

4.11 Plot of the data with the average removed. . . . . . . . . . . . . . . . . . . 61
4.12 A partial view of data from all of the files after LOESS normalization. . . . 62
4.13 The average and standard deviation of the first three files. . . . . . . . . . . 62
4.14 The data after the average is subtracted. . . . . . . . . . . . . . . . . . . . . 62
4.15 The data after division by the standard deviation. . . . . . . . . . . . . . . 63
4.16 Data available to answer the male-only question. . . . . . . . . . . . . . . . 63
4.17 Comparing the values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.18 Accessing conditional formatting. . . . . . . . . . . . . . . . . . . . . . . . . 64
4.19 Changing the format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.20 Partial results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.1 The top working directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


5.2 The working subdirectory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7.1 The sliding box problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


7.2 A circle inscribed in a square. . . . . . . . . . . . . . . . . . . . . . . . . . . 106

8.1 Data in a spreadsheet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

9.1 Two pop up dialogs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122


9.2 Parts of a large spreadsheet. . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.3 The portion of the spreadsheet at the beginning of the raw data. . . . . . . 123

10.1 One channel from one lane. . . . . . . . . . . . . . . . . . . . . . . . . . . . 130


10.2 All four channels in a small segment. . . . . . . . . . . . . . . . . . . . . . . 130
10.3 The same signal after deconvolution. . . . . . . . . . . . . . . . . . . . . . . 131
10.4 The beginning of the hexdump. . . . . . . . . . . . . . . . . . . . . . . . . . 133
10.5 The hexdump including the location 01 F7 B3. . . . . . . . . . . . . . . . . 135

11.1 Isolating the pixels about the face. . . . . . . . . . . . . . . . . . . . . . . . 152


11.2 The electric circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

12.1 A help balloon. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164


12.2 A help balloon. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

12.3 Directory structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
12.4 Creating a new file in IDLE. . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
12.5 Contents of a module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
12.6 Changing the contents of a module. . . . . . . . . . . . . . . . . . . . . . . . 170

14.1 Histogram of random numbers. . . . . . . . . . . . . . . . . . . . . . . . . . 184


14.2 A repeating function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
14.3 The auto-correlation of zero-sum random numbers. . . . . . . . . . . . . . . 185
14.4 The auto-correlation of a repeating sequence. . . . . . . . . . . . . . . . . . 186
14.5 The Gaussian distribution.[Kernler, 2014] . . . . . . . . . . . . . . . . . . . 187
14.6 The Gaussian distribution in Excel. . . . . . . . . . . . . . . . . . . . . . . . 188
14.7 The Gaussian distribution in Excel. . . . . . . . . . . . . . . . . . . . . . . . 188
14.8 The Gaussian distribution in Excel. . . . . . . . . . . . . . . . . . . . . . . . 188
14.9 Selecting Data Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
14.10 The popup menu. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
14.11 The results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
14.12 The plot of the results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.13 The help balloon. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.14 A Gaussian distribution in 2D.[Gnu, 2016] . . . . . . . . . . . . . . . . . . 192
14.15 A Gaussian distribution in 2D.[Bscan, 2013] . . . . . . . . . . . . . . . . . 192
14.16 Histogram of rolling 2 dice. . . . . . . . . . . . . . . . . . . . . . . . . . . 195

16.1 A simple depiction of a cell with a nucleus, cytoplasm, nuclear DNA and
mitochondrial DNA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
16.2 A caricature of the double helix nature of DNA. . . . . . . . . . . . . . . . 210
16.3 The ribosome travels along the DNA using codon information to create a
chain of amino acids. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
16.4 Codon to Amino Acid Conversion . . . . . . . . . . . . . . . . . . . . . . . . 212
16.5 Spliced segments of the DNA are used to create a single protein. . . . . . . 213

17.1 A sliding window with a width of 8 and a step of 4. . . . . . . . . . . . . . . 220


17.2 Gaussian distributions of the three cases. . . . . . . . . . . . . . . . . . . . 225

18.1 FASTA file example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
18.2 Genbank file example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
18.3 Genbank file example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
18.4 Information on an individual gene. . . . . . . . . . . . . . . . . . . . . . . . 231
18.5 Indications of complements and joins. . . . . . . . . . . . . . . . . . . . . . 233

19.1 Rotating data to remove one of the dimensions. . . . . . . . . . . . . . . . . 242


19.2 A small data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
19.3 Linking data in two columns. . . . . . . . . . . . . . . . . . . . . . . . . . . 243
19.4 Pictorial representation of the covariance matrix with white pixels repre-
senting the largest values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
19.5 Two views of the data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
19.6 First principal component. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
19.7 Second and third principal components. . . . . . . . . . . . . . . . . . . . . 248
19.8 The first 20 eigenvalues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
19.9 Image D72 from the Brodatz image set. . . . . . . . . . . . . . . . . . . . . 252
19.10 The points projected into R2 space. . . . . . . . . . . . . . . . . . . . . . 253
19.11 The scrambled image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
19.12 Reconstruction using only 7 dimensions. . . . . . . . . . . . . . . . . . . . 256
19.13 Reconstruction using only 2 dimensions. . . . . . . . . . . . . . . . . . . . 257
19.14 An input image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
19.15 An attempt at pixel isolation. . . . . . . . . . . . . . . . . . . . . . . . . 258
19.16 Displaying the original data. The green points are those denoted in Figure
19.15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
19.17 First two axes in PCA space. . . . . . . . . . . . . . . . . . . . . . . . . . 259
19.18 Points isolated from a simple threshold after mapping the data to PCA space. 260
19.19 The values of the variables in the system. . . . . . . . . . . . . . . . . . . 261
19.20 The evolution of the system. . . . . . . . . . . . . . . . . . . . . . . . . . 261
19.21 A system caught in a limit cycle. . . . . . . . . . . . . . . . . . . . . . . . 262
19.22 Sensitivity analysis of the data. . . . . . . . . . . . . . . . . . . . . . . . . 263
19.23 PCA map of face pose images. . . . . . . . . . . . . . . . . . . . . . . . . 264

20.1 The statistics for an entire genome. . . . . . . . . . . . . . . . . . . . . . . . 269
20.2 The statistics for the first 20 codons for two genomes. . . . . . . . . . . . . 270
20.3 PCA mapping for several bacterial genomes. . . . . . . . . . . . . . . . . . . 271

21.1 The first column and row are filled in. . . . . . . . . . . . . . . . . . . . . . 284
21.2 The S1,1 cell is filled in. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
21.3 The lines indicate which elements are computed in a single Python command. 291
21.4 A pictorial view of the scoring matrix. Darker pixels relate to higher values
in the matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298

22.1 The costs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304


22.2 An energy surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

23.1 A simple view of an energy surface. . . . . . . . . . . . . . . . . . . . . . . 318


23.2 Two parents are spliced to create two children. . . . . . . . . . . . . . . . . 320

24.1 Aligning sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331


24.2 Aligning sequences with strong and weak overlaps. . . . . . . . . . . . . . . 332
24.3 Aligning sequences for a consensus. . . . . . . . . . . . . . . . . . . . . . . . 348

25.1 The dictionary tree for the four words. . . . . . . . . . . . . . . . . . . . . . 356


25.2 A linked list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
25.3 A linked list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
25.4 A linked list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
25.5 A binary tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
25.6 A tree for sorting with incorrect positions of V1 and V4. . . . . . . . . . . . 360
25.7 A tree for sorting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
25.8 The affected nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
25.9 Removal of the first node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
25.10 Replacing a hole. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
25.11 Replacing a hole completion. . . . . . . . . . . . . . . . . . . . . . . . . . 363
25.12 The process of the second node. . . . . . . . . . . . . . . . . . . . . . . . 363
25.13 The first pairing in the UPGMA. . . . . . . . . . . . . . . . . . . . . . . . 364
25.14 The second iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366

25.15 The third iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
25.16 The third iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
25.17 The tree for the results in Code 25.9. . . . . . . . . . . . . . . . . . . . . 367
25.18 A nonbinary tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
25.19 Data distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
25.20 A decision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
25.21 Closer to reality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
25.22 Distribution of people for three variables. . . . . . . . . . . . . . . . . . . 372
25.23 A decision node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
25.24 A decision tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

26.1 Sorted scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385


26.2 The Swiss roll data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
26.3 Clustering after k-means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
26.4 Clustering after converting data to radial polar coordinates. . . . . . . . . . 395
26.5 Clustering after modifying the k-means algorithm. . . . . . . . . . . . . . . 397
26.6 Five clusters data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
26.7 New clusters after splitting and combining. . . . . . . . . . . . . . . . . . . 402
26.8 Clusters after running Code 26.22. . . . . . . . . . . . . . . . . . . . . . . . 403

27.1 A simple suffix tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410

28.1 The movie data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423


28.2 The actor data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
28.3 The connection between actors and movies. . . . . . . . . . . . . . . . . . . 424
28.4 The filter popup dialog. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
28.5 The filter results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
28.6 Using the advanced features of the filter to remove duplicates. . . . . . . . . 428
28.7 The length of a string in a cell. . . . . . . . . . . . . . . . . . . . . . . . . . 429
28.8 Finding individuals with two parts to their first names. Each of the Value
fields contains a single blank space. . . . . . . . . . . . . . . . . . . . . . . . 430
28.9 Finding individuals with the same initials. . . . . . . . . . . . . . . . . . . . 430
28.10Sorting on two criteria. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431

28.11 A portion of the window that shows the average for each year. . . . . . . . 432
28.12 The movies of aid = 281. . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
28.13 The logic flow for obtaining the name of a movie from two actors. . . . . . 434
28.14 Finding the common elements in two lists. . . . . . . . . . . . . . . . . . . 434
28.15 Counting the number of movies for each actor. . . . . . . . . . . . . . . . 435

29.1 The movie database schema. . . . . . . . . . . . . . . . . . . . . . . . . . . 439


29.2 The opening selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
29.3 Importing from Excel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
29.4 Importing choices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
29.5 Selecting the data type. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
29.6 Starting the query process. . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
29.7 Converting to the MySQL command view. . . . . . . . . . . . . . . . . . . 444
29.8 Entering the MySQL command. . . . . . . . . . . . . . . . . . . . . . . . . . 444
29.9 The initial dialog. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
29.10 The Copy Table dialog. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
29.11 Selecting data fields. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
29.12 Setting the data types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
29.13 Setting the primary key. . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
29.14 The main dialog. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
29.15 Copying the data to a spreadsheet. . . . . . . . . . . . . . . . . . . . . . . 448

31.1 The movies schema. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486


31.2 A query involving two tables. . . . . . . . . . . . . . . . . . . . . . . . . . . 488
31.3 Actors in movies with two named actors. . . . . . . . . . . . . . . . . . . . . 490
31.4 Inner join. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
31.5 Left join. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
31.6 Right join. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
31.7 Outer join. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
31.8 Other joins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500

List of Tables

2.1 A table with random data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


2.2 My Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6.1 Math functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

7.1 Values of the variables during each iteration . . . . . . . . . . . . . . . . . . 109

10.1 Hexadecimal Values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132


10.2 ABI Record. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

13.1 Operators than can be overloaded. . . . . . . . . . . . . . . . . . . . . . . . 178

18.1 Binary representation of nucleotides. . . . . . . . . . . . . . . . . . . . . . . 237


18.2 Binary to hexadecimal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

21.1 The BLOSUM50 matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277


21.2 Possible alignments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
21.3 Shifts for each iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

22.1 Simple Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

30.1 Integer Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459


30.2 Date and time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
30.3 Converting data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
30.4 Math operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
30.5 Math functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
30.6 Other operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464

30.7 Aggregate functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
30.8 Pattern matching string operators. . . . . . . . . . . . . . . . . . . . . . . . 469
30.9 Informative string operators. . . . . . . . . . . . . . . . . . . . . . . . . . . 469
30.10 Informative string operators . . . . . . . . . . . . . . . . . . . . . . . . . 470
30.11 Substring operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
30.12 Capitalization operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
30.13 Alteration operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
30.14 Miscellaneous operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
30.15 Casting Operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.16 Decision operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479

Python Codes

2.1 Minimal content. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


2.2 Inclusion of packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Making a title. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Making headings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Referencing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Inserting a figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Inserting an equation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 Creating a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.9 A bibliography entry. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.10 Creating the bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.11 Creating the citation reference. . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Commands used in figure 3.23. . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 OSx commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Alternative OSx commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1 Variable assignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Simple math. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Exponential notation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.4 Complex values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.5 Type conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6 Rounding error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.7 Integer division. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.8 Algebraic hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.9 Algebraic functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.10 Trigonometric functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.11 Exponential functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.12 A tuple. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.13 Accessing elements in a tuple. . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.14 Accessing the last elements in a tuple. . . . . . . . . . . . . . . . . . . . . . 81
6.15 Accessing consecutive elements in a tuple. . . . . . . . . . . . . . . . . . . . 82
6.16 Accessing consecutive elements at the end of a tuple. . . . . . . . . . . . . . 82
6.17 A list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.18 Changing an element in a tuple. . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.19 Appending an element to a list. . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.20 A dictionary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.21 Accessing data in a dictionary. . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.22 Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.23 Length of a collection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.24 Slicing examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.25 More slicing examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.26 Accessing a collection inside of a collection. . . . . . . . . . . . . . . . . . . 85
6.27 Insertion into a list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.28 The pop function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.29 The remove function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.30 Creating a string. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.31 Simple slicing in strings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.32 Special characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.33 Concatenation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.34 Repeating characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.35 Using the find function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.36 Using the count function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.37 Conversion to upper or lower case. . . . . . . . . . . . . . . . . . . . . . . . 89
6.38 Using the split and join functions. . . . . . . . . . . . . . . . . . . . . . . . 89
6.39 Using the replace function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.40 Creating a complement string. . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.41 Using the maketrans and translate functions. . . . . . . . . . . . . . . . . 91
6.42 Converting data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.43 Counting names in the play. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.44 The first Romeo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.45 Counting Romeo and Juliet at the end of sentences. . . . . . . . . . . . . . 92
6.46 Collecting individual words. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.1 The skeleton for a for loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2 The if statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3 Two commands inside of an if statement. . . . . . . . . . . . . . . . . . . . 96
7.4 Using the else statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.5 A compound statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.6 A compound statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.7 Using parentheses in a compound statement. . . . . . . . . . . . . . . . . . 98
7.8 Using the elif statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.9 Using a while loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.10 Using a for loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.11 The range function in Python 2.7. . . . . . . . . . . . . . . . . . . . . . . . 100
7.12 Using the break statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.13 Using the continue statement. . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.14 Using the enumerate function. . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.15 Generating random numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.16 Collecting random numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.17 Computing the average. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.18 A more efficient method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.19 Loading Romeo and Juliet. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.20 Capturing all of the words that follow ‘the ’. . . . . . . . . . . . . . . . . . . 104
7.21 Isolating unique words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.22 Computations for the sliding block. . . . . . . . . . . . . . . . . . . . . . . . 106
7.23 Computing π with random numbers. . . . . . . . . . . . . . . . . . . . . . . 107
7.24 The initial data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.25 Summing the values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.26 More efficient code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.27 Code for the average function. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.1 Reading a file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2 Accessing files in another directory. . . . . . . . . . . . . . . . . . . . . . . . 112
8.3 Opening a file for writing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.4 Opening a file for writing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.5 Using the seek command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.6 Saving data with the pickle module. . . . . . . . . . . . . . . . . . . . . . . 114
8.7 Loading data from the pickle module. . . . . . . . . . . . . . . . . . . . . . 114
8.8 Reading the DNA file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.9 Counting the occurrences of the letter ‘t’. . . . . . . . . . . . . . . . . . . . 115
8.10 A sliding window count. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
8.11 The sliding window for the entire string. . . . . . . . . . . . . . . . . . . . . 116
8.12 Reading the sales data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.13 Splitting the data on newline and tab. . . . . . . . . . . . . . . . . . . . . . 118
8.14 Splitting the first data line. . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.15 Converting data to floats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.16 Converting all of the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.1 Loading the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.2 Separating the rows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.3 Determining the columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.4 Gathering the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.5 Using the csv module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.6 Using the xlrd module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.7 Converting the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
9.8 Using openpyxl. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
9.9 Alternative usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
10.1 Using Python for character conversions. . . . . . . . . . . . . . . . . . . . . 133
10.2 ABI version number. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
10.3 Reading the first record. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
10.4 Interpreting the bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.5 The ReadRecord function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.6 The base calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.7 The first data record. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.8 Retrieving the first channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.9 The ReadPBAS function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

10.10 The ReadData function. . . . . . . . . . . . . . . . . . . . . . . . . . . 138
10.11 The SaveData function. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
10.12 The Driver function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.1 Creating a vector of zeros. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.2 Creating other types of vectors. . . . . . . . . . . . . . . . . . . . . . . . . . 142
11.3 Setting the printing precision. . . . . . . . . . . . . . . . . . . . . . . 142
11.4 Creating a matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.5 Creating a matrix of random values. . . . . . . . . . . . . . . . . . . . . . . 143
11.6 Extracting elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.7 Extracting a sub-matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.8 Extracting qualifying indexes. . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.9 Extracting qualifying elements. . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.10 Modifying the matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.11 Adding two matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.12 Addition of arrays. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.13 Elemental subtraction and multiplication. . . . . . . . . . . . . . . . . . 146
11.14 Dot product. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.15 Matrix dot product. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.16 Transpose and inverse. . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.17 Matrix inversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.18 Some functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.19 Retrieving information. . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.20 Varieties of summation. . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.21 Finding the maximum value. . . . . . . . . . . . . . . . . . . . . . . . . 149
11.22 Using argsort. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
11.23 Using divmod. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
11.24 Extracting qualifying values. . . . . . . . . . . . . . . . . . . . . . . . 151
11.25 Using the indices function. . . . . . . . . . . . . . . . . . . . . . . . . 152
11.26 Shifting the arrays. . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
11.27 The distances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
11.28 The average of an area. . . . . . . . . . . . . . . . . . . . . . . . . . . 154
11.29 Solving simultaneous equations. . . . . . . . . . . . . . . . . . . . . . . 155
12.1 A basic function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
12.2 Attempting to access a local variable outside of the function. . . . . . . . . 160
12.3 Executing a function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.4 Using the global command. . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.5 Using an argument. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.6 Using two arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.7 Incorrect use of an argument. . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.8 A default argument. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.9 Multiple default arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.10 The help function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.11 Adding comments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.12 Using help on a user-defined function. . . . . . . . . . . . . . . . . . . 165

12.13 Using the return command. . . . . . . . . . . . . . . . . . . . . . . . . . 165
12.14 Returning multiple values. . . . . . . . . . . . . . . . . . . . . . . . . 166
12.15 Function outlining. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
12.16 Adding a command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
12.17 Adding the rest of the commands. . . . . . . . . . . . . . . . . . . . . . 167
12.18 Example calls of a function. . . . . . . . . . . . . . . . . . . . . . . . 168
12.19 The os and sys modules. . . . . . . . . . . . . . . . . . . . . . . . . . . 169
12.20 Importing a module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
12.21 Reloading a module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
12.22 Using the from ... import construct. . . . . . . . . . . . . . . . . . . 171
12.23 Executing a file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
13.1 A very basic class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.2 A string example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.3 Demonstrating the importance of self. . . . . . . . . . . . . . . . . . . . . 176
13.4 Distinguishing local and global variables. . . . . . . . . . . . . . . . . . . . . 176
13.5 Theoretical code showing implementation of a new definition for the addition operator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.6 Overloading the addition operator. . . . . . . . . . . . . . . . . . . . . . . . 178
13.7 Examples for overloading slicing and string conversion. . . . . . . . . . . . . 179
13.8 An example of inheritance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.9 Creating new variables after the creation of an object. . . . . . . . . . . . . 181
14.1 A random number. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
14.2 Many random numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
14.3 A correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.4 A histogram in Python. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.5 Help on a normal distribution. . . . . . . . . . . . . . . . . . . . . . . . . . 191
14.6 A normal distribution in Python. . . . . . . . . . . . . . . . . . . . . . . . . 191
14.7 A larger distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
14.8 A multivariate distribution in Python. . . . . . . . . . . . . . . . . . . . . . 193
14.9 Computing the statistics of a large multivariate distribution. . . . . . . . . 193
14.10 Random dice rolls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
14.11 Random dice rolls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
14.12 Distribution of a large number of rolls. . . . . . . . . . . . . . . . . . 194
14.13 Random cards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
14.14 Shuffled cards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
14.15 Random DNA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
15.1 The LoadExcel function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
15.2 The Ldata2Array function. . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.3 The MA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.4 The Plot function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
15.5 The LOESS function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
15.6 Processing a single file. . . . . . . . . . . . . . . . . . . . . . . . . . 203
15.7 The GetNames function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
15.8 The AllFiles function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

15.9 The Select function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
15.10 The Isolate function. . . . . . . . . . . . . . . . . . . . . . . . . . . 206
16.1 The LoadDNA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
16.2 Using the LoadDNA function. . . . . . . . . . . . . . . . . . . . . . . . . . 214
16.3 The LoadBounds function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
16.4 Length of a gene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
16.5 Considering a complement string. . . . . . . . . . . . . . . . . . . . . . . . . 216
16.6 The CheckForStartsStops function. . . . . . . . . . . . . . . . . . . . . . 217
16.7 The final test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
17.1 The GCcontent function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
17.2 Using the GCcontent function. . . . . . . . . . . . . . . . . . . . . . . . . 220
17.3 Loading data for Mycobacterium tuberculosis. . . . . . . . . . . . . . . . 221
17.4 The Noncoding function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
17.5 The StatsOf function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
17.6 The statistics from the non-coding regions. . . . . . . . . . . . . . . . . . . 223
17.7 The Coding function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
17.8 The statistics from the coding regions. . . . . . . . . . . . . . . . . . . . . . 223
17.9 The Precoding function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
17.10 The statistics from the pre-coding regions. . . . . . . . . . . . . . . . . 224
18.1 Reading a file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.2 Displaying the contents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.3 Creating a long string. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.4 Performing all in a single command. . . . . . . . . . . . . . . . . . . . . . . 229
18.5 The ReadFile function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
18.6 Calling the ParseDNA function. . . . . . . . . . . . . . . . . . . . . . . . . 231
18.7 Using the FindKeyWords function. . . . . . . . . . . . . . . . . . . . . . . 232
18.8 Results from GeneLocs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
18.9 The Complement function. . . . . . . . . . . . . . . . . . . . . . . . . . . 234
18.10 Calling the GetCodingDNA function. . . . . . . . . . . . . . . . . . . . 234
18.11 Using the Translation function. . . . . . . . . . . . . . . . . . . . . . 235
18.12 The opening lines of an ASN.1 file. . . . . . . . . . . . . . . . . . . . . 236
18.13 The DNA section in an ASN.1 file. . . . . . . . . . . . . . . . . . . . . . 236
18.14 The DecoderDict function. . . . . . . . . . . . . . . . . . . . . . . . . 237
18.15 The DNAFromASN1 function. . . . . . . . . . . . . . . . . . . . . . . . . 238
18.16 DNA locations within an ASN.1 file. . . . . . . . . . . . . . . . . . . . . 239
19.1 The covariance matrix of random data. . . . . . . . . . . . . . . . . . . . . . 244
19.2 The covariance matrix of modified data. . . . . . . . . . . . . . . . . . . . . 244
19.3 Testing the eigenvector engine. . . . . . . . . . . . . . . . . . . . . . . . . . 246
19.4 Proving that the eigenvectors are orthonormal. . . . . . . . . . . . . . . . . 247
19.5 The PCA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
19.6 The Project function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
19.7 The AllDistances function. . . . . . . . . . . . . . . . . . . . . . . . . . . 251
19.8 The distance test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
19.9 The first two dimensions in PCA space. . . . . . . . . . . . . . . . . . . . . 253

19.10 The ScrambleImage function. . . . . . . . . . . . . . . . . . . . . . . . 254
19.11 The process of unscrambling the rows. . . . . . . . . . . . . . . . . . . . 255
19.12 The Unscramble function. . . . . . . . . . . . . . . . . . . . . . . . . . 255
19.13 Various calls to the Unscramble function. . . . . . . . . . . . . . . . . 256
19.14 The LoadImage and IsoBlue functions. . . . . . . . . . . . . . . . . . . 258
19.15 Running a system for 20 iterations. . . . . . . . . . . . . . . . . . . . . 260
19.16 Computing data for a limit cycle. . . . . . . . . . . . . . . . . . . . . . 262
20.1 The CodonTable function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
20.2 The CountCodons function. . . . . . . . . . . . . . . . . . . . . . . . . . . 267
20.3 Computing the codon frequencies. . . . . . . . . . . . . . . . . . . . . . . . 267
20.4 The CodonFreqs function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
20.5 The GenomeCodonFreqs function. . . . . . . . . . . . . . . . . . . . . . . 268
20.6 Calling the Candlesticks function. . . . . . . . . . . . . . . . . . . . . . . . 269
20.7 Creating plots for two genomes. . . . . . . . . . . . . . . . . . . . . . . . . . 270
21.1 The SimpleScore function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
21.2 Accessing the BLOSUM50 matrix and its associated alphabet. . . . . . . . 278
21.3 Accessing an element in the matrix. . . . . . . . . . . . . . . . . . . . . . . 278
21.4 Accessing an element in the matrix. . . . . . . . . . . . . . . . . . . . . . . 279
21.5 The BlosumScore function. . . . . . . . . . . . . . . . . . . . . . . . . . . 280
21.6 The BruteForceSlide function. . . . . . . . . . . . . . . . . . . . . . . . . 281
21.7 Aligning the sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
21.8 Creating two similar sequences. . . . . . . . . . . . . . . . . . . . . . . . . . 284
21.9 The ScoringMatrix function. . . . . . . . . . . . . . . . . . . . . . . . . . 286
21.10 Using the ScoringMatrix function. . . . . . . . . . . . . . . . . . . . . 287
21.11 The arrow matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
21.12 The Backtrace function. . . . . . . . . . . . . . . . . . . . . . . . . . 289
21.13 The FastSubValues function. . . . . . . . . . . . . . . . . . . . . . . . 290
21.14 The CreateIlist function. . . . . . . . . . . . . . . . . . . . . . . . . 292
21.15 Using the CreateIlist function. . . . . . . . . . . . . . . . . . . . . . 292
21.16 The FastNW function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
21.17 Using the FastNW function. . . . . . . . . . . . . . . . . . . . . . . . . 293
21.18 Results from the FastSW function. . . . . . . . . . . . . . . . . . . . . 295
21.19 A local alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
21.20 An example alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
21.21 Returned alignments are considerably longer than 10 elements. . . . . . . . 297
22.1 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . . . . 302
22.2 The RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
22.3 Using the RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . 304
22.4 The GenVectors function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
22.5 The modified RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . 306
22.6 Using the RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . 306
22.7 An example with a decay that is too fast. . . . . . . . . . . . . . . . . . . . 306
22.8 Checking the answer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
22.9 The RandomSwap function. . . . . . . . . . . . . . . . . . . . . . . . . . . 310

22.10 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . . 310
22.11 The AlphaAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . 311
22.12 Using the AlphaAnn function. . . . . . . . . . . . . . . . . . . . . . . . 311
22.13 An alignment score. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
22.14 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . . 313
22.15 Examples of the cost function. . . . . . . . . . . . . . . . . . . . . . . 313
22.16 The TestData function. . . . . . . . . . . . . . . . . . . . . . . . . . . 314
22.17 The RandomLetter function. . . . . . . . . . . . . . . . . . . . . . . . . 314
22.18 The RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
22.19 Comparing the computed result to the original. . . . . . . . . . . . . . . 315
23.1 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . . . . 320
23.2 Employing the CrossOver function. . . . . . . . . . . . . . . . . . . . . . . 321
23.3 Employing the CrossOver function. . . . . . . . . . . . . . . . . . . . . . . 321
23.4 The first elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
23.5 The DriveGA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
23.6 A typical run. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
23.7 Copying textual data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
23.8 The Jumble function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
23.9 Using the Jumble function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
23.10 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . . 326
23.11 The Legalize function. . . . . . . . . . . . . . . . . . . . . . . . . . . 327
23.12 Using the Legalize function. . . . . . . . . . . . . . . . . . . . . . . . 328
23.13 The modified Mutate function. . . . . . . . . . . . . . . . . . . . . . . 328
23.14 The DriveSortGA function. . . . . . . . . . . . . . . . . . . . . . . . . 329
24.1 The ChopSeq function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
24.2 Using the ChopSeq function. . . . . . . . . . . . . . . . . . . . . . . . . . . 334
24.3 Extracting a protein. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
24.4 Creating the segments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
24.5 Pairwise alignments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
24.6 Starting the assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
24.7 Using the ShiftedSeqs function. . . . . . . . . . . . . . . . . . . . . . 337
24.8 Using the NewContig function. . . . . . . . . . . . . . . . . . . . . . . . . 338
24.9 Finding the next largest element. . . . . . . . . . . . . . . . . . . . . . . . . 338
24.10 Creating a second contig. . . . . . . . . . . . . . . . . . . . . . . . . . 338
24.11 Determining that the action is to add to a contig. . . . . . . . . . . . . 339
24.12 Using the Add2Contig function. . . . . . . . . . . . . . . . . . . . . . . 340
24.13 Do nothing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
24.14 The third contig. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
24.15 Adding to a contig. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
24.16 Locating contigs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
24.17 Joining contigs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
24.18 Showing a latter portion of the assembly. . . . . . . . . . . . . . . . . . 342
24.19 The Assemble function. . . . . . . . . . . . . . . . . . . . . . . . . . . 344
24.20 Running the assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

24.21 The commands for an assembly. . . . . . . . . . . . . . . . . . . . . . . . 346
24.22 Using the BestPairs function. . . . . . . . . . . . . . . . . . . . . . . 346
24.23 Showing two parts of the assembly. . . . . . . . . . . . . . . . . . . . . 347
24.24 The ConsensusCol function. . . . . . . . . . . . . . . . . . . . . . . . . 348
24.25 The CatSeq function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
24.26 The InitGA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
24.27 The CostAllGenes function. . . . . . . . . . . . . . . . . . . . . . . . . 349
24.28 Using the CostAllGenes function. . . . . . . . . . . . . . . . . . . . . . 350
24.29 Using the CostAllGenes function for the offspring. . . . . . . . . . . . . 350
24.30 The RunGA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
24.31 Using the Assemble function. . . . . . . . . . . . . . . . . . . . . . . . 351
25.1 A slow method to find a maximum value. . . . . . . . . . . . . . . . . . . . 356
25.2 Using commands to sort the data. . . . . . . . . . . . . . . . . . . . . . . . 357
25.3 Populating the dictionary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
25.4 Printing the results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
25.5 Initiating a tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
25.6 Creating data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
25.7 Making M and partially filling it with data. . . . . . . . . . . . . . . . . . . 365
25.8 Altering M after the creation of a new vector. . . . . . . . . . . . . . . . . . 366
25.9 The UPGMA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
25.10 Using the Convert function. . . . . . . . . . . . . . . . . . . . . . . . 369
25.11 The FakeDtreeData function. . . . . . . . . . . . . . . . . . . . . . . . 374
25.12 Using the FakeDtreeData function. . . . . . . . . . . . . . . . . . . . . 375
25.13 Separating the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
25.14 Concepts of the ScoreParam function. . . . . . . . . . . . . . . . . . . . 376
25.15 The variable and function names in the Node class. . . . . . . . . . . . . 377
25.16 The titles in the TreeClass. . . . . . . . . . . . . . . . . . . . . . . . 379
25.17 Initializing the Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . 379
25.18 The information of the mother node. . . . . . . . . . . . . . . . . . . . . 380
25.19 Making the tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
25.20 Comparing the patient to the first node. . . . . . . . . . . . . . . . . . 381
25.21 The final node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
25.22 Running a trace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
26.1 The CData function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
26.2 The CompareVecs function. . . . . . . . . . . . . . . . . . . . . . . . . . . 384
26.3 Saving the data for GnuPlot. . . . . . . . . . . . . . . . . . . . . . . . . . . 385
26.4 The CheapClustering function. . . . . . . . . . . . . . . . . . . . . . . . . 386
26.5 The ClusterVar function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
26.6 Initialization functions for k-means. . . . . . . . . . . . . . . . . . . . . . . 388
26.7 The AssignMembership function. . . . . . . . . . . . . . . . . . . . . . . 389
26.8 The ClusterAverage function. . . . . . . . . . . . . . . . . . . . . . . . . . 389
26.9 The KMeans function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
26.10 A typical run of the k-means clustering algorithm. . . . . . . . . . . . . 391
26.11 The MakeRoll function. . . . . . . . . . . . . . . . . . . . . . . . . . . 392

26.12 The RunKMeans function. . . . . . . . . . . . . . . . . . . . . . . . . . 392
26.13 The GnuPlotFiles function. . . . . . . . . . . . . . . . . . . . . . . . . 392
26.14 The GoPolar function. . . . . . . . . . . . . . . . . . . . . . . . . . . 394
26.15 Calling the k-means function. . . . . . . . . . . . . . . . . . . . . . . . 394
26.16 The FastFloyd function. . . . . . . . . . . . . . . . . . . . . . . . . . 396
26.17 The Neighbors function. . . . . . . . . . . . . . . . . . . . . . . . . . 397
26.18 The AssignMembership function. . . . . . . . . . . . . . . . . . . . . . 398
26.19 A new problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
26.20 Cluster variances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
26.21 The Split function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
26.22 The final clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
27.1 The Hoover function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
27.2 The AllWordDict function. . . . . . . . . . . . . . . . . . . . . . . . . . . 407
27.3 A list of cleaned words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
27.4 The FiveLetterDict function. . . . . . . . . . . . . . . . . . . . . . . . . . 408
27.5 A few examples that failed in Porter Stemming. . . . . . . . . . . . . . . 409
27.6 The AllDcts function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
27.7 The GoodWords function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
27.8 The WordCountMat function. . . . . . . . . . . . . . . . . . . . . . . . . 413
27.9 A few statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
27.10 The WordFreqMatrix function. . . . . . . . . . . . . . . . . . . . . . . . 415
27.11 The WordProb function. . . . . . . . . . . . . . . . . . . . . . . . . . . 415
27.12 The IndicWords function. . . . . . . . . . . . . . . . . . . . . . . . . . 416
27.13 Using the IndicWords function. . . . . . . . . . . . . . . . . . . . . . . 416
27.14 Scoring documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
29.1 An example query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
29.2 Connecting to MySQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
29.3 Creating a table in MySQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
29.4 Uploading a CSV file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
29.5 Using mysqldump. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
29.6 An example query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
30.1 Creating a database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
30.2 Creating a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
30.3 Showing a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
30.4 Describing a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
30.5 Dropping a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
30.6 Inserting data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
30.7 Multiple inserts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.8 Altering data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.9 Updating data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.10Granting privileges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
30.11The basic query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
30.12Selecting movies in a specified year. . . . . . . . . . . . . . . . . . . . . . . 458
30.13Creating a table with a default value. . . . . . . . . . . . . . . . . . . . . . 460

30.14Creating an enumeration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
30.15Example of CAST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
30.16Example of a math operator. . . . . . . . . . . . . . . . . . . . . . . . . . . 463
30.17Example of a math function. . . . . . . . . . . . . . . . . . . . . . . . . . . 463
30.18Selecting movies from a grade range. . . . . . . . . . . . . . . . . . . . . . . 466
30.19Selecting movies from a year range. . . . . . . . . . . . . . . . . . . . . . . . 467
30.20Selecting years with movie with a grade of 1. . . . . . . . . . . . . . . . . . 467
30.21Returning the number of actors from a specified movie. . . . . . . . . . . . 468
30.22The average grade of the movies in the 1950’s. . . . . . . . . . . . . . . . . 468
30.23A demonstration of AS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
30.24Statistics on the length of the movie name. . . . . . . . . . . . . . . . . . . 469
30.25Finding the Keatons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
30.26Finding the Johns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
30.27Finding the actors with two parts to their first name. . . . . . . . . . . . . . 472
30.28Finding the actors with identical initials. . . . . . . . . . . . . . . . . . . . . 472
30.29Example of the LIMIT function. . . . . . . . . . . . . . . . . . . . . . . . . 473
30.30Sorting a simple search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
30.31The movies with the longest titles. . . . . . . . . . . . . . . . . . . . . . . . 474
30.32Sorting actors by the location of ‘as’. . . . . . . . . . . . . . . . . . . . . . . 474
30.33Determining the average grade for each year. . . . . . . . . . . . . . . . . . 475
30.34Sorting the years by average grade. . . . . . . . . . . . . . . . . . . . . . . . 475
30.35Restricting the search to years with more than 5 movies. . . . . . . . . . . . 476
30.36Using CURDATE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.37Right now. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.38Casting data types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
30.39Using CASE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
30.40Using IF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
30.41Using IFNULL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
30.42The FULLTEXT operator. . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
30.43Load data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
30.44Using MATCH-AGAINST. . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
30.45Using QUERY-EXPANSION. . . . . . . . . . . . . . . . . . . . . . . . . . . 483
31.1 A query using two tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
31.2 A query using three tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
31.3 The average grade for John Goodman. . . . . . . . . . . . . . . . . . . . . . 488
31.4 Movies in French. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
31.5 Languages of Peter Falk movies. . . . . . . . . . . . . . . . . . . . . . . . . 489
31.6 Movies common to Daniel Radcliffe and Maggie Smith. . . . . . . . . . . . 491
31.7 Radcliffe’s aid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
31.8 Radcliffe’s mid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
31.9 Radcliffe’s mid with renaming. . . . . . . . . . . . . . . . . . . . . . . . . . 492
31.10The mids with both Smith and Radcliffe. . . . . . . . . . . . . . . . . . . . 493
31.11The aid of other actors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
31.12Unique actors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495

31.13Actors common to movies with Daniel Radcliffe and Maggie Smith. . . . . . 495
31.14The mids for Cary Grant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
31.15The titles with ‘under’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
31.16Inner join with multiple returns. . . . . . . . . . . . . . . . . . . . . . . . . 498
31.17Left join with multiple returns. . . . . . . . . . . . . . . . . . . . . . . . . . 499
31.18Left excluding joins.[Moffatt, 2009] . . . . . . . . . . . . . . . . . . . . . . . 500
31.19The movie listed with each actor. . . . . . . . . . . . . . . . . . . . . . . . . 501
31.20The use of a subquery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
31.21Assigning an alias to a subquery. . . . . . . . . . . . . . . . . . . . . . . . . 502
31.22The top 5 actors in terms of number of appearances. . . . . . . . . . . . . . 503
31.23The actors with the best average scores. . . . . . . . . . . . . . . . . . . . . 503
32.1 Creating the connection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
32.2 Sending a query and retrieving a response. . . . . . . . . . . . . . . . . . . . 506
32.3 Committing changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
32.4 Sending multiple queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
32.5 Sending multiple queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
32.6 Sending multiple queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
32.7 The DumpActors function. . . . . . . . . . . . . . . . . . . . . . . . . . . 510
32.8 The MakeG function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
32.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
32.10The RemoveBadNoP function. . . . . . . . . . . . . . . . . . . . . . . . . 512
32.11The path from Hanks to Sheen. . . . . . . . . . . . . . . . . . . . . . . . . . 513
32.12The Trace function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514

Preface

This textbook is designed for students who have some background in biological sciences
but very little in computer programming. Students are expected to have a beginner's
knowledge of how to use a computer, which includes moving files, understanding file
structures, rudimentary office software skills, and a cursory understanding of core computer
terms. Python scripting, however, is definitely not a pre-requisite.
This book considers three main tools by which computations and data analysis of
biological data may be performed. These core competencies are the use of a spreadsheet,
the use of a computer language, and the use of a database engine. This text assumes
that the reader has very little experience in using a spreadsheet and no experience in a
programming language or the use of a database engine.
Advanced readers might find this text a bit frustrating as many of the examples
do not use the most efficient coding possible. The purpose of this text is to relay an
understanding of how algorithms work and how they can be employed. While coding
efficiency is an admirable competency, it is not the aim of this text since often the most
efficient codes are more difficult to understand.
Finally, it should be noted that biology is a vast field with many different areas of
research. This text only touches on a few of those areas. It would be nice to write a
comprehensive tome on computations in the field of biology, but the author simply does
not have that many decades left on this planet.

Jason M. Kinser, D.Sc.


George Mason University
Fairfax, VA.

Part I

Computing in Office Software

Chapter 1

Mathematics Review

Algebra, geometry and trigonometry concepts will be used throughout this book. This
chapter reviews the basic concepts and establishes the notation that will be used in the
following chapters.

1.1 Algebra

In order to develop rigorous mathematical descriptions of problems it is necessary to
describe entities as variables. For example, the distance between two points is represented
by the variable d or the volume of a container is represented by the variable V. The use
of variables allows for the description of generic cases. For example, it is possible to show
that the area of a square which has sides of length 2 cm is 4 cm². However, that solution
only applies to that particular problem. Variables are used to describe the generic case as
in,
A = w × h = wh, (1.1)

where w is the width and h is the height of the square. Frequently, the multiplication sign
is omitted as shown. This equation applies to any rectangle instead of just one specific
rectangle. Thus, the use of variables is more descriptive of the problem.
Variables are just one letter, and subscripts are used to help delineate similar variables. Consider the case where there are two squares of different sizes. They are named
Square 1 and Square 2. The widths of the two objects are then described as w1 and
w2 . Thus, w still represents the width and the subscripts associate the variable to the
respective square.

1.1.1 Power Terms

Power terms are used to indicate repetitive multiplications. The square of a value is
defined as,

x² = x × x = (x)(x), (1.2)

and the cube of a value is,

x³ = x × x × x. (1.3)

Similarly, the square root for the case of x = y² is,

√x = y. (1.4)
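These power and root operations translate directly into Python, the language used throughout this book. A minimal sketch (** is the power operator, and a power of 0.5 computes a square root):

```python
# Power terms in Python: ** raises a value to a power.
x = 3
print(x**2)     # 9, the square of x
print(x**3)     # 27, the cube of x
print(49**0.5)  # 7.0, the square root of 49
```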

1.1.2 Calculator

Computer operating systems, such as Microsoft Windows, Apple OS X and the various
flavors of UNIX, all come with a calculator. Most of these have various modes including
a scientific mode which contains the trigonometric and algebraic functions.
The calculator from Microsoft Windows is shown in Figure 1.1(a) in its default mode.
However there are other modes that are available as shown with the pulldown menu in
Figure 1.1(b). The selection of the scientific mode presents a different calculator which is
displayed in Figure 1.1(c). This offers trigonometric and algebraic functions.


Figure 1.1: MS-Windows calculator.

1.1.3 Polynomials

A polynomial function relates an output value to weighted linear combinations of power
terms. A quadratic equation with a single independent variable is,

y = ax² + bx + c, (1.5)

where a, b and c are coefficients. The independent variable is x and the dependent variable
is y. Figure 1.2 shows the graph for the case of a = 0.02, b = −0.02 and c = 1. The
variable a controls the amount of bend in the graph, the b controls the horizontal location
and affects the vertical location, and the c affects the vertical location. A description
of this plot is “y vs. x” which displays the dependent variable versus the independent
variable.

Figure 1.2: The graph of a second order polynomial.

The input space is not restricted to a single independent variable. In the case of,

z = x + y − 6xy, (1.6)

the two independent variables are x and y while the dependent variable is z. This creates
a surface plot as shown in Figure 1.3. The x and y axes are the horizontal and axes and
the z corresponds to the vertical axis.

1.1.4 Quadratic Solution

A form of the quadratic equation is,

0 = ax² + bx + c, (1.7)

and in many instances the values of a, b and c are known but the value of x is unknown.
There are actually two solutions to this equation since it is a second order polynomial. A
third order polynomial such as,

0 = ax³ + bx² + cx + d (1.8)

can have up to three solutions for x.

Figure 1.3: The graph of a second order polynomial with two inputs.

The solutions to Equation (1.7) can be determined by,

x = (−b ± √(b² − 4ac)) / (2a), (1.9)

where the ± can be either + or -. One of the two solutions uses + and the other uses -.
It is also possible that a solution to Equation (1.7) does not exist. In this case
b² − 4ac is negative and the square root cannot be computed.
Example.
Consider a = 0.303, b = 0.982 and c = −0.552. The solutions to Equation (1.7) are:
x = (−0.982 ± √((0.982)² + 4(0.303)(0.552))) / (2(0.303)) (1.10)

The two values of x are 0.488 and -3.72.


These values can be confirmed by using either value of x in ax² + bx + c. In both cases
the result of this equation should be 0.
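The example above can be sketched in Python; the function name quadratic_roots is illustrative rather than a library routine:

```python
import math

def quadratic_roots(a, b, c):
    """Solve 0 = ax^2 + bx + c using Equation (1.9)."""
    disc = b**2 - 4*a*c
    if disc < 0:
        return None  # no real solution: the square root cannot be computed
    root = math.sqrt(disc)
    return (-b + root) / (2*a), (-b - root) / (2*a)

x1, x2 = quadratic_roots(0.303, 0.982, -0.552)
print(round(x1, 3), round(x2, 2))  # 0.488 -3.73
```

Substituting either root back into the polynomial returns a value very close to 0, which is the confirmation described above.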

1.2 Geometry

Figure 1.4 shows a rectangle with two linear dimensions. In this example there is a width
and a height. Both of these are lengths, and the units for lengths are commonly inches,
feet, miles, centimeters, meters or kilometers.

Figure 1.4: Linear dimensions.

The area of a rectangle is the width times the height. For the example in Figure 1.4
the area is,
A = hl. (1.11)

The triangle shown in Figure 1.5 is a right triangle which subtends half of the area
of a rectangle of the same height and width. Therefore the area is
A = (1/2)hl. (1.12)
Since this is a right triangle the lengths of the sides are related by the Pythagorean
theorem. In this example,
p² = h² + l². (1.13)

Figure 1.5: A triangle.

Actually, any triangle subtends half of the area of the enclosing rectangle. Figure
1.6 shows a triangle that does not have any right angles but the area of this triangle is
still half of the area of the enclosing rectangle as in,
A = (1/2)ab. (1.14)

Figure 1.6: A non-right triangle.

This property is easy enough to demonstrate with Figure 1.7. The original triangle is
sections II and III. However, within the enclosing rectangle there are two other triangles
of equal areas to the originals. Regions I and II form a rectangle, and since the area of
a right triangle is half of the area of its enclosing rectangle, both I and II must have
equal areas. The same applies to regions III and IV. Thus, the total
area from II and III must be the same as the total area of I and IV , and finally the area
of II and III must be half of the area of the enclosing rectangle.

Figure 1.7: Equal triangles within the enclosing rectangle.

The area of a circle with radius r, as shown in Figure 1.8, is,

A = πr², (1.15)

and the area of the sphere is,

A = 4πr². (1.16)

Figure 1.8: A circle.

The area is just the outside of the object. The volume includes the interior. The
volume of a cube, such as the one shown in Figure 1.9, is the product of the three linear
dimensions,
V = abc, (1.17)
and the volume of a sphere is,
V = (4/3)πr³. (1.18)

Figure 1.9: A cube.

The volume of the cube shown in Figure 1.9 could also be considered as the area of
one face (A = ab) multiplied by the linear dimension that is perpendicular to the face,

V = Ac = abc. (1.19)

The same logic is applied in computing the volume of a cylinder. This is the area of
the circle multiplied by the length of the perpendicular side. The volume of the cylinder
shown in Figure 1.10 is,

V = Az = πr²z. (1.20)

Figure 1.10: A cylinder.
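These area and volume formulas collect naturally into short Python functions; the helper names below are chosen for illustration only:

```python
import math

def circle_area(r):
    """Area of a circle, Equation (1.15)."""
    return math.pi * r**2

def sphere_volume(r):
    """Volume of a sphere, Equation (1.18)."""
    return (4.0 / 3.0) * math.pi * r**3

def cylinder_volume(r, z):
    """Volume of a cylinder, Equation (1.20): face area times height."""
    return circle_area(r) * z

print(round(cylinder_volume(1.0, 2.0), 4))  # 6.2832
```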

1.3 Trigonometry

This section reviews the basics of trigonometry.

1.3.1 Coordinates

Figure 1.11 shows a data point plotted on a graph. There are two common methods of
referencing this point. In rectilinear coordinates the point is denoted by the horizontal and
vertical distances (x, y). In polar coordinates the point is referenced by its distance to the
origin and the angle to the horizontal axis, (r, θ).

Figure 1.11: Coordinates of a data point.

There are other coordinate systems as well, but they all have one feature in common.
Since this point is in R² (two-dimensional space) the representation of this point requires
two numerical values.
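Converting between the two coordinate systems can be sketched in Python with the math module; the function names to_polar and to_rect are illustrative:

```python
import math

def to_polar(x, y):
    """Rectilinear (x, y) to polar (r, theta), with theta in radians."""
    return math.hypot(x, y), math.atan2(y, x)

def to_rect(r, theta):
    """Polar (r, theta in radians) to rectilinear (x, y)."""
    return r * math.cos(theta), r * math.sin(theta)

r, theta = to_polar(1.0, 1.0)
print(round(r, 3), round(math.degrees(theta), 1))  # 1.414 45.0
```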

1.3.2 Triangles

A triangle is formed from the data point and the origin in Figure 1.11. This is a right
triangle which has several convenient properties. Figure 1.12 displays a right triangle with
side lengths of a, b and c. The Pythagorean theorem relates the length of the sides by

c² = a² + b². (1.21)

The angle θ relates to the sides through geometric relationships,

sin(θ) = a/c, (1.22)

cos(θ) = b/c, (1.23)

and

tan(θ) = a/b. (1.24)

Figure 1.12: A right angle triangle.

Likewise, the inverse functions are:

θ = sin⁻¹(a/c), (1.25)

θ = cos⁻¹(b/c), (1.26)

and

θ = tan⁻¹(a/b). (1.27)
Figure 1.13 shows a different triangle that does not have a right angle. The sides
and the angles are related by two laws. An angle is related to the sides by the law of
cosines,

c² = a² + b² − 2ab cos(γ), (1.28)

and the law of sines,

a/sin(α) = b/sin(β) = c/sin(γ). (1.29)

Figure 1.13: A triangle.
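The law of cosines can be checked with a short Python sketch (third_side is an illustrative name); setting γ = π/2 recovers the Pythagorean theorem:

```python
import math

def third_side(a, b, gamma):
    """Law of cosines, Equation (1.28): the side opposite the angle gamma (radians)."""
    return math.sqrt(a**2 + b**2 - 2*a*b*math.cos(gamma))

# A right angle between sides 3 and 4 gives the familiar hypotenuse of 5.
print(third_side(3.0, 4.0, math.pi / 2))  # 5.0
```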

1.4 Linear Algebra

A vector is depicted as an arrow from the origin to a designated point in space such as the
one shown in Figure 1.14. Numerically, a vector is a one dimensional array of numerical
values. A matrix is a two dimensional array of numerical values. This section reviews
the basic processes associated with vectors and matrices.

Figure 1.14: A vector.

1.4.1 Elements

A vector is a collection of numerical values. An example for a four dimensional vector is
~v = (0, 4, 3, 2). The arrow over the variable indicates that the variable is a vector (although
some texts prefer to use bold face script). An individual element of a vector is denoted
by a subscript and the arrow is removed. Thus, the elements in the example vector are
v1 = 0, v2 = 4, v3 = 3 and v4 = 2.
A vector with a subscript such as, ~vk would indicate that the vector is from a set.
In this case, this vector would be the k-th vector from a collection of vectors.
Multiple manners are used to represent a vector. Given a vector that measures 1 in
the horizontal direction and 3 in the vertical direction, the different methods of
representing this vector are:

• ~v = 1x̂ + 3ŷ,

• ~v = 1î + 3ĵ, or

• ~v = ⟨1, 3⟩.

The î and x̂ are the same and just mean that this dimension is in the x, or horizontal,
direction. Likewise, ŷ and ĵ are the same and represent the vertical dimension. In the
case of a three dimensional vector, either ẑ or k̂ is used.

1.4.2 Length

The length of the vector is the hypotenuse along the triangle. Thus, Equation (1.21),
the Pythagorean theorem, is used to compute the length of a vector.
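A minimal Python sketch of this computation (the name length is illustrative):

```python
import math

def length(v):
    """Vector length via the Pythagorean theorem, Equation (1.21)."""
    return math.sqrt(sum(vi**2 for vi in v))

print(length((3, 4)))  # 5.0
```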

1.4.3 Addition

The addition of two vectors is simply the addition of respective elements. Given two vectors
~w and ~v, the addition is,

~z = ~w + ~v, (1.30)

zi = wi + vi, ∀i = 1, ..., N, (1.31)

where N is the length of the vectors, and the ∀ symbol means “for all”. Thus, the addition
is applied to all of the elements in the vector. Subtraction is similar except that the plus
sign is replaced by the minus sign.
Geometrically, the addition is shown as the correct placement of vectors. The
addition of two vectors is shown in Figure 1.15. The tail of vector ~x is placed at the tip
of vector ~w. The summation is ~z, which starts at the tail of ~w and ends at the tip of ~x.
Figure 1.16 shows the subtraction ~z = ~w − ~x. The vector ~x is now reversed in direction
since it has a negative sign. The result is still the vector from the tail of ~w to the tip
of ~x.

Figure 1.15: Adding two vectors.

Figure 1.16: Subtracting vectors.
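Element-wise addition and subtraction, Equation (1.31), can be sketched in plain Python (the names add and sub are illustrative):

```python
def add(w, v):
    """Element-wise vector addition, Equation (1.31)."""
    return [wi + vi for wi, vi in zip(w, v)]

def sub(w, v):
    """Element-wise vector subtraction."""
    return [wi - vi for wi, vi in zip(w, v)]

print(add([1, 2], [3, -1]))  # [4, 1]
print(sub([1, 2], [3, -1]))  # [-2, 3]
```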

1.4.4 Multiplication

Addition and subtraction of vectors is relatively straightforward. Multiplication, however,
is not. There are four different ways in which two vectors can be multiplied together:

• Elemental multiplication,

• Inner product (dot product),

• Outer product, or

• Cross product.

Elemental multiplication is performed much like the method of addition,

zi = wi xi, ∀i. (1.32)

The inner product (also called a dot product) creates a scalar value, as

f = ~w · ~x, (1.33)

f = Σ_{i=1}^{N} wi xi. (1.34)

The outer product creates a matrix,

Mi,j = wi xj. (1.35)

The cross product creates a vector that is perpendicular to both input vectors,

~z = ~w × ~x. (1.36)

This computation is the determinant of the matrix,

     | î   ĵ   k̂  |
~z = | wi  wj  wk |, (1.37)
     | xi  xj  xk |

~z = (wj xk − wk xj)î − (wi xk − wk xi)ĵ + (wi xj − wj xi)k̂. (1.38)
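The four products can be sketched in plain Python for small vectors; the function names are illustrative, and cross assumes three-dimensional inputs following Equation (1.38):

```python
def elemental(w, x):
    """Element-wise multiplication, Equation (1.32)."""
    return [wi * xi for wi, xi in zip(w, x)]

def inner(w, x):
    """Inner (dot) product, Equations (1.33)-(1.34)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def outer(w, x):
    """Outer product matrix, Equation (1.35)."""
    return [[wi * xj for xj in x] for wi in w]

def cross(w, x):
    """Cross product of two 3-D vectors, Equation (1.38)."""
    return [w[1]*x[2] - w[2]*x[1],
            w[2]*x[0] - w[0]*x[2],
            w[0]*x[1] - w[1]*x[0]]

print(cross([1, 0, 0], [0, 1, 0]))  # [0, 0, 1]: perpendicular to both inputs
```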

1.5 Problems

1. What is the value of x² if x = 3?

2. What is the value of x² + x if x = 2?

3. What is the value of x³ if x = 3?

4. What is the value of √x if x = 49?

5. What is the value of x for 0.1x² + 4.3x − 9 = 0?

6. What is the value of x for 0.1x² − 4.3x + 12 = 0?

7. What is the value of x for 0.3x² − 4.3x = −7.4?

8. What is the area of a square with side lengths of 3 inches?

9. If the area of a square is 2, then what is the length of a side?

10. What is the area of a circle with a radius of 1.5 cm?

11. If the radius of a circle doubles, does the area also double?

12. What is the area of a cylinder with a radius of 1 and a height of 2?

13. Given ~v = 1x̂ + 2ŷ and ~w = 3x̂ − 1ŷ. What is ~v + ~w?

14. Given ~v = 1x̂ + 2ŷ and ~w = 3x̂ − 1ŷ. What is ~v − ~w?

15. Given ~v = 1x̂ + 2ŷ and ~w = 3x̂ − 1ŷ. What is ~v · ~w?

16. Given ~v = 1x̂ + 2ŷ. What is the length of this vector?

17. Given the triangle in Figure 1.12 with a = 1 and c = 3. What is the angle θ?

18. Given the triangle in Figure 1.12 with a = 1 and b = 3. What is the angle θ?

19. Given the triangle in Figure 1.12 with a = 1 and θ = 30◦ . What is c?

Chapter 2

Scientific Writing

Communication is paramount in every field of science and engineering. Writing scientific
documents is the most popular form of transferring knowledge to a wide audience, and
therefore it is important to have the skills to create meaningful documents. This chapter
reviews some of the hallmarks of a quality written presentation.

2.1 Content

Most authors understand that the written document needs to follow basic language guide-
lines. The ensuing sections review some of the guidelines that are unique to scientific
writing.

2.1.1 Presentation

Except for rare occasions, scientific documents should be written in the third person. The
author is the observer and not the participant in the experiment. Therefore, the point of
view should not include words such as “I” or “we.”

2.1.2 Figures

Figures are common in documents and there are a few rules that should be heeded. First,
a figure should never be isolated in the document. The text must have a reference to
every figure. Second, the reader should not be required to interpret the figure to draw the
conclusions that the writer wishes to relay. The author must describe why the figure is
important and what is in the figure that proves their contentions.
All figures need a caption and a figure number. This caption is below the figure.
An example is shown in Figure 2.1 which shows the effect after a particular type of mint
is dropped into a particular bottle of soda.

Figure 2.1: A delightful experiment with soda and a mint.

2.1.3 Tables

Tables are treated in a similar manner to figures except for the location of the caption
which is at the top of a table. Once again, the text must have a reference to the table,
and content as well as the importance must be discussed. It is improper to state “the
table shows that the experiment is validated.” Instead, the author needs to explain how
the contents of the table validate their point.
An example is shown in Table 2.1 which shows the results from three experiments.

Table 2.1: A table with random data.

Experiment Result
1 3.423
2 6.432
3 9.243

2.1.4 Equations

Equations are an important format in which to deliver precise descriptions of theory or
processes. Equations can be presented in-line such as E = mc², or as a separate line either
with numbering,

E = mc², (2.1)

or without numbering,

E = mc².

In all cases, the equation is treated as part of the sentence. Thus, if an equation is the
last component of a sentence then a period must follow it.
All variables must be defined near the location where they are first encountered. For
example, in Equation (2.1), m is mass, c is the speed of light, and E is the rest energy of
that mass. Variables are presented in italics both in the equation and in the text. The
major exception is that matrices and tensors tend to be presented in bold, upright fonts.
Units, on the other hand, are presented in upright characters. For example, the mass of
an experimental object is written as m = 1.5 kg.
The derivative symbol, d, in calculus equations is upright as in,

g(x) = df(x)/dx,

or

f(x) = ∫ g(x) dx.

2.2 Word Processing

There are several software packages that can be used to create scientific documents. This
section highlights the advantages and disadvantages of the different choices.

2.2.1 MS - Word

Microsoft-Word is the most popular program used in writing documents. The advantages
are:

• Almost everyone in business uses it, which makes co-authoring plausible.

• It has many different tools for many different styles of writing.

• It can create word indexes.

Some of the disadvantages are:

• It is expensive.

• The equation editor is mediocre.

• The bibliography manager is poor and incompatible with other systems.

• It becomes slow for very large documents.

• Figure captions disconnect from the figures.

• Inline equations cannot be made to look exactly like centered equations.

• Proper equation numbering is a kluge.

• It does not run on Unix.

2.2.2 LaTeX

The LaTeX program is a layout manager and not a word processor. Using LaTeX basically
requires learning a computer language. The advantages are:

• It makes professional looking documents. (This book was written using this software.)

• It does a great job managing very large documents.

• It has been around for decades and so there is a massive reserve of libraries to typeset
almost anything. There are libraries for IPA (international phonetic alphabet),
music, chemical reactions, etc.

• Most serious scientific journals prefer LaTeX over MS-Word and do provide a template.

• It has a fantastic equation editor. Many websites (including Piazza.com) that allow
users to create equations are using LaTeX.

• Websites like Overleaf.com allow for multiple authors working simultaneously on the
same documents.

• It works on all platforms.

• Excellent management of citations using bibtex. Many journals provide bibtex
formatted citations.

The disadvantages are:

• It has a very steep learning curve.

• Most editors are not WYSIWYG. Users must compile documents to see how they
will appear.

• It can require a lot of files.

The best compilers are:

• MS-Windows: MiKTeX

• OS X: MacTeX

• UNIX: Use the package manager to download the compiler. The Kile editor is very
popular.

Since LaTeX is a layout editor, a few lines are required for any document. Code 2.1
shows the basic commands. Line 1 declares the document to be an article. Other options
include book, slides, letter, etc. Many other templates are available and most journals
or universities provide LaTeX templates. Line 2 is a comment field that is not used in
compiling. Line 3 begins the body of the document. Line 4 is the text that is actually
seen in the document and Line 5 ends the document. Anything that is added to the file
after \end{document} is not considered by the compiler.

Code 2.1 Minimal content.


1 \documentclass{article}
2 % document preparation
3 \begin{document}
4 Body of document.
5 \end{document}

2.2.2.1 Packages

LaTeX has many packages that can be loaded to create the correct type of document.
Popular packages are:

• amsmath: Equation typing.

• color: Use of color in the text.

• graphicx: Use of graphics in the text including imported images.

• fullpage: Allows the document to fill the page with smaller margins.

• units: Allows for proper typing of units for variables.

• subfigure: Allows for the inclusion of multiple files in a single image in the document.

• makeidx: Tools to create an index.

• listings: Tools to include source code from many languages including color coding
and line numbering.

These are usually placed at the top of the document as shown in Code 2.2. There
are thousands of packages freely available that manage various types of typesetting.

Code 2.2 Inclusion of packages.

1 \documentclass[11pt]{article}
2 \usepackage{amsmath,amssymb,amsfonts}
3 \usepackage{makeidx}
4 \usepackage{color}
5 \usepackage{units}
6 \usepackage{graphicx}
7 \usepackage{url}
8 \usepackage{subfigure}
9 \usepackage{listings}
10
11 \begin{document}
12 ...

2.2.2.2 Title

A title is easily created as shown in Code 2.3. The title usually contains the title name,
the author and the date. In this example these three are established in Lines 3, 4 and 5.
Line 8 places the title information at this location in the document with the \maketitle
command.

Code 2.3 Making a title.

1 \documentclass[11pt]{article}
2 ...
3 \title{CDS 230}
4 \author{Jason M. Kinser \\
5 {\small \em \copyright\ Draft date \today}}
6
7 \begin{document}
8 \maketitle
9 ...
9 ...

The font size is established in line 1. The command \small starts the use of a
smaller font. The command \em creates italic text. The \copyright creates the ©
symbol. The \today command inserts the date when the file is compiled.

2.2.2.3 Headings

Headings are easily created using several commands depending on the heading level.
Examples are shown in Code 2.4. Line 5 creates a new chapter heading. This book uses the
default styles for chapter headings. Chapter headings are available only if the document

class is a book. If the document class were an article then line 5 would not be allowed.
Line 6 starts a section heading and lines 7 and 8 create subheadings. The headings are
automatically numbered including the chapter number if the document is a book. Line 9
uses the asterisk to suppress heading numbering for this section.

Code 2.4 Making headings.

1 \documentclass[11pt]{article}
2 ...
3
4 \begin{document}
5 \chapter{Chapter Name}
6 \section{Section Name}
7 \subsection{Sub Section Name}
8 \subsubsection{Interior Section Name}
9 \section*{Section without Number}
10
11 ...

2.2.2.4 Cross References

Cross references are links within a document to another location. It is possible to link to
a figure, table, equation, heading or other parts of the document. LaTeX uses the \label
command to identify locations that can be referenced and \ref to link to that reference.
For example, the goal is to create a link in the text that refers to a different section in the
document. Line 2 in Code 2.5 creates a section heading and attaches the label se:title1
to it. Later in the document this is referenced as shown in Line 7. When the document
compiles the text will replace the reference with the section heading number.

Code 2.5 Referencing.

1 ...
2 \section{Title 1}\label{se:title1}
3 Text inside of this section
4
5 \section{Title 2}
6 Text inside of this section that needs to refer to
7 Section \ref{se:title1}.
8 ...
8 ...

LaTeX is a two-pass compiler. In the first pass the labels are found and stored in an
auxiliary file. In the second pass this file is then used to connect to the references in the
text. So, it is necessary to compile the document twice to make all of the connections.
Some environments such as MiKTeX perform both passes without user intervention. Most
other user interfaces require that the user compile the document twice. The presence of
two question marks indicates that a cross reference is not made. These appear at the
location of a reference. This means that the partner label does not exist, there is a typo
in either the label or the reference, or that the user needs to run the compiler again.

2.2.2.5 Figures and Captions

Figures can be added to LaTeX documents in two fashions. The first is to use a package
such as TikZ which allows the user to make drawings with programming commands. While
this is an extremely powerful tool, it also has a steep learning curve. The second method
is to load an image file that was created through any other means and stored on the hard
drive. An image can be inserted using the \includegraphics command. An example is
shown in Line 4 of Code 2.6. This has the additional argument of reducing the image size
by a factor of 2. This command inserts the image from the file myfile.jpg.

The code shown does more than just insert an image. Line 2 begins a figure
region which is dedicated real estate for this image. It is a floating object and so LaTeX will
place it at the optimal location so that there are no large blank regions in the document. The
argument [htp] controls this placement, indicating that the figure should be placed here if
possible, then at the top of a page, and then on a separate page of floats. Line 3 will center
the figure horizontally on the page. Line 5 creates the caption and Line 6 creates the cross
reference label. There are many more options that can be used to place the figure, wrap
text around the figure and create subfigures.

Code 2.6 Inserting a figure.

1 ...
2 \begin{figure}[htp]
3 \centering
4 \includegraphics[scale=0.5]{mydir/myfile.jpg}
5 \caption{My caption.}
6 \label{mg:myimage}
7 \end{figure}
8 ...

2.2.2.6 Equations

The most powerful feature of LaTeX is the ability to create professional looking equations. The
language used in creating equations is the standard in the industry, and many websites now
use LaTeX scripting to create equations. Websites such as http://www.sciweavers.org/
free-online-latex-equation-editor allow the user to generate equations with
pull down menus and see the LaTeX coding. Tools such as MathJax allow websites to
render equations as the user views them.
Inserting an equation is very easy. An inline equation is surrounded by single dollar
signs (or \( \)). Centered equations are surrounded by double dollar signs (or \[ \]).
Numbered equations use \begin{equation} and \end{equation} as shown in Code
2.7. This equation is

    E = mc^2    (2.2)

Code 2.7 Inserting an equation.

1 ...
2 \begin{equation}\label{eq:emc2}
3 E = m c^2
4 \end{equation}
5 ...

LaTeX will automatically number the equations. For a book document the numbering
will also include the chapter number, as in this example.
The library of possible symbols is enormous, so only a few items are listed here.

• Subscripts begin with an underscore and superscripts begin with a caret.

• Lower case Greek letters use a backslash and spell out the symbol's name. Example:
\alpha produces α.

• Upper case Greek letters use the same method but the first letter of the name is
capitalized. Example: \phi produces φ and \Phi produces Φ.

• Items are grouped by braces. Example: e^{-2x} produces e raised to the power −2x.

• Math symbols have specified names. Example: \int_0^N produces an integral sign
with limits 0 and N.

• Character accents are also named. Example: \vec x produces x with an arrow over it.

• Making inline equations appear as though they are stand-alone equations is also possible.
Example: \displaystyle\int_0^N renders the integral at full display size.

• Several types of matrices are available. Example: \begin{pmatrix} 1 & 2 \\ 3 & 4
\end{pmatrix} produces a 2×2 matrix with rows 1 2 and 3 4 in parentheses.

The capability of LaTeX to create equations is enormous. Beginners will find the
Sciweavers website beneficial: its pull down menus create equations and help in learning
the LaTeX language.

Table 2.2: My Table

A | B | C
--+---+--
1 | 4 | 4
3 | 6 |

2.2.2.7 Tables

There are two keywords used in creating tables. The tabular keyword is used to construct
the grid and contents, and the table keyword is used to place the contents in a floating
table, perhaps centered on the page with a caption.
An example is shown in Code 2.8. In Line 5 the tabular command is used. Following
that is a code that indicates that there are three columns (three letters); the first column
is centered, the second is left justified and the last is right justified. The vertical bars
indicate that there will be vertical line separators before and after each column. The
table begins in Line 6 with \hline, which creates a horizontal line. Line 7 creates the
first line of items with each column separated by & and the final entry followed by two
backslashes. Line 8 produces a double horizontal line. The following lines finish the table,
and the result is shown in Table 2.2.

Code 2.8 Creating a table.

1 ...
2 \begin{table}[htp]
3 \centering
4 \caption{My Table} \label{ta:mytable}
5 \begin{tabular}{|c|l|r|}
6 \hline
7 A & B & C \\
8 \hline\hline
9 1 & 4 & 4 \\
10 3 & 6 & \\
11 \hline
12 \end{tabular}
13 \end{table}
14 ...

This may seem to be a very cumbersome method of creating a table compared to the
point and click methods used in word processors. However, the truth is just the opposite.
If a program is written to generate data that needs to be put into a table, then the program
can also be made to generate a text string that is the LaTeX coding for the table. In other
words, the user writes a program to make the computations, and it also produces a string
such as the text shown in Code 2.8. The user can then simply copy this string into their
LaTeX file. If the user is generating several tables then this method can be exceedingly
faster than placing items into cells, one at a time, by a mouse.
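The program described above can be sketched in Python, the language used throughout this text. The `latex_table` function below is a hypothetical helper, not part of any package: it converts a header and a list of rows into the tabular markup shown in Code 2.8.

```python
def latex_table(header, rows, align='|c|l|r|'):
    """Build the LaTeX tabular markup for a header and a list of rows."""
    lines = ['\\begin{tabular}{' + align + '}', '\\hline']
    lines.append(' & '.join(header) + ' \\\\')
    lines.append('\\hline\\hline')
    for row in rows:
        # Each row becomes "a & b & c \\"; empty cells are allowed.
        lines.append(' & '.join(str(x) for x in row) + ' \\\\')
    lines.append('\\hline')
    lines.append('\\end{tabular}')
    return '\n'.join(lines)

# Reproduce the body of Table 2.2.
print(latex_table(['A', 'B', 'C'], [(1, 4, 4), (3, 6, '')]))
```

The string printed here can be pasted directly into a LaTeX file, which is the workflow the paragraph above describes.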

2.2.2.8 Bibliography

LaTeX also has a very nice method of generating a bibliography. Citations are placed in a
single text file and bibtex is used to generate the citations and their links. Many journals
provide bibtex formatted citations on their websites. Code 2.9 shows the entry for a
journal article. The file containing the citations should be named with a .bib extension;
in this example it is named cites.bib. Note that the file contains only the citation
information; it gives no indication of how the citations are to be formatted in the document.

Code 2.9 A bibliography entry.

1 @article{Hodgkin52,
2   author = {A. L. Hodgkin and A. F. Huxley},
3   title = {A Quantitative Description of Membrane Current and
4     its Application to Conduction and Excitation in Nerve},
5   journal = {Journal of Physiology},
6   volume = {117},
7   pages = {500--544},
8   year = {1952}
9 }

In the LaTeX document, usually at the end, the bibliography is created. Code 2.10
shows the two lines that are used. The first line indicates the style, which is named alpha
in this case. Many other styles are available and some journals even provide a template
for their style. The user simply replaces alpha with the desired style. Line 2 actually
places the bibliography in the document at this location. The word cites indicates that
the information is in a file named cites.bib.

Code 2.10 Creating the bibliography.

1 \bibliographystyle{alpha}
2 \bibliography{cites}

Finally, it is necessary to place the reference to the citation in the text. The \cite
keyword performs this task as shown in Code 2.11. This citation references Hodgkin52, which
is the name of the citation from Line 1 in Code 2.9. Only the citations that are cited in
the text will be placed in the bibliography. So, it is possible to have a large cites.bib file
with all citations from many projects, but only those that have a \cite reference
will be printed in the back of the document.

Code 2.11 Creating the citation reference.

1 This is the text in the document.\cite{Hodgkin52}

There are many citation managers, such as JabRef, that provide an easier interface
for entering the citation data.

2.2.2.9 Final Comments

LaTeX is an exceedingly powerful tool for creating professional documents. The description
provided here is merely the tip of the tip of the iceberg.

2.2.3 LibreOffice

LibreOffice provides an office suite at no cost. It is not quite as powerful as MS-Word but
does have advantages of its own.
The advantages are:

• It is free.
• It has a good equation editor, and an add-on allows for LaTeX equation editing.
• It is available for every major platform: Windows, OS X or UNIX.
• It can read and write MS-Word documents, although complicated documents do not
translate without problems.

The disadvantages are:

• Journals do not accept OpenDocument files, although some accept PDFs, which
LibreOffice can generate.
• Some features are missing from the slide program (similar to PowerPoint).

2.2.4 Others

There are other document creation systems that are available but tend to lack the ability
to make scientific documents.

2.2.4.1 Google Docs

Google Docs has the advantages of being free and of allowing multiple writers to work
concurrently on a single document. However, it does very poorly in creating equations and
in managing headers, citations and cross references.

2.2.4.2 AbiWord

AbiWord is freely available for all platforms, but its capabilities are limited. The issues
are similar to those in Google Docs.

2.2.4.3 Zoho

Zoho is a cloud based office suite that offers a similar set of features but has traditionally
been slow to use.

2.2.4.4 WPS

WPS (formerly known as Kingsoft Office) is an office suite from China that has the look
and feel of MS-Office. It runs on all platforms, including smart devices.

Chapter 3

Computing with a Spreadsheet

Spreadsheets have been a staple in office software for decades. They are excellent tools
for organizing data, performing computations and creating basic graphs. Microsoft
Excel and LibreOffice Calc are two spreadsheets that have sufficient tools for the analysis
tasks in this text. There are other packages, but they tend to lack the ability to create
plots and analyze the data therein.
This chapter will review some of the very basic aspects of performing computations
in a spreadsheet. MS-Excel and LO-Calc tend to behave similarly, and so the examples
are shown only for MS-Excel.

3.1 Creating Equations

A spreadsheet has a variety of tools for performing mathematical computations. Figure
3.1 shows an incredibly simple example of adding two integers. The formula typed into the
cell starts with an equals sign and then proceeds with the computation. When the ENTER
key is pressed, the formula inside of the cell will be replaced by the answer. The formula still
exists and can be seen in the formula window just above columns D and E.
Computations in a spreadsheet use the same notation as do most programming
languages. The symbols are:

• +: Addition

• -: Subtraction

• *: Multiplication

• /: Division

• %: Percent (note that, unlike most programming languages, spreadsheets use the MOD
function for the modulus)

Figure 3.1: A simple calculation.

3.2 Cell Referencing

Typing in values as in Figure 3.1, though, is not exceedingly useful, as any calculator can
perform such a computation. Spreadsheets become more useful with the ability to reference a
value in a cell. Consider the task of adding a value of 8 to the value in another cell. In
order to perform this task the formula needs to reference the contents of that other cell.
This example is shown in Figure 3.2. In cell A1 there is a value of 29 and the goal is to
add 8 to this value and place the answer in cell B1. In B1 is the formula =A1 + 8. The A1
is the identity of the first cell, and so the contents of that cell are used in the computation
of the value of B1. Once the ENTER key is pressed the value of 37 will appear in cell
B1. However, if the value of A1 is changed then the value of B1 is automatically updated
to reflect the new computation.

Figure 3.2: Referencing the contents of a cell.

A formula can reference many different cells. An example is shown in Figure 3.3, in
which the computation in cell C1 uses the values in A1 and B1. Again, if either of these
values is changed then C1 is automatically updated.

Figure 3.3: Referencing the contents of multiple cells.

3.2.1 Copying Formulas with References

A formula with references can be copied to many different cells and the references will
automatically change. Consider the task of creating a list of incrementing values as shown
in Figure 3.4. This is a small list, but if the task were to make a list 1000 cells long
then typing the values in by hand would be too tedious. A more efficient manner is to use a
formula with a cell reference. In this case the value of 1 is typed into cell A1. Then in cell
A2 the formula = A1 + 1 is typed in. When ENTER is pressed the value of 2 will appear in
A2. The next step is to copy the formula into cell A3, either with the copy and paste routine
or with the fill down option from the spreadsheet menu. When the formula is copied in this
manner, the formula in cell A3 automatically changes to = A2 + 1. This is called a relative
reference.

Figure 3.4: Cell references change as a formula is copied.

To copy a formula to multiple cells, the user can copy the cell with the formula, paint
many cells, and then paste. The formula will be copied to all of the cells that were painted,
and in each one the cell references will adjust. The second method is to paint the cell that
has the formula together with all of the cells that are to receive it. In this case the
mouse would be used to paint cells A2 to A15. Then the fill-down option (control-D) is
used and the formula in A2 is copied downwards into cells A3 to A15.
As seen in the example, there is a list of incrementing numbers. The cursor is placed
on cell A7 and the formula in the window above column E is shown. To create a column
of 1000 incrementing numbers the only difference is that the user would paint cells A2 to
A1000 before pasting or filling down. If the value in A1 is changed then all of the values
in the column are changed accordingly.
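The fill-down operation above has a direct analog in Python (the language used later in this text): each cell is the cell above it plus one. This sketch is only an illustration of the relative-reference logic, not part of the spreadsheet itself.

```python
# Column A rebuilt in Python: A1 holds 1, and each later cell
# is the cell directly above it plus 1 (the relative reference = A1 + 1).
column_a = [1]
for _ in range(999):          # fill down through cell A1000
    column_a.append(column_a[-1] + 1)

print(column_a[:5], column_a[-1])   # → [1, 2, 3, 4, 5] 1000
```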

3.2.2 Absolute Reference

Consider a case that uses the same column A from the previous example and will multiply
every value by 10. A poor example is shown in Figure 3.5 in which the value of 10 is
copied into cells B1 to B15. Now, the task is to multiply the value in column A with the
value in column B. The formula = A1 * B1 is entered into cell C1 and then copied into
cells C2 through C15. The result is as shown and the goal is accomplished.

Figure 3.5: A poor way of creating several similar computations.

However, this is not a very efficient manner in which to perform this computation.
If, for example, the value of 10 needed to be adjusted to 9.8, then all of the cells
in column B would need to be changed. With copy and paste this is not an impossible
task, just an annoying one.
A better solution is to use an absolute reference. Consider the example shown in
Figure 3.6. There is only a single entry in the B column and the desire is to have all
formulas in column C reference that single cell.

Figure 3.6: A better way of creating several similar computations.

The dollar sign in a reference marks a part of the reference that cannot change when the
formula is copied. Thus a reference to cell B$1 would prevent the 1 from changing. All
formulas in column C would then reference cell B1, as shown in Figure 3.7.
A dollar sign in front of the letter in a cell reference prevents the column from changing.
Thus, $B1 would allow the 1 to change but not the B. Finally, $B$1 would prevent
both the column and the row designation from changing.

3.2.3 Cell Names

While referencing a cell by column and row designation is useful, it is possible to apply a
different name to a cell. Consider the task of computing the distance an object falls. The
equation for this is,

    y(t) = (1/2) g t^2,    (3.1)
where y(t) is the distance fallen as a function of time, t is time, and g is the acceleration
due to gravity. The problem is set up the same way as the previous example. Column A holds
the different times, measured in seconds, at which the computations are made. The gravitational
acceleration is g = 9.8 m/s^2 (meters per second per second) and this value is placed in cell B1.
Before the computations are completed, the name of cell B1 is changed to 'gravity'.
Above column A there is a window which normally shows the designation of the cell, such as
'B1'. The user can override this designation by typing the new name in this window as
shown.

Figure 3.8: Changing the name of a cell.

Column C will contain the values of y(t) for each time t in column A. The formula
= 0.5*gravity*A1^2 is typed into cell C1. The designation ‘gravity’ is used instead of
‘B1’. This formula is then copied to all of the cells needed in column C. These values are
the distance that the object has fallen (in meters) for each time in column A.

3.3 Introduction to Functions

Spreadsheets have a plethora of functions that can be applied to the data in the cells.
This section will only review a few of these functions, but users should be aware that the
library of functions is quite large and should be scanned so that the available functions
become familiar to the user.

Figure 3.9: Using the named cell in referenced computations.

3.3.1 The Sum Function

The SUM function adds up the values in a specified region. An example is shown in
Figure 3.10 which has a column of values from cell B1 to B16. The sum is to be computed
and placed into cell B17. The function is written as =SUM(B1:B16) which adds up all of
the values in the given range. When the ENTER key is pressed then the value of the sum
is shown in cell B17, and if any of the values in the data are changed then the sum is
automatically updated.

3.3.2 Statistical Functions

The most common computations for statistical are the average and standard deviations.
The function for the first is AVERAGE and for the latter is STDEV. For the example,
the user would type into cell B18 the formula, = AVERAGE(B1:B16) and in cell B19
=STDEV(B1:B16). Again, if the data values are updated then the values of the com-
putations will also be updated.
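Python's standard library mirrors these two spreadsheet functions: `statistics.mean` corresponds to AVERAGE, and `statistics.stdev` corresponds to STDEV (both compute the sample standard deviation). The data below is a made-up stand-in for cells B1:B16, not the data in the figures.

```python
import statistics

data = [3, 5, 2, 8, 6, 4, 7, 1]      # stand-in for the cells B1:B16

avg = statistics.mean(data)           # = AVERAGE(B1:B16)
sd = statistics.stdev(data)           # = STDEV(B1:B16), sample standard deviation

print(avg, sd)
```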

3.3.3 Comparison Functions

Consider the task of finding the data values that are greater than the average. The
average has already been computed and so this task merely needs to find the values that
exceed a threshold. This can be accomplished with the IF statement, which has three

39
Figure 3.10: Computing the sum of a set of values.

Figure 3.11: Computing the average and standard deviation.

40
arguments. The first is the comparison. The second and third parts are the action to be
taken depending on whether the condition is true or false.
Consider the example shown in Figure 3.12. The statement is constructed in cell
C1. If the value in B1 is greater than the average (which is in cell B18) then a 1 will be
placed in cell C1. If the condition is false then a 0 will be placed in cell C1. The dollar
sign is used because this formula will be copied into cells C2 through C16 and all of the
copies must use the value in B18.

Figure 3.12: Constructing an IF statement.

Figure 3.13 shows the result after this formula has been copied into cells C1 through
C16. Those cells with a 1 indicate that the corresponding value in the B column is greater
than the average. The formula in cell B17 is copied into C17 to compute the sum of
column C, which is also the number of data values that were greater than the average.
If the data is changed then the average is updated and so are all of the values in the C
column.
As in many cases, there is an easier way. The COUNTIF function will count the
number of cells that satisfy a given condition. The example is shown in Figure 3.14
in cell B21. The COUNTIF function has two arguments. The first is the range of data values
to be considered, and the second is the condition, which is in quotes. This will count the
number of cells in the range that have a value greater than 4.3125. When the ENTER key is
pressed the count of 6 will appear in the cell.
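The IF column and the COUNTIF shortcut both reduce to a conditional count, which takes one line in Python. The data below is invented for illustration and differs from the spreadsheet in the figures.

```python
data = [3, 5, 2, 8, 6, 4, 7, 1]              # invented stand-in values

avg = sum(data) / len(data)                   # the AVERAGE cell

# The IF column: 1 where the value exceeds the average, 0 otherwise.
flags = [1 if v > avg else 0 for v in data]

# Summing the flags is what COUNTIF(range, ">" & average) computes directly.
count = sum(flags)

print(count)
```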

Figure 3.13: Copying the formula to cells in column C.

Figure 3.14: Using the COUNTIF function.

3.4 Creating Basic Plots

Spreadsheets do come with the ability to create some types of charts and graphs. This
section will review the methods of creating a line graph and a scatter graph. The spread-
sheets offer several other types of graphs, but as the methods of creating the graphs are
all similar only two types are shown here.
The first example is a simple line plot as shown in Figure 3.15. Data to be plotted is
placed in column A. Then the tab named INSERT is selected (see the top of the image).
The 2-D Line option is selected and a menu appears that has a few selections. In this
case the first one is selected and the chart appears on the screen. The spreadsheet has
automatically determined the range for both axes. There is also a “Chart Title” which
can be changed by double clicking on the Title.

Figure 3.15: Creating a line graph.

This chart assumes that the data values are in order and are the heights of data points
that are equally spaced in the horizontal direction. There are cases in which the user
has a set of (x, y) points to plot and the values on the x axis are not equally spaced. For
this case a Scatter Plot is used. The example is shown in Figure 3.16. Each row is a data
point, with column A containing the x values and column B containing the y values. In
this case the Scatter Plot choice is selected and again a menu appears which provides the
user with several options. The one chosen here creates a smooth curve through the data
points.

Figure 3.16: Creating a scatter plot.

The data does not fill the chart window. The spreadsheet has determined the ranges
for both axes and these may be changed by the user. In Figure 3.17 the horizontal range
is altered. The user double clicks on the horizontal axis and a new menu appears. At the
top of the menu are the choices for the beginning and ending of the horizontal range, and
these can be changed manually. In this case that range is changed and the graph is altered
accordingly.
Components of the chart can be altered by the user usually by double clicking on a
region in the graph. The title can be altered in this fashion. The appearance of the axes
can be altered as shown. The color and markers of the data plot can be altered by double
clicking on the plotted data (see Figure 3.18). The background and grid of the plot can
be changed by double clicking on the graph background.

3.5 Function Estimation

Spreadsheets such as Excel and Calc have tools to estimate the functional form of a graph.
One tool is called Trendline, which can be used for functions following a basic form (such as
a polynomial or log function), and the second is Solver, which can handle much more
complicated functions.

Figure 3.17: Altering the x axis.

3.5.1 Trendline

The Trendline tool can estimate the parameters of a function as long as the function
is from a specific selection of formulas. Consider the graph in Figure 3.18 that shows
exponentially increasing data points. A right click on the graphed data will bring up a popup
menu. There are several options and the one of interest is labeled "Add Trendline." When
this is selected a new interface appears like the one shown in Figure 3.19. The first step
is that the user must select the correct functional form. If the data is linear then the user
should select the Linear option. The data in Figure 3.18 is not linear but instead rises rapidly
as does an exponential function. Therefore, the Exponential option is selected.
At the bottom of this interface are two selections that are quite useful. The next to
last option will display the estimated function and the last option will display a measure
of the goodness of fit. These are both displayed in Figure 3.20. Trendline creates an
exponential function with estimated parameters. In this case, the estimated function is,

    y = 10e^(0.0797x).

That is a perfect fit for this data and so R^2 = 1. If the fit were less than perfect then
R^2 would be less than 1. There is also a thin blue dotted line that plots the estimated
function, but as this lies exactly on top of the plotted data it is hard to see.

Figure 3.18: Accessing the Trendline tool.

Figure 3.19: Trendline interface.

Figure 3.20: Perfect fit trendline.

A second example is shown in Figure 3.21, which is a similar case except that noise
has been added to the data. Thus, the data is no longer a perfect exponential function.
The Trendline process estimates this data to follow the function,

    y = 16.289e^(0.678x).

The R^2 value is less than 1 but still quite high, indicating that this function fits the data
well. The blue dotted line is now visible and it displays the estimated function alongside
the actual data (solid line).

Figure 3.21: Trendline shown with noisy data.

Figure 3.19 shows that there are several functional forms available: exponential,
linear, logarithmic, polynomial, power and moving average. The user is responsible for
selecting the correct form to match the behavior of the data. An incorrect selection
will result in a very poor fit.
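An exponential trendline of the form y = a·e^(bx) is commonly computed by fitting a straight line to ln(y), since ln(y) = ln(a) + bx. The sketch below reproduces that idea in Python under the assumption of noise-free data following the perfect-fit example above; it is an illustration of the technique, not the spreadsheet's internal code.

```python
import math

# Noise-free data following y = 10 * exp(0.0797 * x), as in the perfect-fit example.
xs = [float(i) for i in range(1, 21)]
ys = [10.0 * math.exp(0.0797 * x) for x in xs]

# Least-squares line through (x, ln y): ln y = ln(a) + b*x.
lys = [math.log(y) for y in ys]
n = len(xs)
mx = sum(xs) / n
my = sum(lys) / n
b = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys)) \
    / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

print(a, b)   # recovers the parameters 10 and 0.0797
```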

3.5.2 Solver

Trendline works well for the functions in its list, but does not work well for more
complicated functions such as a Gaussian (bell curve). For these, Excel and LibreOffice
Calc offer a Solver function that can estimate the parameters of a function that fits the data.
An example fits the data with a Gaussian function of the form,

    y = A e^(-(x-µ)^2/(2σ^2)),    (3.2)

where A is the amplitude, µ is the x location of the Gaussian peak, and σ controls the width
of the peak. For this example A = 1 and so the only two parameters are µ
and σ. The raw data, shown in Figure 3.22, is created using µ = 3 and σ = 0.75
with some random noise added.

Figure 3.22: Raw data which is a noisy bell curve.

In an actual experiment the values of µ and σ are not known, and it is the goal of
Solver to determine the two values that best fit the data. Using Solver requires a bit
more setup work than Trendline. A typical use is shown in Figure 3.23, where the raw x
and y values are in the first two columns. There are 70 rows of data and this image
only shows the first few rows. Column C contains the two variables µ and σ in cells C3
and C5 respectively. Initially, these values are not known and they are set to 1. Column
D shows the calculated results using equation (3.2) with the two values of µ and σ from
column C. The equation used in cell D2 is shown in line 1 of Code 3.1. Column E is the
squared error between the measured data (column B) and the calculated data (column
D). The Excel command used in cell E2 is shown in line 2 of Code 3.1. The difference
between the measured and calculated data is squared to remove any negative signs and to
accentuate those cases where the difference is large.

Code 3.1 Commands used in Figure 3.23.

1 =EXP(-((A2-C$3)^2)/(2*C$5^2))
2 =(B2-D2)^2
3 =SUM(E2:E72)

Figure 3.23: The spreadsheet architecture for Solver.

Initially, this error is large because the correct values for µ and σ are not known.
The final cell is the sum of the errors which is in cell G2. The equation for this cell is
shown in line 3 of Code 3.1. Since all of the squared errors are positive values the only
way that cell G2 can be zero is if all of the squared errors are zero and this occurs if the
calculated and measured data match exactly. Since there is noise in the data a perfect
solution is not possible, so the Solver will attempt to minimize the error in G2 by changing
µ and σ.
It is possible for the user to manually change these values and keep the changes if
the value of G2 decreases. Basically, Solver will do the same thing in a much faster
manner. The Solver is accessed by clicking on Data in the menu in the upper ribbon and
then Solver in the submenu. Figure 3.24 shows the dialog window that appears. In the Set
Objective window G2 is entered, since this is the cell that is to be minimized, and
the Min button is selected. The By Changing Variable Cells window contains the
cells that are to be altered, in this case cells C3 and C5. Finally, the Solve button
at the bottom of the window is pressed and Solver computes new values for µ and σ.
The computed values are µ = 3.00581 and σ = 0.77699 which are very close to
the values used to generate the data. Had there been no noise then Solver would have
recovered the exact values for µ and σ. The final squared error is 0.140. Since the values
of µ and σ are now changed the values in columns D and E are also changed. Figure 3.25
shows the new values of column D plotted along with the original data. As seen there
is a fairly close match and thus the Gaussian function estimate of the measured data is
complete.
Solver is well suited for problems that Trendline cannot solve. It is important
in each case to make sure that the answer provided by the algorithm matches
the data. The Solver will return an answer, but in some cases the answer may not be
sufficiently correct. This is a common issue with these types of algorithms: they may not
home in on a solution, or there may be something in the data that prevents the algorithm
from finding an acceptable one. If the solution is insufficient then the user needs to
identify whether there are data points that violate mathematical rules (square root of a negative
number, divide by zero, etc.) and remove them. If there are a lot of data points then
another approach is to perform the curve fit on a subset of the data.
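What Solver does in this example can be imitated in a few lines of Python: compute the spreadsheet's squared-error cell G2 as a function of µ and σ, then let a minimizer adjust the two values. The sketch below uses a crude pattern search rather than Solver's gradient-based method, and invented noise-free data with µ = 3 and σ = 0.75; it illustrates the idea only.

```python
import math

def gauss(x, mu, sigma):
    # A Gaussian with amplitude A = 1, as in equation (3.2).
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Invented, noise-free stand-in for columns A and B.
xs = [i * 0.1 for i in range(71)]
ys = [gauss(x, 3.0, 0.75) for x in xs]

def total_error(mu, sigma):
    # The spreadsheet's cell G2: the sum of the squared errors in column E.
    return sum((y - gauss(x, mu, sigma)) ** 2 for x, y in zip(xs, ys))

# A crude pattern search standing in for Solver's minimizer: try small
# moves in mu and sigma, keep improvements, shrink the step otherwise.
mu, sigma, step = 1.0, 1.0, 1.0
best = total_error(mu, sigma)
for _ in range(200):
    trials = [(mu + dm, sigma + ds)
              for dm in (-step, 0.0, step)
              for ds in (-step, 0.0, step) if sigma + ds > 0]
    err, (m, s) = min((total_error(m, s), (m, s)) for m, s in trials)
    if err < best:
        best, mu, sigma = err, m, s
    else:
        step *= 0.5

print(round(mu, 3), round(sigma, 3))   # approaches 3.0 and 0.75
```

With noisy data, as in the chapter's example, the same search would stop at a small but nonzero error, just as Solver does.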

Figure 3.24: The Solver interface.

Figure 3.25: Plots of the original data and the Solver estimate.

Problems

1. In a spreadsheet cell compute 5 + 6.

2. In a spreadsheet cell compute the square root of 49.

3. In a spreadsheet cell compute the cosine of π/4.

4. In a spreadsheet cell convert the angle 45° to radians using the RADIANS function.

5. Create a column of 1000 numbers in which each number is the sum of the two
numbers directly above it. The first two numbers in the column should be 0 and 1.

6. In a spreadsheet column create 1000 random values (using the RAND function). In
the 1001st cell compute the average of these values.

7. Using the example in Section 3.2.3, compute the y(t) values for an object on the moon,
where the gravity is only 1.68 m/s^2.

8. The equation for a falling object that has an initial speed of v0 is
y(t) = v0*t + (1/2)*g*t^2. Modify the example in Figure 3.9 so that it includes an
initial speed of v0 = 1.3 m/s. Place the value of v0 in cell B2 and use that cell in
the new computations in column C.

9. Create a plot of 0.5x^2 + 7 for values of x ranging from -5 to +5.

10. Create a scatter plot for the following data (0,0.01), (0.2, 0.034), (0.4,0.15), (0.7,0.5),
(0.9,1.0), (1.1,1.3), (1.4,2.0), (1.8,3.15), (2.0, 4.01).

11. Use Trendline to find the function that best fits the data in the previous problem.

Chapter 4

Gene Expression Arrays: Excel

Gene expression arrays are biological experiments that can gather information about the
content of a sample for thousands of genes. This data is collected and made available as
spreadsheets from the NIH. The experiment used here gathers information about roughly 800
genes for healthy men and women. This chapter will use the tools in a spreadsheet to analyze
this data.

4.1 Data

A gene expression array is a small plate with samples of hundreds of genes attached in
an array of small spots. A sample with perhaps unknown DNA content is washed over
the plate, and if the DNA attached to the plate is similar to the DNA in the input then
the two will adhere. The input has a dye attached to it that can be detected through
optical means. The quick description is that if a spot on the plate mates with the
input sample then that spot will also collect an amount of dye.
These are delicate experiments, and so it is difficult to exactly replicate the same
experiment. The solution is that each experiment has two input samples, each with a different
dye. Instead of analyzing the amount of dye at each sample spot, the researchers analyze
the ratio of the two dyes.
The data used in this chapter is obtained from https://www.ncbi.nlm.nih.gov/geo/query/
acc.cgi?acc=GSE6553. The code to the right of the colon indicates which
samples are used in the file and in which order. The 'F' indicates that the sample was
from a female, 'M' a male, and the ensuing numerical value is the ID. So, 'F51' is a
sample from a particular female. In the first file, the female F51 is the first sample and
the male M58 is the second sample.

• GSM151667 : F51 M58

• GSM151668 : M58 M57
• GSM151669 : M57 M56
• GSM151670 : M56 M55
• GSM151671 : M55 F53
• GSM151672 : F53 F52
• GSM151673 : F52 F51
• GSM151674 : F53 F51
• GSM151675 : F52 M57
• GSM151676 : M55 M58

There are three sections to this file. The first is less than 100 rows and provides
information about the experiment. The second section has a large number of rows with
each row associated with a single sample on the slide. This is the data after analysis
and usually what the investigator would use. However, the intent of this chapter is to
demonstrate how to load and analyze the data. Thus, this chapter will use the third
section of data in the spreadsheet. In the file GSM151667 there are 1600 genes and so
there are 1600 rows of analyzed data and about 50 rows of experiment information. Thus,
the raw data starts on row 1650. This shows the detected data and of this there are only
a few columns that will be used here.
The first column is the ID number which is unique to each row. The spots on the
plates are arranged in a rectangular array of blocks. Each block contains a rectangular
array of spots. The next four columns identify the vertical and horizontal position of the
block and the vertical and horizontal position of the spots within that block.
Column F shows the name of the gene. Columns G and H show the (x, y) location
of the spot on the original image obtained from the scanner. Column I is the measured
intensity of the spot. This corresponds to the amount of dye on that spot. However, there
is also a background value and this is stored in column J. This is the data for channel 1
which corresponds to the first item listed in the file name. So, for the first file, channel 1
corresponds to female F51. Channel 2, male M58, intensity and background data is shown
in columns U and V. There are many other columns but they will not be used in this
chapter.

Figure 4.1: A small portion of the detected data.

The goal is to find genes that are turned on in one channel but not the other. This
is called an expressed gene and a general rule of thumb is to find those cases in which the
intensity value of one channel is twice (or more) as much as in the other channel. However,
there are many issues that confound this simple comparison. The dyes do not provide the
same illumination for the same sample size, there are a lot of biological and optical issues

that affect data collection. Thus, direct comparison is not readily possible.
The rest of this chapter demonstrates one method of performing the analysis. How-
ever, the first step is to copy the pertinent data to a new page in a spreadsheet. Figure
4.2 shows the new page with the six selected columns copied therein.

Figure 4.2: The pertinent data is copied to a new sheet.

4.2 Background

The data is collected through an optical detector but the background is not zero. Further-
more, the background signal is not uniform across the plate that contains the samples.
The machine measures the intensity of each spot but also measures the intensity around
the spot and determines that this is the background signal.
The analysis begins with the subtraction of the background signal from the intensity.
This is repeated for every spot in both channels. There are often a few spots that misbe-
have either in the biological process or the detection process and the background signal
can be higher than the intensity signal. For those few cases the data will be discarded in
this analysis.
The subtraction for channel 1 is placed in cell H2 and the command is =IF(C2>D2,C2-D2,1)
which places the subtraction value if the intensity is greater than the background. If that
is not the case then the computation inserts a value of 1. Later in the analysis the log of
the values will be computed and thus the 1 is used here knowing that it will become 0 in
the final steps. The first few rows are shown in Figure 4.3.

Figure 4.3: The subtraction of the background.
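The same correction is easy to express in Python, the subject of Part II of this book. The sketch below mirrors the spreadsheet rule =IF(C2>D2,C2-D2,1); the intensity and background numbers are hypothetical.

```python
# Hypothetical (intensity, background) pairs for a few channel-1 spots.
spots = [(2967, 1117), (5821, 1224), (980, 1305)]

# Keep the difference when intensity exceeds background; otherwise store 1,
# which becomes 0 when the log is taken later in the analysis.
corrected = [i - b if i > b else 1 for i, b in spots]
print(corrected)   # [1850, 4597, 1]
```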

4.3 Visualization

Commonly, the two dyes are called red and green because those are the colors that are used
on the computer display to represent them. The dyes used are Cy3 and Cy5 as displayed
in C35 and C36 in the original data file. These dyes have peak responses near 570 nm and
670 nm respectively. In the visible spectrum these are wavelengths of yellow and red, but
green is visibly more pleasing and is used for display.
In this data, channel 1 used Cy3 and channel 2 used Cy5. Each spot of data then
has a green and red value. Figure 4.4 shows the R vs G plot which converts the Cy3 and
Cy5 data to (x, y) points.

Figure 4.4: The R vs G plot.

There are a few issues with this type of display. The first is most of the display is
blank. Often when that is the case in a plot, resolution of the data is sacrificed. The second
problem is that the data does not correspond well to a 45° line. It is expected that most genes
have about the same response since males and females share many genes. If that is the case
then the R and G values should be about the same and therefore the data should crowd
around a 45° line, but it does not.
These issues will be addressed, but for now it is recognized that this is not the best
way to display the data. The intensity of a spot with R and G values is now defined as
their average. This is represented by I = (R + G)/2.
Figure 4.5 shows the data for I and R/G. The plot of the data is shown in Figure
4.6. The x axis corresponds to the intensity of the spot and the y axis corresponds to the
ratio R/G.
This graph provides more resolution for the ratio of the responses of the two dyes.

Figure 4.5: The R/G vs I data.

Figure 4.6: The R/G vs I plot.

An expressed gene is one in which one channel is at least twice as much as the other.
Clearly there are several points that have a vertical value of more than 2. These are the
spots in which channel 1 is more than twice as much as channel 2. The reverse, though,
is more difficult to see. The cases in which channel 2 is twice as much as channel 1 are
those in which the vertical value is less than 1/2. The nature of this graph does not allow
those points to be easily seen.
The solution is to compute the log of the values. Consider that log2 (2) = 1 and
log2 (0.5) = −1. In a log graph the ratios become linear values which will display expressed
genes equally for either channel. Two new values are defined as,

A = log2 (I)

and
M = log2 (R/G).

The spreadsheet function LOG( v, n ) can receive two values in which v is the input
value and n is the log base. Thus, log2 (x) is written as =LOG(x,2). The values of A and
M are computed in columns N and O and the first few are shown in Figure 4.7. The graph
is shown in Figure 4.8.

Figure 4.7: The M vs A data.

The horizontal axis corresponds to the log of the intensity and the vertical axis
corresponds to the log of the ratio R/G. Values above 1 and below -1 are now considered
as expressed genes.
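These two transformations can also be sketched in Python. The (R, G) values below are hypothetical background-corrected intensities; spots with |M| > 1 are the candidate expressed genes.

```python
from math import log2

# Hypothetical background-corrected (R, G) pairs.
pairs = [(1850.0, 2100.0), (4597.0, 1100.0), (1.0, 1.0)]

results = []
for R, G in pairs:
    I = (R + G) / 2      # average intensity of the spot
    A = log2(I)          # horizontal axis: log-intensity
    M = log2(R / G)      # vertical axis: log-ratio
    results.append((A, M))
    print(round(A, 3), round(M, 3), "expressed" if abs(M) > 1 else "")
```

Note that a corrected value of 1 maps to A = 0 and M = 0, which is why 1 was used as the substitute in the background step.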
If the ratio R/G were 1, which is expected for many genes, then the data points would
lie at y = 0. However, the majority of the data points are not along this line. Instead, at
lower intensities there is a strong bias above that line. Again, collecting this data is not an
exact science and there are biases. One bias could simply be that the dyes react differently
to the illumination. This bias must be removed from the data before expressed genes can
be identified.

4.4 Normalization

LOESS normalization separates the data into small windows and then subtracts the av-
erage of the window for all of the data within it. For example, the data may be separated
into windows of 50 data points. The leftmost 50 points are the first window. The average

Figure 4.8: The M vs A plot.

of the points within this window are computed, and this average is subtracted from those
points. This will ensure that the average of each window is zero and will remove the
vertical bias that is currently inherent in the data.
There are two steps involved in employing this normalization in a spreadsheet. The
first is that the data has to be sorted according to the A value. The second is that the average
of a sliding window has to be subtracted from the values.
The first task is to sort the data. This is done in two steps. The first is to copy
three columns of data to a new location in the spreadsheet. This will allow the ability
to rearrange the data without disturbing calculations already performed. In this example
there are 1600 genes and thus the calculations in the previous section consume slightly
more than 1600 lines in the spreadsheet. The copied data needs to be at least 25 rows
below the last row of data. In this example, the data is placed in row 1630. Three columns
of data are needed. These are the gene number, the A and the M data that were just
calculated. This data is sorted on the A data and a portion of that is shown in Figure 4.9.
The gene number is required in order to resort the data in a later step.
The second step of the LOESS normalization is to divide the data into windows
of 50 points. This is a little time consuming in a spreadsheet and so the algorithm is
modified slightly. For each value a window of 50 points will be considered, but this is a
sliding window. The 100-th data point in this case is on row 1731 in the spreadsheet. The
window of 50 points will be the 25 points before and after. So, for this point the average
is calculated from rows 1706 to 1756.
The reason that there are at least 25 empty rows above this data is to make it easier
to perform this computation in the spreadsheet. The first row of data is on line 1631 and

Figure 4.9: Sorted data.

the average needs to be computed for the 25 points before and after this. However, there
are no points before this. The spreadsheet calculation of an average will not include cells
if they are empty, and so the calculation of the average for the 25 rows before and after
this first data point will not use the 25 rows before in computing the average. Thus, the
same equation can be used for all rows.
The result is shown in Figure 4.10. The equation placed in cell E1631 is =AVERAGE(C1606:C1656)
and the equation placed in cell F1631 is =C1631-E1631. The value in E1631 is the average
of the 25 rows before and after row 1631. The value in cell F1631 is the value of M with
the average subtracted.

Figure 4.10: Sorted data with the average removed.
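Outside of a spreadsheet, the sliding-window subtraction is a few lines of Python. This sketch assumes the M values have already been sorted by A, and it truncates the window at the edges just as AVERAGE ignores the empty cells above the first data row. The half-width and the toy data are arbitrary.

```python
def window_normalize(m_values, half=25):
    """Subtract the mean of the `half` points before and after each value."""
    n = len(m_values)
    out = []
    for i in range(n):
        window = m_values[max(0, i - half):min(n, i + half + 1)]
        out.append(m_values[i] - sum(window) / len(window))
    return out

# Toy M data carrying a constant bias of 0.4; normalization centers it near 0.
m = [0.4 + 0.01 * i for i in range(10)]
print([round(v, 3) for v in window_normalize(m, half=2)])
```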

Now the average data falls along y = 0. Genes expressed in channel 1 have a value
of y > 1 and genes expressed in channel 2 have a value of y < −1.
As seen there are spurious points, usually at low intensities. Recall that x = 0
corresponds to the case in which the original intensity is the same as the background. Some
researchers simply discard the spurious data points since they occur at very low intensities
with the belief that it is not possible to detect them accurately or that something has
gone wrong with the spot on the plate. However, there are arguments that there is still
information within these points and discarding them may be throwing away important
information.
In any case, the LOESS normalization has removed the bias and now it is possible
to find the expressed genes.

Figure 4.11: Plot of the data with the average removed.

4.5 Comparing Multiple Files

In this data set there are multiple files and finding expressed genes should consider all
pertinent trials. Consider the question of finding the genes that are expressed by males
but not by females. In this case, only the file that had both a male and female should be
used. For this question there are only three qualified files in the set.

• GSM151667 : F51 M58
• GSM151671 : M55 F53
• GSM151675 : F52 M57

The first file has the male information in the second channel and thus expressed
genes would have a value of less than -1. The second file has the male in the first channel
and thus expressed genes should have a value of 1 or greater.
However, in comparing multiple files it must be considered that there are differences
in the experiments that will bias and scale the data.
The process begins with collecting the data. The process of the previous sections is
applied to all files that will be used. Each file is processed to obtain normalized data such
as in Figures 4.9 and 4.10. One of the issues is that this data is sorted differently for each
file and so it is necessary to resort the data according to the gene number.
Figure 4.12 shows part of this data. It shows the data files after LOESS
normalization, with the data sorted again according to the gene number.

Figure 4.12: A partial view of data from all of the files after LOESS normalization.

Below each column of data the average and standard deviation are computed. The
first values of the first three files are shown in Figure 4.13. Most of the averages are
similar but the standard deviations are not. This means that each experiment had different
sensitivities.

Figure 4.13: The average and standard deviation of the first three files.

The process is to first subtract the average from each experiment. So, the average
of each column is subtracted from the values of that column. The equation in cell B1606
is =B2-B$1603. This is copied for all the files to the right and 1600 rows down to include
all of the genes. The first few values from the first three files are shown in Figure 4.14.

Figure 4.14: The data after the average is subtracted.

Subtracting the average will not alter the standard deviation. Thus, each file still
has a different range of sensitivity. Since most of the genes are not differentially expressed
it is expected that the standard deviations of the experiments should be the same. To
accomplish this, each value in an experiment is divided by the value of the standard
deviation. This is shown in Figure 4.15 for the first few rows of the first three files. The
formula in cell B3209 is =B1606/B$3207. Again this formula is copied to the right for each
file and copied down for each gene.
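The same two normalization steps (subtract each file's mean, then divide by its standard deviation) can be sketched with Python's standard library. The file names come from the data set, but the M values here are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical normalized M columns, one per file.
files = {
    "GSM151667": [0.5, -0.3, 1.2, -0.8, 0.1],
    "GSM151671": [1.1, -0.9, 2.5, -1.7, 0.4],
}

# Mirror the spreadsheet steps: subtract the column mean, then divide by the
# column standard deviation so every file has mean 0 and deviation 1.
standardized = {}
for name, col in files.items():
    mu, sd = mean(col), stdev(col)
    standardized[name] = [(v - mu) / sd for v in col]

for name, col in standardized.items():
    print(name, round(mean(col), 6), round(stdev(col), 6))
```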
Now, each file has the offset and bias removed allowing the files to be compared to
one another. It is now possible to pursue a question such as: which genes are expressed

Figure 4.15: The data after division by the standard deviation.

in males and not in females.


Again, only three files are used to pursue this question. A new page in the spread-
sheet is created that contains the necessary information and a part of it is shown in Figure
4.16. This has the gene numbers and names in the first two columns. The next three
columns are the data for the three files after the standard deviation normalization.

Figure 4.16: Data available to answer the male-only question.

In this case the search is for genes expressed in males but not females. Since the
first file put the male in the second channel, the search is to find values in that column
that are less than -1. The search also wants values greater than 1 in the second column
and values less than -1 in the third column.
A partial result is shown in Figure 4.17. The formula in cell G2 is =IF(C2<-1,1,0)
which tests for the values of less than -1 in column C. If this is True then the value of 1
appears in the cell. The other two values have appropriate tests in columns H and I.

Figure 4.17: Comparing the values.

In a perfect world, an expressed gene would appear as 1's in all three columns. In
a real world, it is expected that the results will not be so clear. Column J sums the three
testing columns, and any value of 2 or 3 can be considered as an indication of an expressed
gene.
In this experiment, each gene had two spots on the plate. Notice that each gene
name is repeated. So, the condition for an expressed gene is that it must be expressed in at
least two of the files for each sample of the gene. Basically, the sum in column J must be
2 or 3 for both instances of the gene.
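The voting scheme itself is easy to sketch in Python. The gene names and values below are hypothetical; the sign convention follows the text, with the male sample in channel 2 of the first and third files (so M < -1 marks male expression) and in channel 1 of the second file (so M > 1).

```python
# Columns: gene name, then normalized M for GSM151667, GSM151671, GSM151675.
rows = [
    ("geneA", -1.4,  1.3, -1.2),
    ("geneA", -1.1,  0.9, -1.3),
    ("geneB",  0.2,  0.1, -0.3),
]

scores = []
for name, f67, f71, f75 in rows:
    # Each comparison plays the role of =IF(C2<-1,1,0); booleans sum as 0/1.
    score = (f67 < -1) + (f71 > 1) + (f75 < -1)
    scores.append(score)
    print(name, score, "candidate" if score >= 2 else "")
```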
There are 1600 rows and so this can be a tedious process. One solution is to use
conditional formatting. This will automatically change the format of cells depending on
the value in those cells. In turn, this will make it easier for the human viewer to spot the
few genes of interest.
Figure 4.18 shows the manner in which conditional formatting is accessed through
LibreOffice. The data that is to be formatted is painted and the user selects Format:
Conditional Formatting: Condition.

Figure 4.18: Accessing conditional formatting.

The selection creates a pop up window which is in the background of Figure 4.19.
The condition is set to change the formatting if the value is 2 or greater. A new formatting
style is selected and this produces the foreground pop up window which allows the user
to select font, size, and color. In this case the selection is to change the background color
to yellow for the cases where the condition is true.
A small portion of the file is shown in Figure 4.20. There are many locations in
which a cell is painted yellow, but that only occurs for one instance of a gene. In this
figure, there are two genes in which column J is 2 or 3. These are genes to be considered
as expressed in men and not women.
This is a small test, and none of the three expressed genes scored a value of 3
in column J. The genes of interest are:

Figure 4.19: Changing the format.

Figure 4.20: Partial results.

• protein phosphatase 5, catalytic subunit
• intercellular adhesion molecule 1 (CD54), human rhinovirus receptor
• phospholipase C, beta 2

The NIH contains records for each gene and neither of the last two had any mention
of gender preference. The first one did mention that “elevated levels of this protein may be
associated with breast cancer development” and so the absence of this gene is preferred.[?]
This test was small in size with only a few participants and there was no guarantee that
there were male specific genes in the plates.

Part II

Python Scripting Language

Chapter 5

Python Installation

Python is one of the fastest growing languages available and is pervasive in all fields
of science. The language is free to obtain and is one of the easiest languages to learn,
particularly for users that have very little programming experience.
The most important feature of Python, though, is that it is a very powerful language
that can perform a wide variety of tasks.
There are two versions of Python that are being used. Version 2.7 is the last in
the version 2 series and has a complete set of tools and is being widely used. Version 3.x
(where the x is still changing) is a newer version that is very slowly replacing 2.7. The
toolset is still catching up to 2.7.
This book uses Version 3.x and attempts to note the differences in places where 2.7
differs.

5.1 Repository

The main website is http://python.org. However, other repositories provide a large set
of third party tools that do not accompany the original installation. This section quickly
reviews methods to install Python on popular platforms.

5.1.1 Windows

Windows users should go to http://scipy.org and get one of the packages that is listed.
These packages tend to have a very large set of third party tools, some of which will be
used here. A no-cost repository will be more than sufficient for the work in this book.
Windows users still have the choice between 32 or 64 bit installations. If the user’s
computer is a 32 bit computer then that is the installation that must be used. A 32 bit
version of Python is sufficient for the work in this book, but will limit the user to only

4 GB of workable memory. Thus, if the user is planning on using Python for large scale
problems, then a 64 bit installation may be more appropriate.

5.1.2 MAC

MACs should come with Python installed. However, this installation may not have an
integrated development environment (IDE) or other required modules such as numpy.
Installation of third party components is possible if pip is installed on the computer.
Once pip is installed then the commands in Code 5.1 will install the third party modules
that are required. These commands are executed in the OSx terminal.

Code 5.1 OSx commands.


1 sudo pip install numpy
2 sudo pip install scipy
3 sudo pip install pillow

If pip is not installed then the commands in Code 5.2 may be used to install the
required modules.

Code 5.2 Alternative OSx commands.


1 sudo easy_install numpy
2 sudo easy_install scipy
3 sudo easy_install pillow

These three installations add onto Python the ability to manipulate vectors and
matrices. They also add a large scientific computing library and the ability to read and
write most types of images.

5.1.3 UNIX

Flavors of Linux should have Python installed. Some even have both versions of Python.
However, installations may not include an integrated development environment (IDE) or
other required modules. Use the Linux software manager to get the additional modules
as needed.
Users need to install:

• numpy, which is the numerical library,

• scipy, which is the scientific computing library, and

• the Python Image Library (also known as PIL or PILLOW), which provides some basic
image tools.

5.2 Setting up a Directory Structure

In any programming language it is a good idea to establish a directory structure before
getting started. Commonly, a project (such as a homework assignment) warrants the
creation of its own directory. Subdirectories could include: data, documents, computer
codes and results. An example starts with Figure 5.1 which would be the homework
directories for a course. This image is from the Ubuntu operating system and so it looks
different than it would on Windows or OSx. However, the logic is the same. The buttons
across the top show the directory structure which is my directory for this course and
currently the view is of the directory named HW which is where the homework assignments
are kept. The two icons show that directories for the first two homework assignments have
been created.

Figure 5.1: The top working directory.

Figure 5.2 shows the content of HW1 and inside of these subfolders the appropriate
files can be stored. There is a folder for the data and another for documents. The folder
pysrc is where the Python source code will be contained. The folder results is where the
results from the computation would be stored.

Figure 5.2: The working subdirectory.

Failure to create these directories will eventually lead to lost files. Consider a case
where a student has several homework assignments and he decides to name the file created
by his programs by the name output.txt. If the student is using a single directory then
it becomes quite easy for one homework assignment to erase the results from a previous
assignment. Furthermore, when it is time to turn in the files for the assignment it is
possible to turn in files from a different assignment since all of the files are in the same
directory.
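The same structure can also be created from Python itself. This is a minimal sketch using the standard os module; the subfolder names follow the figures above, and the root name HW1 is just an example.

```python
import os

root = "HW1"
for sub in ("data", "docs", "pysrc", "results"):
    # exist_ok avoids an error if the directory is already present.
    os.makedirs(os.path.join(root, sub), exist_ok=True)

print(sorted(os.listdir(root)))
```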

5.3 Online Python

There are several online Python resources. Most of these have the basic Python installation
but do not have all of the capabilities that we need.
So far, only one resource has been identified for this course that has the following
components:

• Numpy

• Scipy

• Python Image Library

• The ability to upload and download data files.

This resource is: http://pythonanywhere.com. Access to this system is free but
requires that an account be made. Pythonanywhere can be tried without an account
at: https://www.pythonanywhere.com/try-ipython/. However, this does not have all
of the tools that we need in this class.

Chapter 6

Python Data and Computations

Python is a very powerful language that can perform an extensive variety of tasks. Only
those tasks pertinent to computations for biological applications are reviewed here. The
reader should be quite aware that Python is a far more extensive language than this book
presents.

6.1 Comments

Comments can be inserted into Python code with the # sign. All text following this sign
is not read by the Python interpreter. Most Python editors will color code a comment
line. An example is shown in line 1 of Code 6.1 in which everything to the right of the #
symbol is ignored by Python.
Comments are purely for the human to keep notes on what a program is doing or
what variables mean. Comments can consume several lines but each line must start with
#. Comments are highly recommended particularly if the script will be read by other
users or will be used for multiple purposes. It is easy to understand what a script is doing
just after it is written, but two years later the programmer may have forgotten the purpose
of some of the lines of code. Comments will help jog the memory.
As a rule: you don’t have enough comments in your script.

6.2 Numerical Data

There are two main types of data: numerical and characters. This section will review
methods by which Python can represent numbers.

6.2.1 Assignment

Python, like all programming languages, has variables which can adopt a numerical or
string value. In Python the assignment of a numerical value is quite easy as shown in
Code 6.1. The variable name can have several letters and even numbers (as long as the
number is not the first character) as seen in line 2. Capital letters are treated as being
different from lowercase letters. Printing a value to the console is performed using the
print function as shown in the last lines of Code 6.1.

Code 6.1 Variable assignment.

1 >>> a = 5 # This is a comment
2 >>> bcd = 10
3 >>> print a # Python 2.7
4 5
5 >>> print ( a ) # Python 3. x
6 5

6.2.2 Simple Computations

Variables can be used in mathematical computations as shown in Code 6.2. Python uses
the standard math symbols as shown in Table 6.1.

Code 6.2 Simple math.

1 >>> abc = 10
2 >>> a + abc
3 15
4 >>> a * abc
5 50

Table 6.1: Math functions.

Function Symbol
Addition +
Subtraction −
Multiplication *
Division /
Power **
Modulus %

Numerical data can be stored in several formats. Most languages offer the ability to
store integer values or floating point values. The precision of these can also be specified.

Python does have a complex data type which is not very common among other languages.
In early computers with small amounts of memory the designation of precision was impor-
tant. In today’s modern 64-bit computers this designation is no longer a concern. Some
of the data types in Python are:

• int: An integer. No decimal values are allowed.

• float: Capable of storing a decimal value.

• long: An integer with a much larger range of values (Python 2.7 only; in Python 3.x
the int type has unlimited range).

• complex: A complex-valued number. An example is 1 + 2j, where the symbol j is the
square root of -1.

Very large values are presented in scientific notation. An example of scientific
notation is: 42300 = 4.23 × 10^4. Computer languages use the 'e' or 'E' symbol to denote the
exponent value. So, the number 42300 in Python can be entered as shown in Code 6.3.

Code 6.3 Exponential notation.

1 >>> 423e2
2 42300.0


Complex numbers are represented in engineering notation, where j = √−1. Line 1 in
Code 6.4 creates a complex value. The real and imaginary parts are extracted as shown
in the Code.

Code 6.4 Complex values.

1 >>> g = 3 + 1j
2 >>> g
3 (3+1j)
4 >>> g . real
5 3.0
6 >>> g . imag
7 1.0

Converting from one type to another requires the use of a keyword such as int,
float, complex, etc. It should be noted that the int typecast will simply eliminate the
decimal part of the number. In order to compute the rounded value the round function
can be used. These conversions are shown in Code 6.5
Errors can occur in rounding if the variable is exactly halfway between two integers.
Consider line 1 in Code 6.6 where the value of 4.5 is correctly rounded to 5. This was
performed in Python 2.7. Line 3 shows the same operation in Python 3.x and as seen the
Code 6.5 Type conversion.

1 >>> float ( 5 )
2 5.0
3 >>> int ( 6.7 )
4 6
5 >>> round ( 6.7 )
6 7.0

rounding function returned 4. This is not actually an error: Python 3 rounds halfway cases
to the nearest even integer (banker's rounding). Line 5 shows the case in which a very tiny
bit is added to force the rounding function to round up.

Code 6.6 Rounding error.

1 >>> round (4.5) # Python 2.7
2 5.0
3 >>> round (4.5) # Python 3.x
4 4
5 >>> round (4.500000000000001)
6 5
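The Python 3.x behavior is the documented round-half-to-even rule, sometimes called banker's rounding: exact halfway cases go to the nearest even integer. A quick check:

```python
# Halfway cases round to the nearest even integer in Python 3.
halves = [round(x) for x in (0.5, 1.5, 2.5, 3.5, 4.5)]
print(halves)   # [0, 2, 2, 4, 4]
```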

The result of a computation tends to return a value whose data type is the same as
the most complicated data type in the computation. For example, if an integer is added
to a float then a float is returned. If a float is added to a complex number then a complex
number is returned.
The exception to this rule is integer division in Python 3.x. Integer division in
Python 2.7 returns an integer as shown in the first two lines of Code 6.7. Thus a division
such as 8/9 would return a 0. Python 3.x behaves differently and returns a floating point
value as seen in the last two lines.

Code 6.7 Integer division.

1 >>> 9/4 # Python 2.7
2 2
3 >>> 9/4 # Python 3.x
4 2.25
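Python 3.x still provides integer division through the separate // operator, which is worth knowing when porting 2.7 code. A quick check:

```python
# In Python 3 the / operator always returns a float; the // operator
# performs floor (integer) division, and % returns the remainder.
print(9 / 4)    # 2.25
print(9 // 4)   # 2
print(9 % 4)    # 1
```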

6.2.3 Algebraic Hierarchy

Common in programming languages is the adherence to algebraic hierarchy. These rules


govern the order in which computations are performed. The hierarchy is,

1. Power,

2. Multiplication and Division, and

3. Addition and Subtraction.

Consider Code 6.8 which shows a simple computation in line 1. If the process is
done in the order shown then 2 + 5 is 7 and that multiplied by 3 is 21. However, the
answer is 17. The reason is that the multiplications are performed before the additions.

Code 6.8 Algebraic hierarchy.

1 >>> 2 + 5 * 3
2 17
3 >>> (2+5) * 3
4 21

Users can control which calculations are performed first by enclosing them in paren-
thesis. Line 3 shows this by enclosing the 2 + 5 in parenthesis and thus this is performed
before the multiplication.

6.2.4 The Math Module

Python does come with a math module that contains basic functions. Code 6.9 shows the
import statement in line 1 that will read in all of the math functions. Not all of them are
shown here.
To raise a number to a power, such as xy , the pow function is used as shown in
line 2. This will perform 34 which produces the answer of 81. The opposite function is
the square root which is called by sqrt as shown in line 4. This computation could also
be performed with the pow function as shown in line 6. In fact, functions such as a cube
root can be performed with the pow function by using 1/3 as the second argument.

Code 6.9 Algebraic functions.

1 >>> from math import *
2 >>> pow (3, 4)
3 81.0
4 >>> sqrt (40)
5 6.324555320336759
6 >>> pow (40, 1./2)
7 6.324555320336759
8 >>> hypot (3 ,4)
9 5.0

The last function shown is the hypot which computes the hypotenuse of a right
triangle with the length of the sides being used as the argument.
This module also contains several trigonometric functions such as sine, cosine, and
tangent, their inverse functions, and the hyperbolic functions for all. Code 6.10 shows
a simple example of computing the sine of the angle π/2. Like most computer languages
the computation assumes that the input argument is in radians and not degrees. However,
the module provides two conversion functions radians and degrees. Line 3 shows the
conversion of an angle in degrees to radians before the sine is computed.

Code 6.10 Trigonometric functions.

1 >>> sin ( pi /2)


2 1.0
3 >>> sin ( radians (90) )
4 1.0

The module includes the following trigonometric functions:

• acos: inverse cosine,
• acosh: inverse hyperbolic cosine,
• asin: inverse sine,
• asinh: inverse hyperbolic sine,
• atan: inverse tangent,
• atan2: inverse tangent sensitive to quadrants,
• atanh: inverse hyperbolic tangent,
• cos: cosine,
• cosh: hyperbolic cosine,
• degrees: convert radians to degrees,
• pi: the value of π,
• radians: convert degrees to radians,
• sin: sine,
• sinh: hyperbolic sine,
• tan: tangent, and
• tanh: hyperbolic tangent.

Code 6.11 Exponential functions.

1 >>> e
2 2.718281828459045
3 >>> exp(1)
4 2.718281828459045
5 >>> log(100)
6 4.605170185988092
7 >>> log10(100)
8 2.0
9 >>> log2(100)
10 6.643856189774724
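Code 6.11 shows the exponential functions: e is Euler's number, exp(x) computes e raised to the power x, and log is the natural (base-e) logarithm, which is why log(100) is about 4.6 rather than 2. The base-10 and base-2 logarithms are log10 and log2. As a brief supplemental sketch, exp and log undo one another, and any base can be reached from the natural log by the change-of-base rule:

```python
from math import exp, log

x = 100.0
# exp and log are inverses, so the round trip recovers x.
roundtrip = exp(log(x))
# Change of base: log(x) / log(10) equals log10(x).
base10 = log(x) / log(10)

print(round(roundtrip, 9), round(base10, 9))  # 100.0 2.0
```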


6.3 Python Collections

Python offers four built-in ways of collecting items. The syntax presented in this chapter
is specific to Python, although similar structures appear in many other languages.

• tuple

• list

• dictionary

• set

6.3.1 Tuple

A tuple is a collection of items that cannot be changed. The tuple can contain almost any
type of data such as floats, strings, other tuples, etc. A tuple is enclosed in parentheses,
as shown in Code 6.12, which creates a tuple and then prints the contents to the
screen.

Code 6.12 A tuple.

1 >>> a = (2 ,4 , " howdy " , 5 , " CDS 130 Rocks " )


2 >>> print(a)
3 (2, 4, 'howdy', 5, 'CDS 130 Rocks')

To get single items from a tuple, square brackets are used as shown in Code 6.13.
The number inside of the square brackets is the item number from the tuple. Python,
like C and Java, starts counting at 0. Yes, it is weird, but it does make sense if one
understands how computers point to data in the memory. Anyway, the first item in the
tuple is retrieved in the first line of code. The second and third items are retrieved in
subsequent lines.
The last item in a tuple is retrieved with the index -1, as shown in line 1 of Code 6.14.
Line 2 shows the retrieval of the next-to-last item.
It is possible to get several consecutive items as shown in Code 6.15. Now there are
two numbers inside of the square brackets. The first is the starting point and the second is
the ending point. However, it should be noted that the returned data includes the starting
point but excludes the ending point. This command retrieves items a[0], a[1], and a[2]
but does not retrieve a[3].

Code 6.13 Accessing elements in a tuple.

1 >>> a [0]
2 2
3 >>> a [1]
4 4
5 >>> a [2]
6 ' howdy '

Code 6.14 Accessing the last elements in a tuple.

1 >>> a [-1]
2 ' CDS 130 Rocks '
3 >>> a [-2]
4 5

Some more examples are shown in Code 6.16. Line 1 retrieves the second, third and
fourth items. Line 3 is the same as line 1 in the previous code; the 0 is not necessary in
this case. Line 5 gets the last two items.

6.3.2 List

A tuple cannot be altered. A list is similar to a tuple except that it can be altered. A list
uses square brackets. Line 1 in Code 6.17 creates a list with four items in it. It should be
noted that the last item in the list is the tuple defined above.
An item in a list can be replaced. Line 1 in Code 6.18 changes the first item in the
list.
A list can grow. The append command will attach a new item onto the end of the
list as shown in Code 6.19.

6.3.3 Dictionary

A dictionary is similar to a hash table in other languages. The idea is similar to a word
dictionary which contains thousands of entries. Each entry is a word and its definition.
However, a person can only search on the word and can not do a search on the definition.
A dictionary uses curly braces. Line 1 in Code 6.20 creates an empty dictionary and
line 2 creates the first entry in the dictionary. The key is the item in the square brackets
and the value is the item(s) to the right of the equals sign.
The key can be an integer, a float, a tuple or a string. Line 3 in Code 6.20 uses a
string as the key and the value is a tuple. To retrieve an item from a dictionary the key

Code 6.15 Accessing consecutive elements in a tuple.

1 >>> a [0:3]
2 (2 , 4 , ' howdy ' )

Code 6.16 Accessing consecutive elements at the end of a tuple.

1 >>> a [1:4]
2 (4 , ' howdy ' , 5)
3 >>> a [:3]
4 (2 , 4 , ' howdy ' )
5 >>> a [-2:]
6 (5 , ' CDS 130 Rocks ' )

Code 6.17 A list.


1 >>> b = [ 45 , 4.6 , " Hello All " , a ]
2 >>> b
3 [45 , 4.6 , ' Hello All ' , (2 , 4 , ' howdy ' , 5 , ' CDS 130 Rocks ' ) ]

Code 6.18 Changing an element in a list.

1 >>> b[0] = -1
2 >>> b
3 [-1, 4.6, 'Hello All', (2, 4, 'howdy', 5, 'CDS 130 Rocks')]

Code 6.19 Appending an element to a list.

1 >>> b . append ( ' More ' )


2 >>> b
3 [-1 , 4.6 , ' Hello All ' , (2 , 4 , ' howdy ' , 5 , ' CDS 130 Rocks ' ) , '
More ' ]

Code 6.20 A dictionary.

1 >>> a = { }
2 >>> a [0] = ' my data '
3 >>> a [ ' John ' ] = ( ' 507 Main ' , ' Cincinnati ' , ' Ohio ' )

is used, as shown in Code 6.21.

Code 6.21 Accessing data in a dictionary.

1 >>> a [0]
2 ' my data '
3 >>> a [ ' John ' ]
4 ( ' 507 Main ' , ' Cincinnati ' , ' Ohio ' )
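A few more dictionary operations are worth a brief supplemental sketch (these lines are not from the code above): the in operator tests whether a key exists, and get retrieves a value without raising a KeyError for a missing key:

```python
a = {}
a[0] = 'my data'
a['John'] = ('507 Main', 'Cincinnati', 'Ohio')

# Membership tests examine the keys, not the values.
print('John' in a)    # True
print('Ohio' in a)    # False: 'Ohio' is part of a value, not a key

# get returns None (or a supplied default) instead of raising KeyError.
print(a.get('Jane'))             # None
print(a.get('Jane', 'unknown'))  # unknown
```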

6.3.4 Set

A set is just like the sets that were studied in elementary school. It is possible to perform
intersections and unions as shown in Code 6.22.

Code 6.22 Sets.


1 >>> c = set ( (1 ,2 ,3) )
2 >>> d = set ( (3 ,4 ,5) )
3

4 >>> c.union(d)
5 {1, 2, 3, 4, 5}
6

7 >>> c.intersection(d)
8 {3}
9

10 >>> g = (1, 2, 3, 4, 3, 2, 1, 2, 3)
11 >>> set(g)
12 {1, 2, 3, 4}
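Beyond union and intersection, sets also offer difference and symmetric_difference. A small supplemental sketch (sorted is used only to give the printed output a predictable order):

```python
c = set((1, 2, 3))
d = set((3, 4, 5))

# Elements of c that are not in d.
print(sorted(c.difference(d)))            # [1, 2]
# Elements that appear in exactly one of the two sets.
print(sorted(c.symmetric_difference(d)))  # [1, 2, 4, 5]
```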

6.3.5 Slicing

Slicing is the term used for the extraction of part of the information from a tuple, list, etc.
Line 1 in Code 6.21 is an example. To further demonstrate slicing techniques, consider
the tuple defined in Code 6.23. The number of items is obtained with the len command as
shown in line 3.
Several examples are shown in Code 6.24. Line 1 shows the retrieval of the first
item and line 3 the second item. Line 5 shows the retrieval of the last
item. Line 7 shows the method by which the first 4 items are retrieved; lines 7 and 9 are
equivalent. Finally, the last five items are obtained using the command in line 11.
Further examples are shown in Code 6.25. Line 1 obtains every other item in the
tuple. It starts at location 0, ends at location 20, and steps 2 items. The latter number

Code 6.23 Length of a collection.

1 >>> a = (5, 6, 7, 4, 2, 4, 6, 'string', 'snow days', 3, 1, -1,
2 'GMU', 5, 6, 7, 8, 9, 0, 0)
3 >>> len(a)
4 20

Code 6.24 Slicing examples.

1 >>> a [0]
2 5
3 >>> a [1]
4 6
5 >>> a [-1]
6 0
7 >>> a [:4]
8 (5 , 6 , 7 , 4)
9 >>> a [0:4]
10 (5 , 6 , 7 , 4)
11 >>> a [-5:]
12 (7 , 8 , 9 , 0 , 0)

indicates that it is getting every second item. Line 3 performs the same extraction for the
entire tuple.
Line 5 starts at the end of the tuple and steps backwards because the step value is -1,
giving the items in reverse order. Note that the stop index 0 is excluded, so the first item
does not appear in the result. Line 8 performs the same reversal for the entire tuple,
including the first item.
Consider the case in Code 6.26 in which a tuple named a is inserted into a tuple
named b. The third item in b is obtained by line 3. This is the entire tuple a. To get
individual components from the inner tuple, two slicing components are required as shown
in line 5. Here b[2] is an entire tuple and thus (b[2])[-1] is the last item in that inner
tuple.
Code 6.27 shows a slightly different process in which the tuple a is inserted into a
list named c. Line 2 appends another item to this list, and line 5 inserts a string at position 1.
Items can be removed from a list in two ways. The first is to use the pop function
shown in Code 6.28. This will remove the item from the list and a variable can be assigned
the value that is removed. The argument to the function is the location of the data that
is to be removed. So, pop(0) means that the first item will be removed from the list and
the variable g will become that first item.
The second method shown in Code 6.29 which uses the remove function. This will
remove an item from the list but the argument must be the data that is to be removed.

84
Code 6.25 More slicing examples.

1 >>> a [0:20:2]
2 (5 , 7 , 2 , 6 , ' snow days ' , 1 , ' GMU ' , 6 , 8 , 0)
3 >>> a [::2]
4 (5 , 7 , 2 , 6 , ' snow days ' , 1 , ' GMU ' , 6 , 8 , 0)
5 >>> a [20:0:-1]
6 (0 , 0 , 9 , 8 , 7 , 6 , 5 , ' GMU ' , -1 , 1 , 3 , ' snow days ' , ' string ' ,
7 6 , 4 , 2 , 4 , 7 , 6)
8 >>> a [::-1]
9 (0 , 0 , 9 , 8 , 7 , 6 , 5 , ' GMU ' , -1 , 1 , 3 , ' snow days ' , ' string ' ,
10 6 , 4 , 2 , 4 , 7 , 6 , 5)

Code 6.26 Accessing a collection inside of a collection.

1 >>> a = (1 ,2 ,3)
2 >>> b = ( ' hi ' , ' hello ' , a , ' guten tag ' )
3 >>> b [2]
4 (1 , 2 , 3)
5 >>> b [2][-1]
6 3

Code 6.27 Insertion into a list.


1 >>> c = [ ' hi ' , ' hello ' , a , ' guten tag ' ]
2 >>> c . append ( ' bon jour ' )
3 >>> c
4 [ ' hi ' , ' hello ' , (1 , 2 , 3) , ' guten tag ' , ' bon jour ' ]
5 >>> c . insert (1 , " G ' day mate " )
6 >>> c
7 [ ' hi ' , " G ' day mate " , ' hello ' , (1 , 2 , 3) , ' guten tag ' , ' bon
jour ' ]

Code 6.28 The pop function.

1 >>> g = c . pop (0)


2 >>> g
3 ' hi '
4 >>> c
5 [ " G ' day mate " , ' hello ' , (1 , 2 , 3) , ' guten tag ' , ' bon jour ' ]

If there are two instances of the data then only the first one is removed. For example, if
the list c had two strings “guten tag” then the function in line 1 would have to be called
twice to remove them both.

Code 6.29 The remove function.


1 >>> c . remove ( ' guten tag ' )
2 >>> c
3 [ " G ' day mate " , ' hello ' , (1 , 2 , 3) , ' bon jour ' ]

6.4 Strings

In some cases the available data is represented as characters rather than numerals. For
example, DNA is represented as a string of letters. These long strings are then analyzed
by algorithms. Thus, it is necessary to understand how strings are managed within a
computer program.

6.4.1 String Definition and Slicing

A string is a collection of characters. Strings can be defined by using either single quotes
or double quotes as shown in Code 6.30.

Code 6.30 Creating a string.

1 >>> st1 = ' this is a string . '


2 >>> st2 = " this is also a string . "

Extracting characters from a string is performed through slicing, using the same rules
as slicing a tuple or list. A few examples are shown in Code 6.31. Line 1 retrieves the
first character, line 3 retrieves the first 7 characters, and line 5 retrieves the string in reverse order.

Code 6.31 Simple slicing in strings.

1 >>> st1 [0]


2 't '
3 >>> st1 [:7]
4 ' this is '
5 >>> st1 [::-1]
6 ' . gnirts a si siht '

6.4.1.1 Special Characters

Code 6.32 shows a string in line 1 that contains a \t and a \n. The first is the tab character
and the latter is the newline character. When the string is simply echoed (line 2) the
escape sequences are displayed; when the print function is used, the characters take effect.

Code 6.32 Special characters.

1 >>> astr = 'aaaa\tbbbb\nccccc'
2 >>> astr
3 'aaaa\tbbbb\nccccc'
4 >>> print(astr)
5 aaaa    bbbb
6 ccccc

6.4.1.2 Concatenation

A string can not be changed, but it is possible to create a new string from the concatenation
of two strings. An example is shown in Code 6.33. Two strings are created and in line 3
the plus sign is used to create a new string from the two older strings.

Code 6.33 Concatenation.


1 >>> str1 = ' abcde '
2 >>> str2 = " efghi "
3 >>> str3 = str1 + str2
4 >>> str3
5 ' abcdeefghi '

6.4.1.3 Repeating Characters

Creating a string of repeating characters is accomplished by using the multiplication sign.


This is shown in Code 6.34.

Code 6.34 Repeating characters.

1 >>> 5 * 'y'
2 'yyyyy'
3 >>> 5 * 'cat'
4 'catcatcatcatcat'

6.4.2 Functions

Several functions are defined to manipulate strings and return information about their
contents. Only the functions used in the subsequent chapters are reviewed here.
The find command will find the location of a substring within a string. Three
examples are shown in Code 6.35. Line 1 finds the first occurrence of “is” in string st1.
The function returns 2 which means that the target is found starting at st1[2]. There
are two instances of “is” inside of st1 and this function only returns the first instance.
Line 5 starts the search at position 3 which is after the location of the first occurrence of
“is.” Thus, it finds the second occurrence which is at position 5. Line 9 starts the search
after position 5 and the return is a -1. This indicates that the search found no occurrence
of “is” from the given starting point.

Code 6.35 Using the find function.

1 >>> st1 . find ( ' is ' )


2 2
3 >>> st1 [2:]
4 ' is is a string . '
5 >>> st1 . find ( ' is ' ,3 )
6 5
7 >>> st1 [5:]
8 ' is a string . '
9 >>> st1.find('is', 6)
10 -1

The count function counts the number of occurrences of a target string. An example
is shown in line 1 of Code 6.36. The rfind function performs a reverse search, or in other
words finds the last occurrence of the target as seen in lines 3 and 4.

Code 6.36 Using the count function.

1 >>> st1 . count ( ' is ' )


2 2
3 >>> st1 . rfind ( ' is ' )
4 5

The case of a string can be forced by the commands upper and lower as shown in
Code 6.37.
The split function will split a string into substrings. This is shown in lines 1 through
3 in Code 6.38. Line 3 shows the result which is a list of strings. The string st1 was split
on a blank space and this blank space is not in any of the substrings in line 3. It is possible
to split on any character (or characters) by placing that character(s) as the argument to

Code 6.37 Conversion to upper or lower case.

1 >>> str1 . upper ()


2 ' ABCDE '
3 >>> str1 . lower ()
4 ' abcde '

the function. An example is shown in line 10. Notice that the splitting argument is not
included in any of the strings. The answer list has an empty string because there was
nothing between the first two instances of “is”.

Code 6.38 Using the split and join functions.

1 >>> alist = st1 . split ( ' ' )


2 >>> alist
3 [ ' this ' , ' is ' , ' a ' , ' string . ' ]
4 >>> st3 = ' X ' . join ( alist )
5 >>> st3
6 ' thisXisXaXstring . '
7 >>> st4 = ' ' . join ( alist )
8 >>> st4
9 ' thisisastring . '
10 >>> st4.split('is')
11 ['th', '', 'astring.']

The join function is the opposite of split. The first example is shown in line 4. The
string 'X' is the glue: as seen in line 6, the join function created a single string that
consists of all of the strings in the list glued together by the string in front of the
join command. Line 7 shows the second example; this time the glue is an empty string,
so as seen in line 9, all of the strings from the list are put together with nothing between
them.

6.4.2.1 Replacing Characters

It is possible to replace a substring with another using the replace function. Consider
Code 6.39 which starts with the definition of a DNA string in lines 1 and 2. All lowercase
‘a’s are replaced by uppercase in line 3. The result is shown in line 5, and it is possible to
replace more than just single characters as shown in lines 7 through 10.
DNA is a double helix structure and the complement of one helix is contained on the
other strand. To create the complement string the ‘a’ and ‘t’ characters are exchanged.
So, a ‘t’ is located wherever there is an ‘a’ in the original sequence. The letters ‘c’ and ‘g’
are also swapped. Finally, the complement is in reversed order of the original.

Code 6.39 Using the replace function.

1 >>> st1 = 'atgactagcactacgacggactacgacgactacgacgactacagcatca
2 tttattacgactacag'
3 >>> st2 = st1 . replace ( ' a ' , ' A ' )
4 >>> st2
5 ' AtgActAgcActAcgAcggActAcgAcgActAcgAcgActAcAgcAtcAtttAttAcg
6 ActAcAg '
7 >>> st3 = st1 . replace ( ' at ' , ' AT ' )
8 >>> st3
9 ' ATgactagcactacgacggactacgacgactacgacgactacagcATcATttATtacg
10 actacag '

A swap requires three steps. It is not possible to just convert all ‘a’s to ‘t’s because
the new string would have both the new and old ‘t’s. So, it is necessary to convert the
‘a’s to something other than the letters contained in the string. This was accomplished
in line 3 in Code 6.39. That was the first of the three steps. The next two are shown in
Code 6.40 where the ‘t’s are converted to ‘a’s and then the old ‘a’s are converted to ‘t’s.
Code 6.40 Creating a complement string.

1 >>> st4 = st2 . replace ( ' t ' , ' a ' )


2 >>> st5 = st4 . replace ( ' A ' , ' t ' )
3 >>> st5
4 ' tagtcatgctcatcgtcggtcatcgtcgtcatcgtcgtcatctgctactaaataatcgt
5 catctg '
6 >>> st6 = st5 . replace ( ' c ' , ' C ' )
7 >>> st7 = st6 . replace ( ' g ' , ' c ' )
8 >>> st8 = st7 . replace ( ' C ' , ' g ' )
9 >>> st9 = st8 [::-1]
10 >>> st9
11 ' ctgtagtcgtaataaatgatgctgtagtcgtcgtagtcgtcgtagtccgtcgtagtgc
12 tagtcat '

The output in lines 4 and 5 show a string where the ‘a’s and ‘t’s have been swapped.
The same process then needs to be applied to swap the ‘c’s and ‘g’s. This is performed
in lines 6 through 8. Finally, the string is reversed in line 9 to finish the creation of the
complement DNA string.

6.4.2.2 Replacing Characters with a Table

In the previous section there were only two types of swaps that needed to be performed.
Other applications may require a much larger array of substitutions and for those the

previous method becomes cumbersome. A more efficient method uses a look-up table. This
process is shown in Code 6.41. Line 2 creates the table using the maketrans function,
which builds a look-up table in which each character in the first string is replaced by the
respective character in the second string. Line 3 applies this table to the DNA using the
translate function. The output comp has all of the characters replaced with their new
counterparts, and line 4 reverses the string.

Code 6.41 Using the maketrans and translate functions.

1 >>> import string


2 >>> table = st1 . maketrans ( ' acgt ' , ' tgca ' )
3 >>> comp = st1 . translate ( table )
4 >>> comp = comp [::-1]
5 >>> comp
6 ' ctgtagtcgtaataaatgatgctgtagtcgtcgtagtcgtcgtagtccgtcgtagtgct
7 agtcat '
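The steps in Code 6.41 can be gathered into a small reusable function. This is a sketch of my own (the name revcomp is not from the text), written for lowercase DNA strings:

```python
def revcomp(dna):
    """Return the reverse complement of a lowercase DNA string."""
    # Map each base to its complement, then reverse the result.
    table = str.maketrans('acgt', 'tgca')
    return dna.translate(table)[::-1]

print(revcomp('atgc'))     # gcat
print(revcomp('gattaca'))  # tgtaatc
```

Applying the function twice returns the original sequence, which is a handy sanity check.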

6.5 Converting Data

A string with numerical data can be converted to a numerical form using the appropriate
command. Examples are shown in Code 6.42. The first two lines convert strings to an
integer or a float. The third line converts a number into a string.

Code 6.42 Converting data.

1 >>> a = int ( ' 4 ' )


2 >>> b = float ( ' 5.6 ' )
3 >>> st = str ( 4 )
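A failed conversion raises a ValueError, so conversions of text that might not be numeric are often wrapped in try/except. A supplemental sketch (the helper name to_float is my own):

```python
def to_float(text, default=0.0):
    """Convert text to a float, returning default if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return default

print(to_float('5.6'))    # 5.6
print(to_float('oops'))   # 0.0
```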

6.6 Example: Romeo and Juliet

This section will show a few examples of string manipulation using the play Romeo and
Juliet.
The first question is: Which person is named the most frequently? The answer is
shown in Code 6.43. The data is loaded in lines 1 and 2. Line 3 counts the number of
occurrences of “Romeo” and line 5 counts the number of occurrences of “Juliet.” As seen
Romeo is mentioned significantly more times. Even Tybalt is mentioned more often than
Juliet.
The second question is which person is named first, Romeo or Juliet? Line 1 in Code
6.44 finds the first occurrence of “Romeo” and the returned result is 0. That means that

Code 6.43 Counting names in the play.

1 >>> fname = ' data / romeojuliet . txt '


2 >>> data = open(fname).read()
3 >>> data . count ( ' Romeo ' )
4 130
5 >>> data . count ( ' Juliet ' )
6 48
7 >>> data . count ( ' Tybalt ' )
8 57

the very first word in the file is “Romeo.” This makes sense since the name of the play is
the first part of the file.

Code 6.44 The first Romeo.


1 >>> data . find ( ' Romeo ' )
2 0
3 >>> data . find ( ' SCENE I ' )
4 721
5 >>> data . find ( ' Romeo ' , 721)
6 6570
7 >>> data . find ( ' Juliet ' , 721)
8 18057

So, the search is modified to start after Shakespeare writes “SCENE I.” Line 3 finds
the location of this string and line 5 begins the search at that location thus excluding the
title from the search. As seen the first location of “Romeo” after the play starts is at
position 6570. “Juliet” appears much later than that, so Romeo is mentioned first.
The third question is: whose name ends the most sentences? The process is similar
except that the search string includes a period. Code 6.45 shows the results, and as seen
Romeo wins again.

Code 6.45 Counting Romeo and Juliet at the end of sentences.

1 >>> data . count ( ' Romeo . ' )


2 6
3 >>> data . count ( ' Juliet . ' )
4 5

The fourth question is: how many unique words did Shakespeare use? This is
a bit tricky and the results shown are not exactly correct. The text includes
upper and lower case letters and thus would treat “The” and “the” as different words.

Furthermore, all punctuation is included so “Romeo” is counted differently than “Romeo.”
or “Romeo,”. So the results show the upper limit of the number of words and unique words
that were used. Line 1 in Code 6.46 shows that upper limit on total words to be 25,643.
The set command will eliminate duplicates and so this command can be used to find the
number of unique entries which in this case is 6338.

Code 6.46 Collecting individual words.

1 >>> temp = data . split ()


2 >>> len ( temp )
3 25643
4 >>> len ( set ( temp ) )
5 6338
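The case and punctuation problems described above can be reduced by normalizing each word before counting. The following supplemental sketch applies the idea to a short sample line rather than the full play file:

```python
import string

sample = 'Romeo, Romeo! wherefore art thou Romeo?'

# Lowercase each word and strip leading/trailing punctuation so that
# 'Romeo,' and 'romeo' count as the same word.
words = []
for raw in sample.split():
    word = raw.lower().strip(string.punctuation)
    if word:
        words.append(word)

print(len(words), len(set(words)))  # 6 4
```

Applied to the whole play, this would tighten the upper limits reported in Code 6.46.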

Problems

1. Assign variables aa the value of 4 and bb the value of 9. Compute cc which is the
addition of these numbers.

2. Compute the square of 17.

3. Compute the fourth root of 81.

4. Write a Python script to round the following values: 8.2, 4.5, 9.8.

5. Put the following items in a tuple: 5, 6.7, ’a string’, 4, 1+6j. Return the length of
the tuple.

6. In the tuple in the previous problem, retrieve every other item and print to the
console.

7. Convert the tuple created in problem 5 to a list. Append ’New String’ to that list.

8. Create a string of the alphabet and retrieve every third letter.

9. Create a string that is a lowercase alphabet. In a single Python command create a


new string which is the concatenation of the original and the uppercase version of
the original.

10. Create a string that is ’aeiouAEIOU’. Using maketrans, create a new string which
is ’AEIOUaeiou’.

Chapter 7

Python Logic Control

Most programming languages have a few commands that control the flow of the program.
These are used to repeatedly perform the same computation or to make decisions. Python
is no exception and control is managed by the if, while and for commands.

7.1 The if Command

The if command steers the program depending on the truth of a condition. For example,
suppose the program has two choices: if x > 5 then it does one thing, but if x is less than or
equal to 5 then it does something else. This is a decision. A simple example is shown
in Code 7.1 where the decision is made in line 1. If c > 5 then the program would do
whatever is in lines 2 and 3.

Code 7.1 The skeleton of an if statement.

1 if c >5:
2 command 1
3 command 2

Python is heavily reliant on the use of indentations. The if command ends with a
colon and then in this example the next two lines are indented. All of the lines that are
indented are inside of the if statement.
Python indentations must be consistent throughout. In the previous code the in-
dentations are 4 spaces. It is important that the commands have exactly the same number
of spaces. The program will not execute if line 2 has an indentation of 3 spaces and line
3 has an indentation of 4 spaces.
Editors such as IDLE insert 4 white spaces by default when the TAB key is pressed.
Other editors, however, may be set up to insert a TAB character when that same key is
pressed. Even though a TAB indentation may look the same as a 4-space indentation,
the two are not the same, and the Python interpreter will not accept the mixture. It is
prudent to ensure that every editor a user employs uses 4 spaces for indentation.
A working example of the if statement is shown in Code 7.2. The variable x is set
to 6 in line 1 and in line 2 the program checks to see if x is greater than 4. This is True
so it then executes line 3 and the result is shown in line 5.
Code 7.2 The if statement.
1 >>> x = 6
2 >>> if x >4:
3 print ( ' Yes ' )
4

5 Yes

This example used the greater than comparison. There are many comparisons as
shown in the following list.

• Greater than: >

• Less than: <

• Equal to: ==

• Greater than or equal to: >=

• Less than or equal to: <=

• Not equal to: !=

A second example is shown in Code 7.3 where two commands are executed if the if
statement is true. If line 1 were changed to if x<4: then neither one of the print statements
would be executed.
Code 7.3 Two commands inside of an if statement.
1 >>> if x >4:
2 print ( ' Yes ' )
3 print ( ' More yes ' )
4

5 Yes
6 More yes

7.1.1 The else Command

Code 7.4 uses the else statement, which executes commands when the if condition
is false. So, in this case, if line 1 is true then lines 2 and 3 are executed. If line 1 is false
then line 5 is executed.

Code 7.4 Using the else statement.

1 >>> if x >4:
2 print ( ' Yes ' )
3 print ( ' More yes ' )
4 else :
5 print ( ' No ' )

7.1.2 Complex Conditions

The condition for the if statement can include multiple tests as shown in Code 7.5. The
condition in line 3 uses the and and therefore both conditions must be true in order to
execute line 4. There are three words that are used in complex conditions:

• and
• not
• or

Code 7.5 A compound statement.

1 >>> x = 6
2 >>> y = 5
3 >>> if x >4 and y >3:
4 print ( ' OK ' )
5

6 OK

Similar to other languages, Python has a particular order in which conditions
are tested. If the conjunctions are the same (perhaps two and statements) then they are
evaluated in order of appearance. If the conjunctions are different (perhaps an and and
an or) then Python employs a structured hierarchy.
Consider Code 7.6 which has three conditions. The first condition is c>2, which is True,
and the second is a<0, which is False. The conjunction between them is or, so this
combination would be True. The next condition is b<0, which is False, and the
preceding conjunction is and. A strict left-to-right reading therefore suggests the entire
statement is False and that the word ’Yes’ would not be printed to the console. Yet ’Yes’
is printed.
Python uses a decision hierarchy in which all ands are considered before the ors.
Thus, in this case, a<0 and b<0 is evaluated first; this is False. The next
evaluation is c>2 or False, which is True, and so the statement is printed to the console.
Parentheses are used to control the order of evaluation. Code 7.7 shows how c>2
or a<0 can be forced to evaluate first; with this grouping the overall condition is False
and nothing is printed.

Code 7.6 A compound statement.

1 >>> a =1
2 >>> b =2
3 >>> c =3
4 >>> if c >2 or a <0 and b <0:
5 print ( ' Yes ' )

Code 7.7 Using parenthesis in a compound statement.

1 >>> a =1
2 >>> b =2
3 >>> c =3
4 >>> if ( c >2 or a <0 ) and b <0:
5 print ( ' Yes ' )
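The precedence rule can be checked directly by comparing the implicit grouping with the explicit one, using the same values as above (a supplemental sketch):

```python
a, b, c = 1, 2, 3

# and binds tighter than or, so the bare expression groups as
# c > 2 or (a < 0 and b < 0), which is True.
implicit = c > 2 or a < 0 and b < 0
# Forcing the or to evaluate first changes the answer to False.
grouped = (c > 2 or a < 0) and b < 0

print(implicit, grouped)  # True False
```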

7.1.3 The elif Statement

The elif statement is equivalent to else-if; it provides an additional test within an if
construct, as shown in Code 7.8. Line 1 determines if a is less than 3. If this is false then the condition
shown in Code 7.8. Line 1 determines if a is less than 3. If this is false then the condition
in line 3 will be considered. If the statement in line 1 is true then the statement in line 3
will not be considered. Line 6 is executed only if lines 1 and 3 are both false.

Code 7.8 Using the elif statement.

1 >>> if a <3:
2 print ( ' Yes ' )
3 elif b >0:
4 print ( ' No ' )
5 else :
6 print ( ' Maybe ' )
7

8 Yes

7.2 Iterations

Iterations are used to perform the same commands repeatedly. There are two main
methods by which this is accomplished: the while and for loops.

7.2.1 The while Loop

The while loop will repeatedly perform the same steps until a condition becomes False.
Code 7.9 sets anum equal to 0 in line 1. Line 2 starts the while loop, and as long as
anum is less than 4 it will execute lines 3 and 4. However, line 4 changes the value of anum
and eventually it becomes equal to 4 and the condition in line 2 is no longer True. Then
Python does not execute lines 3 and 4 and goes on to any steps that are after the while
loop. The condition statement can also use parenthesis and the keywords and, or or not.

Code 7.9 Using a while loop.

1 >>> anum = 0
2 >>> while anum < 4:
3 print ( anum )
4 anum = anum + 1
5

6 0
7 1
8 2
9 3

7.2.2 The for Loop

The for loop performs iterations over a finite collection such as a list or tuple. Line 1 in Code
7.10 creates a list named blist. The for loop is created in line 2 and the variable i will
become each item in the list. So, line 3 is executed four times and i is a different item in
the list in each of the iterations.

Code 7.10 Using a for loop.

1 >>> blist = [1 , ' GMU ' , ' snow days ' , 2 ]


2 >>> for i in blist :
3 print ( i )
4

5 1
6 GMU
7 snow days
8 2

In many applications it is desired that i just be an incrementing integer. For these


applications the range function is used to create a collection of integers for the iteration
variable. In earlier versions of Python the range function created a list of integers as shown
in Code 7.11. If one argument is used then range creates a list of integers starting with 0

and going up to but not including the argument in the command. If two arguments are
used (line 3) then they define the starting and ending points of the list. If three arguments
are used (line 5) then they represent start, stop and step.

Code 7.11 The range function in Python 2.7.

1 >>> range ( 10 )
2 [0 , 1, 2, 3, 4, 5, 6, 7 , 8 , 9]
3 >>> range (2 , 10)
4 [2 , 3, 4, 5, 6, 7, 8, 9]
5 >>> range ( 2 , 10 , 2 )
6 [2 , 4 , 6 , 8]
7 >>> for i in range ( 5 ):
8 print ( i , end = ' ' ) # py 3.4
9 print i , # py 2.7
10

11 0 1 2 3 4
12 >>> list ( range ( 10 ) ) # py 3.4

The range function is changed in Python 3.x and it no longer returns a list. However,
this is easily remedied by converting the return using the list function as shown in line
12.
The for loop in line 7 uses the range command, so i becomes the integers 0
through 4. The end=' ' argument to print (or, in Python 2.7, the trailing comma) prevents
Python from printing a new line with each iteration, so the output appears together on
line 11.

7.2.3 break and continue

Consider Code 7.12 in which an if statement resides inside of a for loop. For a couple
of iterations the if statement in line 3 is False. Eventually, it becomes True and then
line 4 is executed. The only command is the break command which immediately takes
the program outside of the for loop. To demonstrate this there are two print statements.
When i is 0, 1, or 2, the if statement is False and the print statement in line 5 is executed.
However, when i = 3, line 2 is printed. Next, the if statement is evaluated as True
and the break command executes, immediately stopping the iterations in the for
loop. As seen in line 10 the ‘CCC’ was not printed when i was 3. Furthermore, the value
of i never becomes 4.
The continue command is related to the break command. Code 7.13 shows an
example. Line 4 contains the continue command, and this command will terminate the
current iteration but will allow subsequent iterations to proceed. As seen in line 10, the
value of i does become 4.

Code 7.12 Using the break statement.

1 >>> for i in range(5):
2         print(i, end=' ')
3         if i > 2:
4             break
5         print('CCC')
6

7 0 CCC
8 1 CCC
9 2 CCC
10 3

Code 7.13 Using the continue statement.

1 >>> for i in range(5):
2         print(i, end=' ')
3         if i > 2:
4             continue
5         print('CCC')
6

7 0 CCC
8 1 CCC
9 2 CCC
10 3 4

7.2.4 The enumerate Function

Data that comes in a list or tuple may need index numbers to assist in further programming.
This is accomplished with the enumerate function. Line 1 in Code 7.14 creates a tuple of
five weekdays by name. The for loop uses the enumerate function to return both the index
and the value of each item in the original collection.

Code 7.14 Using the enumerate function.

1 >>> adata = ( ' Monday ' , ' Tuesday ' , ' Wednesday ' , ' Thursday ' , '
Friday ' )
2 >>> for a , b in enumerate ( adata ) :
3 print ( a , b )
4

5 0 Monday
6 1 Tuesday
7 2 Wednesday
8 3 Thursday
9 4 Friday
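enumerate also accepts an optional second argument giving the starting index, which is convenient when counting should begin at 1. A brief supplemental sketch:

```python
adata = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')

# The second argument to enumerate sets the first index value.
pairs = list(enumerate(adata, 1))
for number, day in pairs:
    print(number, day)  # 1 Monday ... 5 Friday
```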

7.3 Examples

This section displays several examples using the combination of control statements.

7.3.1 Example: The Average of Random Numbers

This example computes the average of a set of random numbers. Python offers the
function random, which generates a random number between 0 and 1. The numbers are
evenly distributed, which means that there is an equal chance of getting a value near 0.1
as one near 0.5. Thus, the average of many random numbers should be very close to 0.5.
This function resides in the random module and is shown in Code 7.15.
Code 7.15 Generating random numbers.

1 >>> import random


2 >>> random . random ()
3 0.8784931691114731

Code 7.16 creates an empty list and then fills it with 1000 random numbers. In line
3 a random number is generated and stored in a variable named r, and line 4 appends it
to the list.

Code 7.16 Collecting random numbers.

1 >>> coll = []
2 >>> for i in range ( 1000 ) :
3 r = random . random ()
4 coll . append ( r )

Code 7.16 is not the most efficient manner in which this can be done, but it shows
the steps involved.
The average of a set of numbers is computed by,

a = (1/N) Σ_{i=1}^{N} x_i. (7.1)

Here the individual variables are xi where i goes from 1 to N , where N is the number of
samples. The computation of the average is shown in Code 7.17. In line 1 the variable sm
is set to 0. The loop in lines 2 and 3 computes the sum of all of the numbers in the list
coll. Line 5 divides by the total number of samples. As seen, the average is very close to
0.5. If a large data set is used then the average will be even closer to 0.5.

Code 7.17 Computing the average.

1 >>> sm = 0
2 >>> for i in coll :
3 sm = sm + i
4

5 >>> sm /1000.
6 0.49551410985107763

The code shown above is not the most efficient implementation. Functions
from the numpy module can improve both coding efficiency and execution
speed. Line 2 in Code 7.18 creates a vector (an array) of 1000 random numbers and line 3
computes the average of that vector. Again the average is close to 0.5. These commands
are reviewed in Chapter 11.

Code 7.18 A more efficient method.


1 >>> import numpy as np
2 >>> c = np . random . rand (1000)
3 >>> c . mean ()
4 0.50965501200947183
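Between the explicit loop and numpy there is also the built-in sum function. The following sketch (our own addition, not from the text) computes the same average with it.

```python
import random

# The built-in sum function totals a list directly, avoiding the
# explicit accumulation loop.
coll = [random.random() for _ in range(1000)]
avg = sum(coll) / len(coll)
print(avg)   # close to 0.5
```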

7.3.2 Example: Text Search

In this example the task is to find all of the words that follow the word ‘the’. The text
that will be used is first converted to lowercase. The search looks for the letters ’the’
followed by a space. However, this process is not perfect: it will consider a word like
‘bathe’ to be a positive hit since it ends with ‘the ’ (including the space at the end).
Code 7.19 shows the script for loading the text file that contains the script from Romeo
and Juliet.

Code 7.19 Loading Romeo and Juliet.

1 >>> fname = ' data / romeojuliet . txt '


2 >>> data = open ( fname ) . read ()
3 >>> data = data . lower ()

The real work is done in Code 7.20. Line 1 starts with an empty list named answ.
The for loop started in line 2 sets the variable i to integer values up to 3 less than the
length of the string. The if statement in line 3 then determines whether the string at locations i
through i+4 has the four characters ‘the ’ (including the space). If it does, then the next
step is to isolate the word that follows that space. That word starts at i+4, but where
it ends is unknown. So, line 4 finds the location of the next space. This
location is stored in the variable end. Thus, the next word after the ‘the ’ starts at location
i+4 and ends at location end. This word is appended to the list answ.

Code 7.20 Capturing all of the words that follow ‘the ’.

1 >>> answ = []
2 >>> for i in range ( len ( data )-3) :
3 if data [ i : i +4] == ' the ' :
4 end = data . find ( ' ' ,i +4 )
5 answ . append ( data [ i +4: end ] )
6

7 >>> len ( answ )


8 672
9 >>> answ [:10]
10 [ ' fatal ' , ' fearful ' , ' continuance ' , ' two ' , ' which ' , ' house ' ,
11 ' collar .\ n \ nsampson \ n \ n ' , ' house ' , ' wall ' , ' weakest ' ]

Line 7 shows that there are 672 entries in this list and line 9 prints out the first 10
of these. These are some of the words that follow ‘the ’ in the play Romeo and Juliet.
There may be duplicates in this list, and they can be removed by using the set and
list commands as shown in Code 7.21. The set command removes the duplicates as
it creates a set. The list command converts that result back into a list.

Code 7.21 Isolating unique words.

1 >>> unique = list ( set ( answ ) )


2 >>> len ( unique )
3 559
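This exact-match search can also be written with a regular expression. The sketch below is not the author's method: it uses a word boundary so that words such as ‘bathe’ no longer count as hits, and the sample sentence is invented for illustration.

```python
import re

# \b is a word boundary, so 'the' inside 'bathe' does not match;
# (\w+) captures the word that follows.
text = 'the cat went to bathe the dog near the tub'
words = re.findall(r'\bthe (\w+)', text)
print(words)   # ['cat', 'dog', 'tub']
```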

7.3.3 Example: Sliding Block

Figure 7.1 shows the sliding box problem in which a box slides (without friction) down
an inclined plane with an angle of θ to the horizontal. The acceleration that the box
experiences is
a = g sin θ, (7.2)

where g is the acceleration due to gravity. Thus, the speed of the box is computed by,

v = gt sin θ, (7.3)

where t is the time. In this example a Python script is written to calculate the velocity of
the box at specific times.

Figure 7.1: The sliding box problem.

The process is shown in Code 7.22. For this problem there are two functions that
are needed from the math module. The sin function computes the sine of an angle.
However, Python, like most computer languages, uses radians rather than degrees for
angles. Therefore, the radians function is used to convert the angle from degrees to
radians. These two functions are imported in line 1. Line 2 sets the gravity constant to
9.8 and line 3 sets the angle theta to 20 degrees.
In this example, 10 time steps are printed and this process begins with the for loop
in line 4. The task is to compute the velocity for every quarter of a second. So, the variable
t is one-fourth of the integer i as computed in line 5. Line 6 computes the velocity and
prints it to the console. Four items are printed. The first two are the variables i and t.
The third item is a tab character which is used to make the output look nice. Finally, the
velocity at each individual time is printed.

Code 7.22 Computations for the sliding block.

1 >>> from math import radians , sin


2 >>> g = 9.8
3 >>> theta = radians (20)
4 >>> for i in range ( 10) :
5 t = i /4.
6 print ( i , t , ' \ t ' , g * t * sin ( theta ) )
7

8 0 0.0 0.0
9 1 0.25 0.837949351148
10 2 0.5 1.6758987023
11 3 0.75 2.51384805344
12 4 1.0 3.35179740459
13 5 1.25 4.18974675574
14 6 1.5 5.02769610689
15 7 1.75 5.86564545804
16 8 2.0 6.70359480918
17 9 2.25 7.54154416033
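The computation in Code 7.22 can also be packaged as a function. The sketch below is our own wrapper around Equation (7.3); the function name velocity is an assumption, not from the text.

```python
from math import radians, sin

# Wraps v = g t sin(theta); the angle is given in degrees and
# converted to radians inside the function.
def velocity(t, angle_deg, g=9.8):
    """Speed of a frictionless block after time t on an incline."""
    return g * t * sin(radians(angle_deg))

print(velocity(1.0, 20))   # about 3.3518, matching the table above
```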

7.3.4 Example: Compute π

In this example the value of π is calculated using random numbers. Consider Figure 7.2
which has a square that has a length and height of 2. Inside of this square is a circle with
a radius of 1.

Figure 7.2: A circle inscribed in a square.

The area of the square is,


A1 = 2 × 2 = 4, (7.4)

and the area of the circle is πr², but since r = 1 in this case the area is just,

A2 = πr² = π. (7.5)

Now consider that a dart is thrown at Figure 7.2 and it lands inside of the square.
There is also a probability that it will land inside of the circle. The probability of the dart
landing inside of the circle is,
p = A2/A1 = π/4. (7.6)

Thus, p = π/4 or, in other words, π = 4p. So, if the value of p can be determined then the
value of π can be determined. Now consider the idea of throwing thousands of darts at
the image. The probability p is the total number of darts that land inside of the circle
divided by the total number of darts. The question is then, how can we write a program
to throw these darts?
A dart is a random location inside of the square. This can be defined by a point
(x, y) where both x and y are random numbers between -1 and +1. Any dart that is inside
of the circle has a distance of less than 1 to the center of the circle. The distance from the
center of the circle to a point (x, y) is determined by,
d = √(x² + y²). (7.7)

Code 7.23 shows the process to perform these computations. It assumes that the
random module has been imported (as in Code 7.15) and that sqrt has been imported
from the math module. Line 1 creates the variable total which will count the total
number of darts. Line 2 creates the variable incirc which will count the number of
darts inside of the circle. Both of these are initialized to 0.

Code 7.23 Computing π with random numbers.

1 >>> total = 0
2 >>> incirc = 0
3 >>> for i in range ( 1000000 ) :
4 x = 2 * random . random () - 1
5 y = 2 * random . random () - 1
6 d = sqrt ( x * x + y * y )
7 if d < 1:
8 incirc = incirc + 1
9 total = total + 1
10

11 >>> float ( incirc ) / total * 4


12 3.142148

The for loop starts in line 3 and iterates one million times. Lines 4 and 5 create
the random point (x, y) by creating two random numbers between -1 and +1. The distance to
the center is computed in line 6. If this value is less than one then incirc is
increased by one, counting this particular dart as being inside of the circle. Every
dart gets counted in line 9. Finally, π = 4p is computed in line 11. As seen, the result in
line 12 is quite close to the value of π.
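For comparison, the same dart experiment can be vectorized with numpy (reviewed in Chapter 11). The sketch below is a preview of that style, not part of the author's script.

```python
import numpy as np

# Throw all one million darts at once as arrays of coordinates.
n = 1000000
x = 2 * np.random.rand(n) - 1
y = 2 * np.random.rand(n) - 1
inside = (x * x + y * y) < 1        # boolean array: dart inside circle
pi_est = 4.0 * inside.sum() / n
print(pi_est)   # close to 3.14159
```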

7.3.5 Example: Summation Equations

This section will demonstrate the process of converting a summation equation into Python
scripts. Consider the case where the initial data is a small tuple as shown in line 1 of Code
7.24. The len function returns the length of the tuple as seen in lines 2 and 3. The range
function returns a list that starts with 0 and increments up to the given number as seen
in lines 4 and 5.

Code 7.24 The initial data.


1 >>> x = (1 ,2 ,5 ,6)
2 >>> len ( x )
3 4
4 >>> list ( range ( len ( x ) ) )
5 [0 , 1 , 2 , 3]

The first task is to compute the summation,


a = Σ_{i=1}^{N} x_i. (7.8)

This is equivalent to,


a = x0 + x1 + x2 + x3. (7.9)

This is accomplished by a for loop as shown in Code 7.25. The answer is placed in
a variable named answ which is initialized to 0 in line 1. Lines 3 and 4 are inside the for
loop. Table 7.1 shows the value of each variable through each iteration. The final answer
is printed to the console in lines 5 and 6.

Code 7.25 Summing the values.

1 >>> answ = 0
2 >>> for i in range ( len ( x ) ) :
3 temp = answ + x [ i ]
4 answ = temp
5 >>> answ
6 14

A modified task is to compute the answer to,


a = Σ_i 2x_i. (7.10)

Code 7.26 shows the same process with the necessary modification in line 3. Note that
line 1 is required since the variable answ was changed in Code 7.25.

Table 7.1: Values of the variables during each iteration

              Line 3   Line 4
i   x[i]   answ   temp   answ
0    1       0      1      1
1    2       1      3      3
2    5       3      8      8
3    6       8     14     14

Code 7.26 More efficient code.


1 >>> answ = 0
2 >>> for i in range ( len ( x ) ) :
3 answ = answ + 2 * x [ i ]
4 >>> answ
5 28

Consider a slightly different equation,


a = (1/N) Σ_i x_i.

In this case the 1/N is outside of the summation. Therefore, the loop is completed before
the sum is multiplied by the fraction, as shown in Code 7.27. Lines 1 through 3 are the same as
in Code 7.25, and the loop finishes before line 5 is executed. Line 5 performs the
multiplication by 1/N, where N is the number of samples, len(x). The numerator is a
floating point value so that the result is also floating point.

Code 7.27 Code for the average function.

1 >>> answ = 0
2 >>> for i in range ( len ( x ) ) :
3 answ = answ + x [ i ]
4

5 >>> answ = answ * ( 1. / len ( x ) )
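The loop-then-scale pattern above can be wrapped in a reusable function. The following sketch is our own packaging; the function name average is an assumption, not from the text.

```python
# Sum the values in a loop, then multiply by 1/N after the loop
# finishes, exactly as in the equation for the average.
def average(x):
    answ = 0.0
    for i in range(len(x)):
        answ = answ + x[i]
    return answ * (1.0 / len(x))

print(average((1, 2, 5, 6)))   # 3.5
```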

7.4 Problems

1. Write a Python script that sets x = 9 and y = 10. The script prints YES to the
console if x > y and NO otherwise.

2. Set a = 0, b = 1 and c = 2. Write a script that prints YES if the value of b is
between a and c. Test this script again after setting b = 4.

3. Create a while loop that starts with x = 0 and increments x until x is equal to 5.
Each iteration should print x to the console.

4. Repeat the previous problem, but the loop will skip printing x = 5 to the console
but will print values of x from 6 to 10.

5. Create a for loop that prints values from 4 to 10 to the console.

6. Create a list of 10 data points in the form of (x, y). The values of these points can
be randomly assigned. Write a Python script in which both the x and y values are
used as indexes in the for loop. Print the values for each iteration.

7. Using the random dart method show that the area of a right triangle is half of the
area of the bounding box.

8. Using the random dart method show that the area of any triangle is half of the area
of the bounding box. The user should define the triangle by defining the corners as
three points in space.

9. Section 7.3.4 uses a circle that is inside of a square. Using the random dart method
compute the area of a square that is inside of a unit circle with all four corners
touching the circle.

Chapter 8

Input and Output

This chapter reviews methods in which Python can read and save text files.

8.1 Reading a File

There are three basic steps to reading data from a file on the hard drive. These are:

1. Open a file,
2. Read the data, and
3. Close the file.

Consider a case in which the text file “mydata.txt” already exists on the hard drive
and the goal is to read this data into Python. The three steps are shown in Code 8.1.
Line 1 opens the file using the file command, which exists only in Python 2; newer
versions of Python use the open command instead. The variable fp is a file pointer and
contains information about where the file exists on the hard drive and the current position
of the reader. When the file is opened the position is at the beginning of the file, but this
can be altered by the user as shown in Section 8.3.
Code 8.1 Reading a file.

1 >>> fp = file ( ' mydata . txt ' ) # Py 2.7


2 >>> fp = open ( ' mydata . txt ' ) # Py 3.4 or 2.7
3 >>> data = fp . read ()
4 >>> fp . close ()

Line 3 reads the entire file into Python. The variable data is a string. If the data is
numerical in nature then it will be necessary to convert the string into a numerical value.
This is discussed in Section 6.5. Line 4 closes the file. It is good practice to close files
when access is finished.

Code 8.1 assumes that the data file is in the current working directory. If that
is not the case then it is necessary to include the directory path when opening the file. An
example using a full path is shown in line 1 of Code 8.2, and line 2 shows the case of
a file specified relative to the current working directory.

Code 8.2 Accessing files in another directory.

1 >>> fp = open ( ' C :/ science / data / sales . txt ' )


2 >>> fp = open ( ' data . txt ' )
3 # alternate
4 >>> data = open ( fname ) . read ()

Code 8.2 uses forward slashes to delineate the directories in the path. This
is the style used in UNIX and OS X systems. Windows uses backslashes to delineate the
subdirectories. However, backslashes are also used to denote special characters such as a
tab (\t) or newline (\n). The Python solution is that two backslashes can be used to
delineate the directory structure, or forward slashes will still work in Windows.
It is possible to open, read and close a file in a single command. This shortcut is
shown in line 4 in Code 8.2.
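Another common idiom, not used in the codes above, is the with statement, which closes the file automatically even if an error occurs. The sketch below writes a small throwaway file first so that it is self-contained; the filename is invented.

```python
# Write a small file so that the example can run on its own.
with open('mydata_demo.txt', 'w') as fp:
    fp.write('hello world')

# The with block closes the file on exit, even when an error occurs.
with open('mydata_demo.txt') as fp:
    data = fp.read()

print(data)        # hello world
print(fp.closed)   # True
```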

8.2 Storing Data in a File

Storing data in a file follows a similar process in that the protocol is:

1. Open the file,
2. Write the data, and
3. Close the file.

These steps are shown in Code 8.3. Line 1 opens the file, and it should be noted that
if a file named “output.txt” previously existed then line 1 will eliminate that file. There
is no warning; once line 1 is executed the previous file is gone for good. The argument
’w’ is the flag that indicates that this file is open for writing.

Code 8.3 Opening a file for writing.

1 >>> fp1 = open ( ' output . txt ' , ' w ' )


2 >>> fp1 . write ( indata )
3 >>> fp1 . close ()

Line 2 writes the string indata to the file and line 3 closes the file. The only data
that can be written by this method is a string. If the data is numerical then it must be
converted to a string before it can be saved.

The methods shown save the data as a text file. The advantage is that the data
can be read on any platform: a file can be stored on Windows and then read on a
Mac. The disadvantage is that the files can become large. The alternative is to store data
in binary format, which has just the opposite features. Files are not easily transferred
from one platform to the next because Windows and Mac store binary data differently.
However, the files can be smaller, particularly for a lot of numerical data. Code 8.4 shows
the lines for opening a file for writing and reading binary data.

Code 8.4 Opening files in binary mode for writing and reading.

1 >>> fp = open ( ' output . txt ' , ' wb ' )


2 >>> fp = open ( ' output . txt ' , ' rb ' )

8.3 Moving the Position in the File

Modern biological labs rely on robots and computers to process and collect the data. The
experiments can process a large array of data and store it all in a single file. The files will
have several components such as information about the protocol, date, users, experiment,
data locations within the file and the raw data.
Reading such a file can be done in two ways. One is to load the entire file, which
can be several megabytes, and then process the data. The second is to move about the file
stored on the disk and extract the pertinent components. For example, early sequencers
would produce a file that had header information (date, etc.) which included the location
of the information about the data. This was at the end of the file. So, it was necessary
to jump towards that section of the file and then read the information about where the
raw data was kept in the file. Then the program needed to move backwards in the file to
where the raw data was stored.
So, it is necessary to have the ability to move about a file so that specific components
can be read. This is accomplished with the seek command. Code 8.5 shows an example.
Line 1 opens the file and line 2 moves the position to the 6th byte in the file. The read
command in line 3 has an integer argument which is the number of bytes to read; in this
case, only 1 byte is read. Line 4 repositions the file to the 3rd byte from the end and then
another single byte is read. This is a very simple example, but these are the steps
that are used to move about a file and read a specific number of bytes.
The current position in the file is returned by the fp.tell() command.
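The seek and tell steps can be combined into a small self-contained sketch. The filename and contents below are invented for illustration; binary mode ('rb') is used because Python 3 requires it when seeking relative to the end of a file.

```python
# Create a small file so the example can run on its own.
fname = 'seekdemo.txt'
with open(fname, 'w') as out:
    out.write('abcdefghij')

fp = open(fname, 'rb')    # binary mode allows seeking from the end
fp.seek(5)                # go to the 6th byte
b1 = fp.read(1)           # b'f'
pos = fp.tell()           # 6
fp.seek(-3, 2)            # 3rd byte before the end
b2 = fp.read(1)           # b'h'
fp.close()
print(b1, pos, b2)
```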

8.4 Pickle

Python offers terrific collections such as tuples and lists. Saving this data would be difficult
if every component was required to be converted to a string. The pickle module offers the

Code 8.5 Using the seek command.

1 >>> fp = open ( ' workfile ' , ' rb ' ) # binary mode allows seeking from the end


2 >>> fp . seek (5) # Go to the 6 th byte in the file
3 >>> print ( fp . read (1) )
4 >>> fp . seek (-3 , 2) # Go to the 3 rd byte before the end
5 >>> print ( fp . read (1) )
6 >>> fp . close ()

ability to store multiple types of data with a single command. The process is shown in Code
8.6, which starts with the creation of a tuple that contains another tuple. Line 3 opens the
file for writing in binary mode (required since Python 3.x). Whenever
a file is opened for writing with the open command it will destroy any existing file with the same name. There
is no warning and no Ctrl-Z that can reverse the deed. Once the command is executed the
previous file with the same name is gone.

Code 8.6 Saving data with the pickle module.

1 >>> atup = ( 5 , 6.7 , ' string ' )


2 >>> blist = (9 , -1 , atup , ' more ' )
3 >>> fp = open ( ' dud . txt ' , ' wb ' )
4 >>> import pickle
5 >>> pickle . dump ( blist , fp )
6 >>> fp . close ()

Line 4 imports the pickle module and line 5 shows the single dump command that
stores everything in the file. Code 8.7 shows the method to read in a pickled file. The
file is opened in the normal manner but the reading is performed by the load command.
As seen, data is the nested tuple that was created in Code 8.6.

Code 8.7 Loading data from the pickle module.

1 >>> fp = open ( ' dud . txt ' , ' rb ' )


2 >>> data = pickle . load ( fp )
3 >>> fp . close ()
4 >>> data
5 (9 , -1 , (5 , 6.7 , ' string ' ) , ' more ' )
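The pickle module also offers in-memory counterparts, dumps and loads, which convert an object to and from a bytes string without touching the disk. The sketch below is our own illustration of these functions.

```python
import pickle

# dumps serializes to a bytes object; loads reverses the process.
atup = (5, 6.7, 'string')
raw = pickle.dumps(atup)      # bytes, ready to write to a binary file
back = pickle.loads(raw)
print(back)   # (5, 6.7, 'string')
```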

8.5 Examples

This section presents several examples in Python.

8.5.1 Sliding Window in DNA

In this example a DNA string is analyzed to compute the percentage of ‘t’s within
a sliding window. The first step is to load the DNA data. Line 1 in Code 8.8 shows the
command to read in the file as a single long string. In this case fname is the name of
the file that contains the data. Line 3 shows that this is a very long string with over 4 million
characters.
Long strings should NEVER be printed in the IDLE shell; the process used for
printing will take an extremely long time. It is possible to print out a small portion of the
string as shown in line 4.

Code 8.8 Reading the DNA file.

1 >>> dna = open ( fname ) . read ()


2 >>> len ( dna )
3 4403837
4 >>> dna [:100]
5 ' ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcggtcgt
6 ctccgaacttaacggcgaccctaaggttgacgacggacccagcagtgatg '

In this task the goal is to compute the percentage of ‘t’s in a window of 10 characters.
This window is then moved to the next 10 characters and the ‘t’ percentage is calculated
for the new window. Line 1 in Code 8.9 shows the command to print the first ten
characters. Line 3 counts the number of ‘t’s in this small string. Line 5 computes the
percentage of ‘t’s in this small string. Line 7 computes the percentage for a different set
of 10 characters starting at position 200.

Code 8.9 Counting the occurrences of the letter ‘t’.

1 >>> dna [0:10]


2 ' ttgaccgatg '
3 >>> dna [0:10]. count ( ' t ' )
4 3
5 >>> dna [0:10]. count ( ' t ' ) /10.
6 0.3
7 >>> dna [200:200+10]. count ( ' t ' ) /10.
8 0.4

The goal is to consider 10 characters in one window which starts at position k; the
window is then moved to the next position at k + 10. Code 8.10 shows this task.
First an empty list is created in line 1 which will catch the answers as they are generated.
The variable pct is the percentage of ‘t’s for the window. Note that the division uses a floating
point 10 instead of an integer. This percentage is appended to the list in line 5 and the
answer is shown. These are the percentages of ‘t’s in a sliding window of length 10.

Code 8.10 A sliding window count.

1 >>> answ = []
2 >>> for i in range ( 0 , 100 , 10 ) :
3 count = dna [ i : i +10]. count ( ' t ' )
4 pct = count /10.0
5 answ . append ( pct )
6 >>> answ
7 [0.3 , 0.2 , 0.2 , 0.2 , 0.2 , 0.3 , 0.0 , 0.3 , 0.0 , 0.2]

The DNA string, however, is longer than 100 characters. So, a small modification is
needed in order to compute the percentages for the sliding window for the entire string.
The value of 100 needs to be replaced by the length of the string. The change is shown
in Code 8.11. Line 2 replaces the end value with len(dna). The answer is now a list of
over 400,000 numbers. It is highly recommended that the entire list NOT be printed to
the console.

Code 8.11 The sliding window for the entire string.

1 >>> answ = []
2 >>> for i in range ( 0 , len ( dna ) , 10 ) :
3 count = dna [ i : i +10]. count ( ' t ' )
4 pct = count /10.0
5 answ . append ( pct )
6 >>> len ( answ )
7 440384
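The sliding-window loop can also be written as a single list comprehension. In the sketch below the short dna string is just the first 50 characters printed in Code 8.8, so the result matches the first five values in Code 8.10.

```python
# Count 't's in non-overlapping windows of 10 characters using a
# list comprehension instead of an explicit loop.
dna = 'ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcggtcgt'
answ = [dna[i:i+10].count('t') / 10.0 for i in range(0, len(dna), 10)]
print(answ)   # [0.3, 0.2, 0.2, 0.2, 0.2]
```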

8.5.2 Example: Reading a Spreadsheet

This example shows a method by which a spreadsheet page can be read and parsed in
Python. The first step, of course, is to save the page from the spreadsheet as a tab
delimited file. The sample data is shown in the spreadsheet in Figure 8.1. This data is
saved as a tab delimited text file named sales.txt.
Code 8.12 shows the command to read in the file. The variable sales is a string
that has 1152 characters. The first 100 characters are printed and as seen this is the top
row of the spreadsheet. Each column is separated by a tab (\t) and each row is separated
by a newline character (\n).
Code 8.13 shows the steps in parsing the first line of data. Line 1 uses the split
command to separate the data into a list of strings. Each string is one row from the
spreadsheet. So lines[0] is the first row of the spreadsheet as shown in line 3. Line
4 splits that single string on the tab characters and thus the first row becomes a list of

Figure 8.1: Data in a spreadsheet.

Code 8.12 Reading the sales data.

1 >>> fname = ' data / sales . txt '


2 >>> sales = open ( fname ) . read ()
3 >>> len ( sales )
4 1152
5 >>> sales [:100]
6 ' Item \ tPrice \ tDelivery Charge \ tOrdered This Month \ nBath
Towels
7 \ t6 .95\ t5 .00\ t319 \ nBathroom Radio \ t24 .95\ t8 .00\ t15 '

strings where each string is one cell in the spreadsheet. This is shown in line 6.

Code 8.13 Splitting the data on newline and tab.

1 >>> lines = sales . split ( ' \ n ' )


2 >>> lines [0]
3 ' Item \ tPrice \ tDelivery Charge \ tOrdered This Month '
4 >>> heads = lines [0]. split ( ' \ t ' )
5 >>> heads
6 [ ' Item ' , ' Price ' , ' Delivery Charge ' , ' Ordered This Month ' ]

The data starts in the second line of the spreadsheet, and line 1 in Code
8.14 splits this line into its constituents. Note that in line 2 all of the items in the list are
strings.

Code 8.14 Splitting the first data line.

1 >>> lines [1]. split ( ' \ t ' )


2 [ ' Bath Towels ' , ' 6.95 ' , ' 5.00 ' , ' 319 ' ]

Code 8.15 shows the method by which the data is read and converted to floats for a
single line. The list temp is created in line 1. Line 2 splits lines[1] into its constituents,
which is the same operation as line 1 in Code 8.14. The first item a[0] is the string Bath Towels
and therefore the conversion to numbers starts with a[1]. The for loop in line 3 starts at 1 and
line 4 converts each of the numerical items to a float.

Code 8.15 Converting data to floats.

1 >>> temp = []
2 >>> a = lines [1]. split ( ' \ t ' )
3 >>> for j in range (1 , 4) :
4 temp . append ( float ( a [ j ] ) )
5 >>> temp
6 [6.95 , 5.0 , 319.0]

The entire process is shown in Code 8.16. It should be noted that the text file has
one empty line at the end; this is normal when a spreadsheet page
is saved as a text file. Line 1 creates an empty list that will hold all of the numerical
data. Line 2 starts the for loop, which excludes the first line since it holds the headers
and excludes the last line since it is known to be empty. Lines 3 through 6 repeat the
process from Code 8.15, applied to each row as the outer loop iterates.

Code 8.16 Converting all of the data.

1 >>> answ = []
2 >>> for i in range ( 1 , len ( lines )-1 ) :
3 a = lines [ i ]. split ( ' \ t ' )
4 temp = []
5 for j in range ( 1 , 4 ) :
6 temp . append ( float ( a [ j ]) )
7 answ . append ( temp )
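The parsing steps can be collected into a helper function. The sketch below is our own wrapper, not from the text; the sample string stands in for a saved spreadsheet page.

```python
# parse_tab_file is a hypothetical helper name chosen for this sketch.
def parse_tab_file(text, first_col=1):
    """Split tab-delimited text into rows of floats, skipping the
    header row and any trailing empty line."""
    rows = []
    for line in text.split('\n')[1:]:
        if line == '':
            continue
        cells = line.split('\t')
        rows.append([float(c) for c in cells[first_col:]])
    return rows

sample = 'Item\tPrice\tOrdered\nTowels\t6.95\t319\nRadio\t24.95\t15\n'
print(parse_tab_file(sample))   # [[6.95, 319.0], [24.95, 15.0]]
```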

Problems

1. Show that the DNA string contains only four letters.

2. In the DNA string there are regions that have a repeating letter. What is the letter
and length of the longest repeating region?

3. How many ’ATG’s are in the DNA string?

4. In Romeo and Juliet retrieve all of the capitalized words that do not start a sentence.
Use set and list to remove duplicates from this list.

5. Does the phrase “Juliet and Romeo” exist in the play?

6. Return a list of all of the locations of the word “Juliet”.

7. What is the largest distance (number of characters) between two consecutive in-
stances of the word “Juliet”. (The previous problem will be of assistance.)

8. What is the most common word in Romeo and Juliet that is at least 5 letters long?

Chapter 9

Python and Excel

There are now many online archives of biological data and often this data is available
in the form of a spreadsheet. This chapter will review the different methods by which
spreadsheets can be read by Python.
In the first method the user saves the spreadsheet page as a tab delimited text
file and then uses Python to read the file and parse the data. The second method reads
that same file using the csv module. When the spreadsheet is saved as a text file only the
data is saved: plots, charts, equations and formatting are all lost. There are modules that
allow the user to read and write spreadsheets including these features. The third section
in this chapter reviews the xlrd module, which can read a spreadsheet file directly. The
final method uses the openpyxl module, which can read and write the .xlsx format. While these
latter two modules can write to spreadsheets, this chapter only reviews the methods of
reading the data. There are many aspects of these modules which are not covered here.

9.1 Text Files

The first method requires that the user save the spreadsheet page as a tab delimited text
file. This saves only a single page and only the data therein. It is important to use the tab
delimited option instead of the comma delimited option because some of the fields like a
gene name may contain commas.
A spreadsheet can be saved in many different formats. In LibreOffice the user selects
the File menu and the Save As option. At the bottom of the dialog there is an option
to change the format of the file to be saved. The selection is changed to Text CSV. The
first pop-up dialog is shown in Figure 9.1(a) and the “Use Text CSV Format” option should be
selected. This creates a second dialog, shown in Figure 9.1(b). Here the user needs
to make the correct choices as shown in the first two fields: UTF-8 is the standard text format
and Tab is selected as the delimiter.
The output is a text file which contains the data from the spreadsheet. Each cell is

(a) (b)

Figure 9.1: Two pop up dialogs.

separated by a tab and each row is separated by the newline character, which appears as
‘\n’ when displayed. Figure 9.2 shows two parts of a very large spreadsheet, and Code 9.1
loads the data in line 2.

(a)

(b)

Figure 9.2: Parts of a large spreadsheet.

Code 9.1 Loading the data.

1 >>> fname = ' marray / GSM151667 . csv '


2 >>> data = open ( fname ) . read ()
3 >>> len ( data )
4 697781
5 >>> data [500232:500274]
6 ' 0.993095\ t1 .688044\ t1 \ t1 \ t0 \ n883 \ t2 \ t1 \ t5 \ t3 \ tLGALS2 '

The data is almost 700,000 characters and this is far too much to print to the console,
so only a portion is printed in line 6. The first number is 0.993095 which corresponds to the
cell highlighted in Figure 9.2(a). Following that cell is a cell with the number 1.688044 and
the two are separated by a tab character which is denoted by \t. Each of the remaining
cells in that row are also separated by tabs. The row ends with a cell containing the value
of 0 and after that is the newline character \n.
The number 883 in the last line in Code 9.1 begins the next row in the spreadsheet
which is shown in Figure 9.2(b). This is the nature of the tab delimited spreadsheet.
Each row in the spreadsheet can be isolated by the split command. Code 9.2 shows
the manner in which each row of data is separated. The output is a list named lines that
contains strings. Each string is a row of data from the spreadsheet. The cells in each row
can be separated by splitting on tab characters.

Code 9.2 Separating the rows.

1 >>> lines = data . split ( ' \ n ' )


2 >>> len ( lines )
3 3259

In this large spreadsheet there are three sections of data and the third section has
the raw data. This portion is shown in Figure 9.3. The task demonstrated here is to
extract six columns: Number, Name, ch1 Intensity, ch1 Background, ch2 Intensity and ch2
Background.

Figure 9.3: The portion of the spreadsheet at the beginning of the raw data.

The first step is to find the location of “Begin Data” in the original string. This
is done in line 1 of Code 9.3. Only the data after that point is important to this application,
and so in line 4 that portion of the spreadsheet data is split. The output, lines, is a
list of strings, with each string being a row from the spreadsheet. The first item in this
list, lines[0], is row 1650 in Figure 9.3. The second row holds the column names, of
which only a few are shown in the figure. The rest of the lines in Code 9.3 find which
columns are those of interest.
The final step is to collect the data. In this case there are 1600 lines of data and
the real data starts in lines[2]. So the loop in line 2 of Code 9.4 runs over those lines.
Line 3 splits the cells on tabs and line 4 extracts only those columns that are of interest.
It also converts strings to integers or floats as necessary. Each of these lists of data is
appended to the big list gsmvals. The first row is shown. From here the user can perform
the analysis on the data.

9.2 The csv Module

Python installations come with the csv module, which has the ability to read files saved in
the CSV format. The advantage of this module over the previous method is that it can
Code 9.3 Determining the columns.

1 >>> begin = data . find ( ' Begin Data ' )


2 >>> begin
3 258341
4 >>> lines = data [ begin :]. split ( ' \ n ' )
5 >>> row = lines [1]. split ( ' \ t ' )
6 >>> row . index ( ' Number ' )
7 0
8 >>> row . index ( ' Name ' )
9 5
10 >>> row . index ( ' ch1 Intensity ' )
11 8
12 >>> row . index ( ' ch1 Background ' )
13 9
14 >>> row . index ( ' ch2 Intensity ' )
15 20
16 >>> row . index ( ' ch2 Background ' )
17 21

Code 9.4 Gathering the data.

1 >>> gsmvals = []
2 >>> for li in lines [2:1602]:
3 temp = li . split ( ' \ t ' )
4 tlist = [ int ( temp [0]) , temp [5] , float ( temp [8]) , float (
temp [9]) , float ( temp [20]) , float ( temp [21]) ]
5 gsmvals . append ( tlist )
6 >>> gsmvals [0]
7 [1 , ' phosphodiesterase I / nucleotide pyrophosphatase 2 (
autotaxin ) ' , 3077.651611 , 1083.671875 , 1107.415527 ,
374.328125]

handle multiple formats in which the data is saved.
Code 9.5 shows the use of this module on the same file that was used in the previous
section. The file is opened in a normal manner as shown in line 2. Line 3 defines a new
variable as a csv reader. In this case the delimiter is explicitly defined as the tab character.
Without that declaration, commas would be treated as the delimiter, and as there are
commas in some gene names this would cause incorrect reading of the data.

Code 9.5 Using the csv module.

1 >>> import csv


2 >>> fp = open ( fname )
3 >>> cr = csv . reader ( fp , delimiter = ' \ t ' )
4 >>> ldata = []
5 >>> for r in cr :
6 ldata . append ( r )
7 >>> for i in range ( len ( ldata ) ) :
8 if ' Begin Data ' in ldata [ i ]:
9 print ( i )
10 1649
11 >>> for ld in ldata [1651:1651+1600]:
12 gsmvals . append ( [ int ( ld [0]) , ld [5] , float ( ld [8]) , float (
ld [9]) , float ( ld [20]) , float ( ld [21]) ] )

Line 4 creates an empty list that is populated in lines 5 and 6. These two lines
convert the data so that each row from the spreadsheet is a list of strings. Each cell is a
string in that list.
To replicate the extraction of data performed in the previous section the first task
is to find which row contains the phrase “Begin Data”. This is performed in lines 7 and 8
and the result indicates that this is in ldata[1649]. The last two lines extract the same
six columns of data as in the previous method.
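The csv module also provides a DictReader class that maps each row to the column names taken from the header line, so columns can be addressed by name rather than index. The following sketch uses a tiny invented tab-delimited string in place of the real file:

```python
import csv
import io

# A small tab-delimited table standing in for the spreadsheet export.
text = "Number\tName\tch1 Intensity\n1\tgeneA\t3077.6\n2\tgeneB\t2500.0\n"

# DictReader uses the first row as the keys for every following row.
cr = csv.DictReader(io.StringIO(text), delimiter='\t')
rows = [r for r in cr]
print(rows[0]['Name'])                   # geneA
print(float(rows[1]['ch1 Intensity']))   # 2500.0
```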

9.3 xlrd

There are two modules that are reviewed here that have the ability to read an Excel file
directly. These modules have several functions, but only those necessary for reading a
file are shown here. It should be noted that the two previous methods could only read
the data of a single page, while these next two modules can also read all pages, formulas
and other entities in the spreadsheet. Neither of these modules comes with the native
version of Python, so users may have to download and install them. They are, however,
included with distributions such as Anaconda.
The first module is xlrd which can read the older style of Excel files that come with

the extension “.xls.” The example is shown in Code 9.6. The file is opened in line 2.
Lines 3 and 4 show the list of page names. In this case, the spreadsheet has only one page
named “Export”.

Code 9.6 Using the xlrd module.

1 >>> import xlrd


2 >>> wb = xlrd . open_workbook ( ' data / GSM151667 . xls ' )
3 >>> wb . sheet_names ()
4 [ ' Export ' ]
5 >>> sheet = wb . sheet_by_name ( ' Export ' )
6 >>> row = sheet . row (0)
7 >>> type ( row )
8 < class ' list ' >
9 >>> len ( row )
10 34
11 >>> row [0]. value
12 ' User Name '

Line 5 extracts the data from the specified page and line 6 shows the extraction of a
single row of data. This is a list and in this case this list has 34 items. There is one item
for each cell in that spreadsheet row. The last two lines show how to retrieve the content
of the first cell in the first row.
Code 9.7 shows the use of this module. Lines 1 and 2 indicate that the sheet has 3258
rows. Lines 3 through 7 find the one row with the string “Begin Data”. The rest of the
lines convert the data to a list for further processing. Note that numbers are automatically
converted to floats in this process.

9.4 Openpyxl

The openpyxl module offers routines to read the XLSX file format. Code 9.8 shows the
process of loading the file and getting the page names in lines 1 through 3. Access to the
cells is shown in lines 4 through 7. Use of the active sheet is shown in the last lines.
Code 9.9 shows that the variable ws.rows is just a tuple and that a single row such
as ws.rows[0] is also a tuple. Therefore, they can be accessed through numerical indexes
as shown in the final lines.

9.5 Summary

This chapter demonstrated four possible ways of accessing data contained in a spreadsheet
from Python. The first two required that the user save the spreadsheet information as a

Code 9.7 Converting the data.

1 >>> sheet . nrows


2 3258
3 >>> for i in range ( sheet . nrows ) :
4 row = sheet . row ( i )
5 if ' Begin Data ' == row [0]. value :
6 print ( i )
7 1649
8 >>> ldata = []
9 >>> for i in range ( 1651 ,1651+1600) :
10 row = sheet . row ( i )
11 t = []
12 for j in (0 ,5 ,8 ,9 ,20 ,21) :
13 t . append ( row [ j ]. value )
14 ldata . append ( t )

Code 9.8 Using openpyxl.

1 >>> wb = openpyxl . load_workbook ( ' data / GSM151667 . xlsx ' )


2 >>> wb . get_sheet_names ()
3 [ ' Export ' ]
4 >>> wb [ ' Export ' ][ ' A1 ' ]. value
5 ' User Name '
6 >>> wb [ ' Export ' ][ ' A2 ' ]. value
7 ' Computer '
8 >>> ws = wb . active
9 >>> ws [ ' A1 ' ]. value
10 ' User Name '

Code 9.9 Alternative usage.

1 >>> type ( ws . rows )


2 < class ' tuple ' >
3 >>> len ( ws . rows )
4 3258
5 >>> type ( ws . rows [0] )
6 < class ' tuple ' >
7 >>> len ( ws . rows [0] )
8 34
9 >>> ws . rows [0][0]
10 < Cell Export . A1 >
11 >>> ws . rows [0][0]. value
12 ' User Name '

CSV file and the last two read directly from the spreadsheet.

Chapter 10

Reading a Binary File

When it became possible to sequence parts of the genome several companies created se-
quencing machines. One such sequencer was made by ABI and it had the ability to run a
few dozen experiments at one time. This machine produced a data file that had several
components and this chapter will explore the methods needed to read this file. All of the in-
formation about this file was obtained from the published ABI documentation.[ABI, 2016]

10.1 A Brief Overview of a Sequencer

A single sample of DNA contained a large number of DNA strands, each starting at the
same location in the genome. However, the lengths varied. One of four dyes was
attached to the end of each strand depending on the last base in the sample. Thus, if the
dye could be detected then it was possible to determine the last base in a strand.
The next step is to separate the strands by length. This was performed by sending
the strands through a gel. Longer strands encountered more resistance and therefore
traveled slower through the gel. The gel was kept between two plates of glass and oriented
vertically so that the sample went from the top of the gel to the bottom by the aid of an
electric potential and gravity. At the bottom of the gel was a laser that would illuminate
the dyes as they passed through and an optical detector that would receive the fluorescence
from the dye. In these machines the gel was wide enough to run a few dozen samples at the
same time. Each set of samples ran down a lane and the laser could illuminate all of the
lanes.
The information for each lane was saved in a separate file. This file contains infor-
mation about the experiment as well as the data from the experiment. Since there are four
nucleotides in DNA there were four dyes and therefore each experiment had four channels.
One channel from one experiment is shown in Figure 10.1. The x axis corresponds to time
and each peak is the presence of this dye at a given time. There is also a bias as this
sample does not go to 0 when the dye is not present.

Figure 10.1: One channel from one lane.

This experiment also collected almost 8000 data points. Concurrently, it was col-
lecting data points for the other three channels. Figure 10.2 shows a small segment of
the experiment with all four channels. Dyes react to a range of optical frequencies and
therefore activity in one channel can also be seen in another. This occurs at locations where
two channels have a peak at the same time. Also noticeable is that each channel has its
own baseline and as seen in Figure 10.1 this baseline changes in time.

Figure 10.2: All four channels in a small segment.

A deconvolution process is applied to clean the data. The same segment of signal is
shown in Figure 10.3 after the deconvolution process was applied. As seen, each channel
has a baseline at y = 0 and only one channel peaks at any given time.

Figure 10.3: The same signal after deconvolution.

The final step was to call the bases. In this case the red line is associated with
G, green with A, blue with T and violet with C. Thus, in this segment the calls are
ACTATAGGGCGAATTCGAG.
The data files contain a lot of information, but the intent of this chapter is to
demonstrate how to read data files. Thus, extracting all of the information will not be
performed. Instead the only retrievals will be the raw and cleaned data as well as the base
calls. The rest of the information can be retrieved in manners very similar to those shown
in this chapter.

10.2 Hexadecimal

Before the data is extracted it is necessary to understand two forms of numerical represen-
tation. People use a base 10 system. A number of ten or greater requires two digits. One
digit uses the ones column and the other uses the tens column. Numbers of 100 or greater
use the hundreds column and so on.
While this system was logical for humans that mostly had ten fingers, it was not
well suited for computer use. Computers actually can only store information in a binary
format. Each bit of memory is represented by either a 0 or 1.
A byte of memory is eight bits and therefore can represent 256 different values.
A word is two bytes or 16 bits. A word can represent 65,536 different values. Modern
computers are 32 or 64 bits.
Binary notation is cumbersome and easily prone to errors, so the hexadecimal
system is often employed. This is a base-16 system and the conversion is shown in Table
10.1. The digits 0 through 9 are the same in hexadecimal as in decimal and so they are

Table 10.1: Hexadecimal Values.

Hexadecimal Decimal
0 0
9 9
A 10
B 11
C 12
D 13
E 14
F 15

not all shown here. A 10 in decimal is represented as an A in hexadecimal.


A hexadecimal digit can be represented by 4 bits, and thus two hexadecimal symbols
are used to represent a byte. The value 9A is computed as 9 × 16 + 10 = 154. To clarify
which representation is used a lower case letter may be used. Thus, 9Ah = 154d.
A 16 bit word is two bytes and can represent 65,536 values. Thus, a word could
range from 0 to 65,535. In some cases, one of the bits is used to represent the sign of
a number leaving 15 bits for the number. This type of word can represent values from
-32,768 to 32,767. In Python the first type of word is represented by uint16 indicating
that it is an unsigned 16 bit integer. The second type is int16 which is a signed 16 bit
integer. Python also has 8, 32 and 64 bit representations as well as various floats, but
since they are not used in this chapter they are not reviewed.
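The signed and unsigned interpretations can be checked directly with the struct module, which is used later in this chapter. This small sketch shows how the same two bytes that read as 65,535 unsigned become -1 when read as signed:

```python
import struct

# Pack 65535 as a big-endian unsigned 16 bit integer ('H') -> two bytes.
b = struct.pack('>H', 65535)

# Reinterpret the same two bytes as a signed 16 bit integer ('h'):
# the top bit becomes the sign bit.
print(struct.unpack('>h', b)[0])  # -1
print(2**16)                      # 65536 distinct values in a 16 bit word
```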
A word consists of 2 bytes and the next question is which byte is stored first in the
memory. Of course, computer architectures differ in this respect. MS-Windows machines
run on Intel chips (or similar) which use the little endian format, so the two bytes are
stored in reverse order. Macs originally used Motorola chips; those machines, and many
UNIX workstations of the era, used big endian.
The concern here is that the ABI files use 16 bit integers and it is necessary to know
if they are stored as big or little endian. The files used here are stored as big endian
because the early ABI computers were controlled by Apple computers.
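The difference between the two byte orders can be demonstrated with the struct module; the '>' and '<' prefixes select big and little endian respectively:

```python
import struct

# The value 0x0102 stored big endian puts the high byte first;
# little endian reverses the byte order.
big = struct.pack('>H', 0x0102)
little = struct.pack('<H', 0x0102)
print(big)     # b'\x01\x02'
print(little)  # b'\x02\x01'

# Reading the ABI files therefore requires the '>' (big endian) prefix.
print(struct.unpack('>H', b'\x01\x02')[0])  # 258
```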

10.3 The ABI File Format

The ABI file is rather large and contains a plethora of information. Programs such as
hexdump can show the raw contents of a file easily. This program comes with UNIX
and OSx operating systems and is called from the command line with the command hd
filename. Hexdump programs for Windows are available but care should be used when
downloading executable programs from websites.
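A minimal hexdump substitute can also be written in a few lines of Python. This is a simplified sketch invented for illustration, not a replica of the UNIX hd output:

```python
def mini_hexdump(data, width=16):
    """Return offset, hex bytes, and ASCII for a bytes object, hd-style."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = ' '.join('%02x' % b for b in chunk)
        # printable ASCII (32..126) or a period, as hd does
        asc = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append('%08x  %-47s  %s' % (off, hexpart, asc))
    return '\n'.join(lines)

# The first bytes of an ABI file begin with the characters ABIF.
print(mini_hexdump(b'ABIF\x00\x65tdir'))
```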
Figure 10.4 shows the beginning of the hexdump for an ABI gel file. The left column

is the location in the file represented in hexadecimal notation. So, line 1 starts at 00000000
and line 2 starts at 000000010 which is location 16d. The next segment of a line shows
sixteen bytes in hexadecimal notation. The last column is the ASCII rendering of the file.
A computer can only store numerical values, and the ASCII table associates each
letter with a numerical value. The last column in the display shows the ASCII equivalents
for each byte. Not all bytes have an associated character and so periods are used.

Figure 10.4: The beginning of the hexdump.

Like many file formats the first few bytes denote the type of file. In this case the
first four bytes are 41 42 49 46. These four bytes represent the characters ABIF from the
ASCII table.
Python offers two functions that can convert between characters and numerical
values. It also provides tools that convert between hex and decimal notations. These are
shown in Code 10.1. Line 1 uses the hex command to convert the decimal value 65 to its
hex equivalent which is 41h. Python represents a hex number with 0x. In line 3 the user
enters the hex value with this notation and the decimal value is returned. The first byte
in the file was 41h and line 5 uses the chr command to return the associated character
which is a capital ‘A’. The second byte in the file was 42h and that is returned as ‘B’. The
ord function finds the decimal value for a given letter.

Code 10.1 Using Python for character conversions.

1 >>> hex (65)


2 ' 0 x41 '
3 >>> 0 x41
4 65
5 >>> chr (0 x41 )
6 'A '
7 >>> chr (0 x42 )
8 'B '
9 >>> ord ( ' A ' )
10 65

The first four bytes in the file are ABIF and the next two bytes represent the version

Table 10.2: ABI Record.

Bytes Type Description


4 char Name of the record
4 uint32 Record number
2 uint16 Element Type
2 uint16 Element Size
4 uint32 Number of elements
4 uint32 Data size
4 uint32 Data location
4 Unused Unused

identifier. This is an unsigned integer. Code 10.2 shows that this file used version 101.

Code 10.2 ABI version number.


1 >>> 0 x0065
2 101

There are many bytes filled with FF and the hexdump program will place an asterisk
in a line to show that this is just the same set of bytes in all of the missing rows.

10.3.1 ABI Records

This file relies on the use of a record which was defined by ABI. A record consists of 28
bytes in the format shown in Table 10.2.
The first record starts just after the ABIF and version number which is location six.
Code 10.3 shows the steps to read the bytes for the first record. The file is opened in line
1 and the file pointer is moved to location 6 which is the beginning of the first record.
Line 4 reads the 28 bytes of the record but does not interpret them. The first four bytes
are the name of the record and are shown in line 6. The name of the record is “tdir” and
the ‘b’ that precedes them is the Python indication that this is a series of bytes.

Code 10.3 Reading the first record.

1 >>> fp = open ( ' data / abilane1 ' , ' rb ' )


2 >>> fp . seek (6)
3 6
4 >>> a = fp . read (28)
5 >>> a [:4]
6 b ' tdir '

Python has a module named struct which can conveniently convert bytes read from
a file to the desired format. This module has two main functions pack and unpack. The

latter is applied to the record in line 2 of Code 10.4. The second argument is a[4:] which
uses all of the bytes of the record except for the first 4. These have already been used to
return the name of the record. As shown in Table 10.2 the rest of the data is either 32 or
16 bit integers. The letter for an unsigned 16 bit integer (also called an unsigned short) is ‘H’
and for an unsigned 32 bit integer it is ‘I’. Thus the string ‘IHHIIII’ interprets the data as shown
in Table 10.2. The symbol ‘>’ indicates that the data is big endian. The unpack function
has many symbols that can be used and the reader is encouraged to view these options in
the Python manual pages at https://docs.python.org/3.5/library/struct.html.

Code 10.4 Interpreting the bytes.

1 >>> import struct


2 >>> struct . unpack ( ' > IHHIIII ' , a [4:] )
3 (1 , 1023 , 28 , 56 , 1792 , 128947 , 19912500)
4 >>> hex (128947)
5 ' 0 x1f7b3 '

The unpack function returns 7 numbers accordingly. The information that is im-
portant here is that there are 56 records and the starting location in the file of these
records is 128,947d which is also 01 F7 B3 in hex.
Figure 10.5 shows this location in the hexdump. The second row begins with 00 01
F7 B0 and so the starting location is the third byte in from the left. As seen in the right
column this corresponds to the record named AUTO. Every 28 bytes there is a new record
and some of their names are visible in this figure.

Figure 10.5: The hexdump including the location 01 F7 B3.

There are 56 records in this file and only a few are of interest here. There are 12
records named DATA. The first four are the four raw data channels as shown in Figures
10.1 and 10.2. The next four contain the information used in the deconvolution process
and the last four contain the cleaned data such as shown in Figure 10.3. The other record
of interest is named PBAS which contains the base calls.

10.3.2 Extracting the Records

Code 10.5 shows the ReadRecord function which reads and interprets a single
record following the previous prescription. This function is called 56 times for this file,
gathering information from all the records.

Code 10.5 The ReadRecord function.


1 # abigel . py
2 def ReadRecord ( fp , loc ) :
3 fp . seek ( loc )
4 a = fp . read ( 28 )
5 name = a [:4]
6 b = struct . unpack ( ' > IHHIIII ' , a [4:] )
7 number , elemtype , elemsize , numelem , datasize , dataloc ,
mystery = b
8 return name , number , elemtype , elemsize , numelem ,
datasize , dataloc , mystery
9

10 >>> recs = []
11 >>> k = 128947
12 >>> for i in range ( 56 ) :
13 recs . append ( abigel . ReadRecord ( fp , k ) )
14 k += 28
15 >>> recs [35]
16 ( b ' PBAS ' , 2 , 2 , 1 , 576 , 576 , 128317 , 19912464)

10.3.2.1 The Base Calls

The records are in alphabetical order and recs[35] is the PBAS record. This record
indicates that the data starts at location 128317 and that there are 576 bytes of data.
Code 10.6 shows the movement of the file pointer and line 3 reads the ensuing 576 bytes.
These are bases as called by the ABI software. The ‘N’ letters indicate that a base exists
but there was not enough information to reliably call the base.
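Once the bytes of calls are in hand, counting the ambiguous bases is a one-line check. The short byte string below is an invented stand-in for the real 576-byte sequence:

```python
# calls is a bytes object; b'N' marks bases that could not be resolved.
calls = b'TNNGAATTGCATACGACTCACTATAGGGCGAATTCGAGN'

print(calls.count(b'N'))                # number of ambiguous calls
print(len(calls) - calls.count(b'N'))   # bases called with confidence
```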

10.3.2.2 The Data

The first two records for the data are shown in Code 10.7. The first number indicates the
record number, so these are 1 and 2. The second value is 4 and this indicates that the
data is a signed 16 bit integer (see the ABI manual starting on page 13). The next value
is 2 which indicates that a 16 bit integer is 2 bytes. The next number is 7754 which is
the number of data samples and since each sample is 2 bytes the total number of bytes

Code 10.6 The base calls.
1 >>> fp . seek ( 128317)
2 128317
3 >>> calls = fp . read ( 576 )
4 >>> calls
5 b' TNNGAATTGCATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCTC
6 TAGAGTCGACCTGCAGGCATGCAAGCTTGAGTATTCTATAGTGTCACCTAAATAGCTTGG
7 CGTAATCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACA
8 ACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCA
9 CATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGC
10 TTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTC
11 CTCGCTCACTGACTCGCTGNGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTC
12 AAAGGCGGGTAATACGGGTTATCCACAGGAATCAGGGGATAACGCAGGAAAGACATGTGA
13 GCAAAAGGGCAGCAAAAGGGCAGGAACCCTAAAAAGGCCGCGTTGGTGGGNTTTTCCATA
14 GGGTCCCCCCCCTGANGAGATAAAAAANCGAGGTCAC '

is 15508. The next numbers are the locations of the data. So, the first channel starts at
location 453 in the file and the second channel starts at location 15961.

Code 10.7 The first data record.


1 >>> recs [5]
2 ( b ' DATA ' , 1 , 4 , 2 , 7754 , 15508 , 453 , 0)
3 >>> recs [6]
4 ( b ' DATA ' , 2 , 4 , 2 , 7754 , 15508 , 15961 , 0)

Code 10.8 shows that the file pointer is moved to the location 453. Line 3 reads in
the bytes and line 4 converts them all to big endian, signed 16 bit integers. The use of
‘7754h’ indicates that there are 7754 signed 16 bit integers to be decoded. The last lines
show the first ten values which are the first ten values in the plot in Figure 10.1.

Code 10.8 Retrieving the first channel.

1 >>> fp . seek ( 453 )


2 453
3 >>> a = fp . read ( 15508 )
4 >>> data = struct . unpack ( ' >7754 h ' , a )
5 >>> data [:10]
6 (878 , 880 , 877 , 878 , 876 , 871 , 874 , 873 , 877 , 875)

The data can be saved using the function Save from the gnu module. Then either
GnuPlot or a spreadsheet can plot the data.
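The gnu module is a helper used throughout this text; where it is not available, numpy.savetxt writes a similar plain-text column that GnuPlot or a spreadsheet can read. This is offered as an assumed substitute, not the book's Save function:

```python
import io
import numpy as np

# A short stand-in for one channel of trace data.
data = np.array([878, 880, 877, 878, 876], dtype=np.int16)

# savetxt accepts a file name in practice; a StringIO keeps this
# sketch self-contained.
buf = io.StringIO()
np.savetxt(buf, data, fmt='%d')   # one value per line, plain text
print(buf.getvalue())
```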

10.3.3 Cohesive Program

Already the ReadRecord function has been presented but other functions are needed to
automate the reading of the gel file. Code 10.9 shows the ReadPBAS function,
which searches the records for the one named PBAS and then extracts the called
bases. The inputs are the file pointer and the records which are read by ReadRecord.

Code 10.9 The ReadPBAS function.


1 # abigel . py
2 def ReadPBAS ( fp , recs ) :
3 for r in recs :
4 if r [0]== b ' PBAS ' :
5 fp . seek ( r [6] )
6 calls = fp . read ( r [5])
7 break
8 return calls

The ReadData function shown in Code 10.10 reads the data that were shown in
the earlier plots. The inputs are the file pointer and the previously read records.

Code 10.10 The ReadData function.


1 # abigel . py
2 def ReadData ( fp , recs ) :
3 data = []
4 k = 0
5 for r in recs :
6 if r [0] == b ' DATA ' :
7 if k <=3 or k >= 8:
8 print ( r )
9 fp . seek ( r [6] )
10 a = fp . read ( r [5] )
11 g = ' > ' + str ( int ( len ( a ) /2) ) + ' h '
12 b = np . array ( struct . unpack (g , a ) )
13 data . append ( b )
14 k += 1
15 return data

There are 12 entries named DATA and the first four and last four are the desired
arrays of data. Thus, line 4 creates an integer k. It is incremented after each iteration in
line 14. The data is extracted only if k is less than 4 or greater than 7 as in line 7. Line 9
moves the file pointer and line 10 retrieves the data. The first four data channels have the
same length, but this length is different than the length of the last four channels. Thus,

the instruction to struct.unpack must be created. The variable g in line 11 is a string
that is the instruction for unpack. Then line 12 executes that command converting the
data to signed integers. The list data contains four channels of raw data and four channels
of cleaned data.
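The construction of the format string in line 11 can be tried in isolation. This sketch packs four invented sample values as big-endian signed 16 bit integers and rebuilds the unpack instruction from the byte count:

```python
import struct

# Fake big-endian data bytes standing in for one channel of the file.
raw = struct.pack('>4h', 878, 880, 877, 878)

# Build the format string at run time, as line 11 of Code 10.10 does:
# 8 bytes / 2 bytes per sample -> '>4h'.
g = '>' + str(len(raw) // 2) + 'h'
print(g)                      # >4h
print(struct.unpack(g, raw))  # (878, 880, 877, 878)
```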
The SaveData function shown in Code 10.11 saves the channels in eight different files. The
input is the data returned from the previous function and the pname is the partial file
name which has a dummy default value. In this case the files will be stored as dud0.txt,
dud1.txt etc. The file name is created in line 4 and the Save function from the gnu module
is called to save the file in a text format that is readable by GnuPlot or a spreadsheet.

Code 10.11 The SaveData function.


1 # abigel . py
2 def SaveData ( data , pname = ' dud ' ) :
3 for i in range ( len ( data ) ) :
4 fname = pname + str ( i ) + ' . txt '
5 gnu . Save ( fname , data [ i ] )

The final function is named Driver which performs all of the tasks. Line 3 opens
the file for reading binary data and the first record is read in line 4. This indicates the
location of the other records, loc, and the number of records, nrec. These are read in
line 9 and appended into a list in line 10. Then the functions that read the calls and data
are accessed. The final step is to save the data to the disc for viewing and to return the
data to the user.

Code 10.12 The Driver function.


1 # abigel . py
2 def Driver ( fname ) :
3 fp = open ( fname , ' rb ' )
4 t = ReadRecord ( fp ,6 )
5 loc = t [6]
6 nrec = t [4]
7 recs = []
8 for i in range ( nrec ) :
9 t1 = ReadRecord ( fp , loc + i *28 )
10 recs . append ( t1 )
11 calls = ReadPBAS ( fp , recs )
12 data = ReadData ( fp , recs )
13 SaveData ( data )
14 return recs , calls

Problems

1. Extract the names of all of the records.

2. Show that there are no bytes between the end of the raw data for the first channel
and the beginning of the data for the second channel.

3. Create a plot of the clean data from position 2000 to 2200.

Chapter 11

Python Arrays

The standard Python installation has several powerful packages but lacks the ability to efficiently
handle vectors, matrices and tensors. Two third party packages, numpy and scipy, offer
these abilities along with an extensive scientific library. This chapter will review some
of the basics but will come woefully short of covering the full range of available
functions.
Python uses the word array to mean a collection of similar data types. This includes
vectors, matrices and tensors. This text, though, may delineate these mathematical
entities even though Python simply refers to them as arrays.

11.1 Vector

A vector is a 1D array of numbers of the same type. As an example, the vector v is an
array of integers,

v = (1, 2, 4, 1, 4, 1, 1).
To create a vector in Python the numpy module is imported. The common practice is
shown in Code 11.1 in line 1 where the module is renamed to np. Line 2 uses the zeros
command to create a vector with 5 floats, and the value of each is 0 as shown in line 4.

Code 11.1 Creating a vector of zeros.

1 >>> import numpy as np


2 >>> vec = np . zeros ( 5 )
3 >>> vec
4 array ([ 0. , 0. , 0. , 0. , 0.])

There are three other methods by which vectors can be created. Line 1 in Code
11.2 uses the ones command to create a vector where all of the values are 1 instead of 0.

Line 4 uses the array command to create a vector from user defined data. Line 7 uses
the random.rand command to create a vector of random numbers with values between
0 and 1.

Code 11.2 Creating other types of vectors.

1 >>> vec = np . ones ( 5 )


2 >>> vec
3 array ([ 1. , 1. , 1. , 1. , 1.])
4 >>> vec = np . array ( (4 ,4 ,1 ,6) )
5 >>> vec
6 array ([4 , 4 , 1 , 6])
7 >>> vec = np . random . rand ( 3 )
8 >>> vec
9 array ([ 0.03332802 , 0.65907101 , 0.95803202])

Often the values of an array printed to the console show more digits than needed,
and so the print precision can be controlled as shown in Code 11.3. This uses the
set_printoptions function to set the nature of the output. This only affects the printing
of the values and not the precision used in computations.

Code 11.3 Setting the printing precision

1 >>> np . set_printoptions ( precision = 3 )


2 >>> vec
3 array ([ 0.033 , 0.659 , 0.958])

11.2 Matrix

A matrix is a 2D array of numerical values. An example of a 2 × 2 matrix is,


 
1 2
M= .
3 4

The same functions that create a vector can be used to create a matrix. The zeros
function is shown in Code 11.4. In this case the argument to the zeros function is the
tuple (2,3). This defines the vertical and horizontal dimension of the matrix.
There is a difference in generating a random matrix in that the ranf function is used
instead of the rand function. This is shown in Code 11.5.

Code 11.4 Creating a matrix.

1 >>> M = np . zeros ( (2 ,3 ) )
2 >>> M
3 array ([[ 0. , 0. , 0.] ,
4 [ 0. , 0. , 0.]])

Code 11.5 Creating a matrix of random values.

1 >>> M = np . random . ranf ( (2 ,3) )


2 >>> M
3 array ([[ 0.189 , 0.736 , 0.668] ,
4 [ 0.316 , 0.449 , 0.497]])

11.3 Slicing

Slicing of a vector behaves in the same way as does slicing for tuples, lists and strings.
Slicing for a matrix is different since there are multiple dimensions. Line 1 in Code 11.6
extracts the value from the first row and the first column. Line 3 extracts all of the values
from the first row. Line 5 extracts all of the values from the second column.

Code 11.6 Extracting elements.

1 >>> M [0 ,0]
2 0.18948171379575485
3 >>> M [0]
4 array ([ 0.189 , 0.736 , 0.668])
5 >>> M [: ,1]
6 array ([ 0.736 , 0.449])

Code 11.7 gets a sub-matrix from the original. In this example, the command
extracts the rectangle that includes rows 1 & 2 and columns 2 & 3.
The nonzero function returns locations of values that are not zero in an array. In
the example in Code 11.8 line 3 compares each value in the vector to 0.5. The answer
shown in line 4 places a True or False in the elements where the condition was met or failed.
In line 5 the nonzero function is added. This will return the positions in which the True
value was returned. For vectors it is necessary to put the [0] at the end of the function.
The answer in line 6 is a vector that indicates that the comparison was True in
positions 1 and 2.
The return from the nonzero function is a vector and that can be used to slice an
array. Line 1 in Code 11.9 is the same as line 5 in Code 11.8 except that the answer is
stored in a variable x. This x is a vector. In line 2 this vector is used as the index for

Code 11.7 Extracting a sub-matrix.

1 >>> P = np . random . ranf ( (5 ,5) )


2 >>> P
3 array ([[ 0.553 , 0.833 , 0.802 , 0.857 , 0.646] ,
4 [ 0.365 , 0.045 , 0.539 , 0.849 , 0.746] ,
5 [ 0.78 , 0.277 , 0.567 , 0.345 , 0.449] ,
6 [ 0.631 , 0.6 , 0.952 , 0.741 , 0.006] ,
7 [ 0.251 , 0.647 , 0.922 , 0.77 , 0.231]])
8 >>> P [1:3 ,2:4]
9 array ([[ 0.539 , 0.849] ,
10 [ 0.567 , 0.345]])

Code 11.8 Extracting qualifying indexes.

1 >>> vec
2 array ([ 0.033 , 0.659 , 0.958])
3 >>> vec > 0.5
4 array ([ False , True , True ] , dtype = bool )
5 >>> ( vec > 0.5) . nonzero () [0]
6 array ([1 , 2] , dtype = int64 )

the original data vector. The result is that this command returns the data that was
at positions 1 and 2.

Code 11.9 Extracting qualifying elements.

1 >>> x = ( vec > 0.5) . nonzero () [0]


2 >>> vec [ x ]
3 array ([ 0.659 , 0.958])

This feature is called random slicing because it has the ability to extract the elements
from an array in any specified order. Consider the first 6 lines in Code 11.10. Each pair
extracts one of the elements from the array named P. Lines 7 and 8 create lists which are
the coordinates that were used in lines 1, 3 and 5. Line 9 uses those coordinates to pull
out the same three values in a single command.

11.4 Mathematics and Some Functions

The advantage of arrays is the speed with which the computations can be performed. Con-
sider the simple task of adding the values of two matrices to create a third matrix. Lines
1 and 2 in Code 11.11 create 2 matrices. Lines 3 through 6 show the method of adding

Code 11.10 Modifying the matrix.

1 >>> P [0 ,1]
2 0.83306186238236724
3 >>> P [1 ,1]
4 0.044929981120311102
5 >>> P [3 ,0]
6 0.63136473831275342
7 >>> v = [0 ,1 ,3]
8 >>> h = [1 ,1 ,0]
9 >>> P [v , h ]
10 array ([ 0.833 , 0.045 , 0.631])

the two matrices together. The only problem with this approach is that it is slow. Line
8, on the other hand, is simpler to write and has a much faster execution time.

Code 11.11 Adding two matrices.

1 >>> m1 = np . random . ranf ( (4 ,5) )


2 >>> m2 = np . random . ranf ( (4 ,5) )
3 >>> m3 = np . zeros ( (4 ,5) )
4 >>> for i in range ( 4 ) :
5 for j in range ( 5 ) :
6 m3 [i , j ] = m1 [i , j ] + m2 [i , j ]
7

8 >>> m3 = m1 + m2
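The speed difference can be measured with the time module. This sketch uses rand in place of ranf to build the matrices, and the exact timings will vary by machine:

```python
import time
import numpy as np

m1 = np.random.rand(200, 300)
m2 = np.random.rand(200, 300)

# Explicit double loop, as in lines 4-6 of Code 11.11.
t0 = time.perf_counter()
m3 = np.zeros((200, 300))
for i in range(200):
    for j in range(300):
        m3[i, j] = m1[i, j] + m2[i, j]
loop_time = time.perf_counter() - t0

# Vectorized addition, as in line 8.
t0 = time.perf_counter()
m4 = m1 + m2
vec_time = time.perf_counter() - t0

print(loop_time > vec_time)   # the loop is far slower
print(np.allclose(m3, m4))    # both methods agree
```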

Code 11.12 creates two vectors in lines 1 and 2. Line 7 shows that with a simple
command several additions are performed.
Without arrays, the Python programmer would be required to perform this addition
with a for loop. The for loop does exist in line 7 but it is in the Fortran code that is
called when two arrays are added together. Of course, vectors can be subtracted and
multiplied as shown in Code 11.13.
The multiplication shown in line 4 is an elemental multiplication, meaning that each
element is multiplied by the respective element in the other vector. There are actually
four ways that two vectors can be multiplied together. The others are dot product, outer
product and cross product.
The dot product is also called the inner product, and the answer is a single scalar
value. The notation is

    v = a · b.

The Python script that performs this operation is shown in line 1 of Code 11.14.

Code 11.12 Addition of arrays.

1 >>> a = np . random . rand (3)


2 >>> b = np . random . rand (3)
3 >>> a
4 array ([ 0.677 , 0.671 , 0.939])
5 >>> b
6 array ([ 0.642 , 0.168 , 0.292])
7 >>> c = a + b
8 >>> c
9 array ([ 1.319 , 0.839 , 1.231])

Code 11.13 Elemental subtraction and multiplication.

1 >>> c = a -b
2 >>> c
3 array ([ 0.036 , 0.503 , 0.648])
4 >>> c = a * b
5 >>> c
6 array ([ 0.435 , 0.113 , 0.274])

Numpy also provides a function named dot that performs the same computation. Actually,
this function can also compute the outer product, the matrix-vector product, and the
vector-matrix product.

Code 11.14 Dot product.

1 >>> ( a * b ) . sum ()
2 0.82107982338393992
3 >>> a . dot ( b )
4 0.82107982338393992

The matrix-vector product is

    v = Mb,    (11.1)
which produces a vector as the output (Code 11.15). The dot function knows that this is
the operation to perform based on the dimensions of the inputs. If the arguments to the
dot function are both vectors then it performs the dot product. If the inputs are a matrix
and vector then it performs the matrix-vector multiply.
There are many functions that are applied to matrices and numpy has functions for
them. The transpose is shown in Code 11.16.
The inverse of a matrix is a much more complicated computation but is also very
useful. The call to the inverse function is shown in Code 11.17. Line 1 creates a square

Code 11.15 Matrix dot product.

1 >>> M = np . random . ranf ( (3 ,4) )


2 >>> v = np . random . rand ( 4 )
3 >>> M . dot ( v )
4 array ([ 0.787 , 1.609 , 1.044])

Code 11.16 Transpose.

1 >>> M
2 array ([[ 0.671 , 0.058 , 0.095 , 0.359] ,
3 [ 0.287 , 0.644 , 0.793 , 0.962] ,
4 [ 0.501 , 0.279 , 0.294 , 0.557]])
5 >>> M . transpose ()
6 array ([[ 0.671 , 0.287 , 0.501] ,
7 [ 0.058 , 0.644 , 0.279] ,
8 [ 0.095 , 0.793 , 0.294] ,
9 [ 0.359 , 0.962 , 0.557]])
10 >>> M . T
11 array ([[ 0.671 , 0.287 , 0.501] ,
12 [ 0.058 , 0.644 , 0.279] ,
13 [ 0.095 , 0.793 , 0.294] ,
14 [ 0.359 , 0.962 , 0.557]])

matrix (same dimension in the horizontal and vertical) and line 2 computes the inverse of
that matrix. The matrix-matrix multiplication of a matrix with its inverse produces the
identity matrix which has ones down the diagonal and zeros everywhere else.

Code 11.17 Matrix inversion.


1 >>> M = np . random . ranf ( (3 ,3) )
2 >>> Minv = np . linalg . inv ( M )
3 >>> M . dot ( Minv )
4 array ([[ 1.000e+00 , 0.000e+00 , 0.000e+00] ,
5         [ 0.000e+00 , 1.000e+00 , 0.000e+00] ,
6         [ 0.000e+00 , -3.553e-15 , 1.000e+00]])

There is also a large set of standard math functions that can be applied to a matrix.
In all cases these are applied to each element of the array. Examples are shown in Code
11.18; there are many more functions than those shown.

Code 11.18 Some functions.


1 >>> a = np . sqrt ( Q )
2 >>> b = np . sin ( Q )
3 >>> c = np . log ( Q )
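Since these functions apply element-by-element, a matrix with known values makes the behavior easy to confirm:

```python
import numpy as np

Q = np.array([[1.0, 4.0], [9.0, 16.0]])
print(np.sqrt(Q))           # [[1. 2.] [3. 4.]] -- square root of each element
print(np.log(np.ones(3)))   # log(1) = 0 for every element
```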

11.5 Information

There are several functions that extract information from an array. Some of these are
shown in Code 11.19. Line 1 computes the sum of the vector, line 3 computes the
average of the vector values, and line 5 computes the standard deviation.

Code 11.19 Retrieving information.

1 >>> vec . sum ()
2 1.650431048618342
3 >>> vec . mean ()
4 0.5501436828727807
5 >>> vec . std ()
6 0.38528624829533481

There are also functions to compute the max and the min.
A matrix has the same functions, but there are additional choices available. For
example, the sum function can be used to compute the sum of all of the elements in the
matrix as shown in line 1 in Code 11.20. It is also possible to compute the sum of the
columns as shown in line 3. Line 5 sums across the rows. The argument to the sum
function is the axis of the array. For a matrix the first axis is the vertical dimension and
the second axis is the horizontal dimension and thus they are assigned the values 0 and 1.
This logic applies to all of the functions that are shown in Code 11.19.

Code 11.20 Varieties of summation.


1 >>> M . sum ()
2 2.8552261438898965
3 >>> M . sum (0)
4 array ([ 0.505 , 1.185 , 1.165])
5 >>> M . sum (1)
6 array ([ 1.593 , 1.262])
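The same axis argument works for mean, std, max and the rest of these functions. A sketch with a known 2 × 3 matrix makes the axis convention concrete:

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(M.sum())     # 21.0, sum of all elements
print(M.sum(0))    # [5. 7. 9.], collapse the vertical axis (column sums)
print(M.sum(1))    # [ 6. 15.], collapse the horizontal axis (row sums)
print(M.mean(0))   # [2.5 3.5 4.5], column averages
```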

The max function returns the maximum value in an array but it does not indicate
where the maximum value is. The argmax function is used to get that information.
Consider the example in Code 11.21 where line 1 creates a vector of random numbers.
Line 4 returns the maximum value. Note that line 5 shows more precision than line 3
because the set_printoptions command applies only to arrays, whereas the value in line 5
is a plain float.
The argmax function is applied in line 6 and this indicates that the maximum value is
at location 4. Line 8 prints this element.

Code 11.21 Finding the maximum value.

1 >>> w = np . random . rand ( 5 )


2 >>> w
3 array ([ 0.596 , 0.378 , 0.8 , 0.823 , 0.952])
4 >>> w . max ()
5 0.95188740272391681
6 >>> w . argmax ()
7 4
8 >>> w [4]
9 0.95188740272391681

There is also an argmin and an argsort function. The argmin function behaves
just as the argmax function except that it seeks the minimum value. The argsort
function returns the sort order of the data as seen in Code 11.22. This result indicates
that the lowest value is w[1], the next lowest value is at w[0], and so on. The highest
value is at w[4].
The argmax function for a matrix requires a bit of decoding. It returns a single
value as shown in line 5 of Code 11.23. This value is the cell position in the flattened matrix and
can be decoded to reveal the row and column position of the max. The row number is the
division of the argmax value by the number of columns. In this case 5 (the number of
columns) goes into 6 (the argmax) 1 time. Thus, the max is on row 1 (the second row).

Code 11.22 Using argsort.

1 >>> w . argsort ()
2 array ([1 , 0 , 2 , 3 , 4] , dtype = int64 )
3 >>> w [1]
4 0.37750555674191966
5 >>> w [2]
6 0.79997899612310597

The remainder of this division (6 ÷ 5) is also 1, so the location of the max is in column
1. In this case, the max is at Q[1,1]. Both the division and remainder can be computed
by the divmod function as shown in line 6. This one command returns both the division
and the remainder which are also the vertical and horizontal position of the max.

Code 11.23 Using divmod.

1 >>> Q = np . random . ranf ( (5 ,5) )


2 >>> Q . max ()
3 0.9400810079992532
4 >>> Q . argmax ()
5 6
6 >>> divmod ( Q . argmax () , 5 )
7 (1 , 1)
8 >>> Q [1 ,1]
9 0.9400810079992532
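NumPy also provides the unravel_index function, which performs this divmod-style decoding for an array of any shape. A sketch with a small matrix of made-up values:

```python
import numpy as np

Q = np.array([[0.2, 0.9, 0.1],
              [0.4, 0.3, 0.8]])
flat = Q.argmax()                        # position in the flattened array
r, c = np.unravel_index(flat, Q.shape)   # decode to (row, column)
print(flat, r, c)    # 1 0 1
print(Q[r, c])       # 0.9
```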

11.6 Example: Extract Random Numbers Above a Threshold

The task is to gather all of the random numbers that are above the value of 0.5. Line 1 in
Code 11.24 performs half of the work. The array P is compared to a value of 0.5. All of
the elements that pass that test are set to True and the nonzero function extracts their
positions. Lines 2 through 7 show these positions. Line 8 performs the other half of the
work. The positions which were stored in v and h are used as indexes and the values at
those locations are captured by the variable vals. The numbers in the vector vals are all
of the numbers in P that were greater than 0.5.

Code 11.24 Extracting qualifying values.

1 >>> v , h = ( P > 0.5 ) . nonzero ()


2 >>> v
3 array ([0 , 0 , 0 , 0 , 0 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3 , 3 , 4 , 4 , 4] ,
4        dtype = int64 )
5 >>> h
6 array ([0 , 1 , 2 , 3 , 4 , 2 , 3 , 4 , 0 , 2 , 0 , 1 , 2 , 3 , 1 , 2 , 3] ,
7        dtype = int64 )
8 >>> vals = P [v , h ]
9 >>> vals
10 array ([ 0.553 , 0.833 , 0.802 , 0.857 , 0.646 , 0.539 , 0.849 , 0.746 ,
11         0.78 , 0.567 , 0.631 , 0.6 , 0.952 , 0.741 , 0.647 , 0.922 ,
12         0.77 ])
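The boolean matrix produced by the comparison can also be used directly as an index, which collects the qualifying values in one step without calling nonzero. A sketch with a small matrix of made-up values:

```python
import numpy as np

P = np.array([[0.1, 0.7],
              [0.9, 0.3]])
vals = P[P > 0.5]    # boolean-mask indexing gathers the qualifying values
print(vals)          # [0.7 0.9]
```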

11.7 Indices

Consider a case where the task is to examine the pixels that surround a face in an image. There
are several steps required to accomplish this task. First the image is converted to a matrix
(next chapter) and then a face-finding algorithm is applied. The face-finding algorithm is
not perfect and will produce false positives, and therefore it is necessary to analyze the pixels
that surround the suspected face. This section considers the problem of extracting just
those pixels, as shown by the circle in Figure 11.1.
The indices function creates two matrices as shown in Code 11.25. One of the
matrices has increasing values down the rows and the other has increasing values across
the columns.
This is an extremely useful function that can be the foundation of isolating elements
in a matrix. Consider Code 11.26 which subtracts an integer from each matrix. This shifts
the row and column that contain 0 to new locations. The values in the first matrix are
the distances from the 0 row and the values in the second matrix are the distances to the
0 column.
Recall that the equation to compute a linear (Euclidean) distance is

    d = √(x² + y²).    (11.2)

There is a single element at which both matrices contain a 0. This is the defined
center, and the distance from that point to any other point is computed by the Euclidean
distance equation above.
All of these distances can be computed in a single command as shown in line 1 of
Code 11.27. The output d is a matrix and each element contains the distance to the center.

Figure 11.1: Isolating the pixels about the face.

Code 11.25 Using the indices function.

1 >>> a , b = np . indices ( P . shape )


2 >>> a
3 array ([[0 , 0 , 0 , 0 , 0] ,
4 [1 , 1 , 1 , 1 , 1] ,
5 [2 , 2 , 2 , 2 , 2] ,
6 [3 , 3 , 3 , 3 , 3] ,
7 [4 , 4 , 4 , 4 , 4]])
8 >>> b
9 array ([[0 , 1 , 2 , 3 , 4] ,
10 [0 , 1 , 2 , 3 , 4] ,
11 [0 , 1 , 2 , 3 , 4] ,
12 [0 , 1 , 2 , 3 , 4] ,
13 [0 , 1 , 2 , 3 , 4]])

Code 11.26 Shifting the arrays.

1 >>> a-2
2 array ([[-2 , -2 , -2 , -2 , -2] ,
3 [-1 , -1 , -1 , -1 , -1] ,
4 [ 0, 0 , 0 , 0 , 0] ,
5 [ 1, 1 , 1 , 1 , 1] ,
6 [ 2, 2 , 2 , 2 , 2]])
7 >>> b-2
8 array ([[-2 , -1 , 0, 1, 2] ,
9 [-2 , -1 , 0, 1, 2] ,
10 [-2 , -1 , 0, 1, 2] ,
11 [-2 , -1 , 0, 1, 2] ,
12 [-2 , -1 , 0, 1, 2]])

Code 11.27 The distances.


1 >>> d = np . sqrt ( ( a-2) **2 + ( b-2) **2 )
2 >>> d
3 array ([[ 2.828 , 2.236 , 2. , 2.236 , 2.828] ,
4 [ 2.236 , 1.414 , 1. , 1.414 , 2.236] ,
5 [ 2. , 1. , 0. , 1. , 2. ],
6 [ 2.236 , 1.414 , 1. , 1.414 , 2.236] ,
7 [ 2.828 , 2.236 , 2. , 2.236 , 2.828]])

The next goal is to compute the average of the values that are within a distance of
10 from a defined central point.
This example uses a very small matrix, but now consider a much larger matrix that
goes through the same process. In the next chapter, images will be loaded and the pixel
values will be converted to a very large matrix. As an example, the programmer wants to
gather all pixels within a specified distance to a defined point. Defining those points can
be done by the method that was just described.
A smaller version is shown in Code 11.28 where the input data is created in line 1.
The desire is to define the central point at (50,40) which is not the center of the matrix.
However, it is from this point that we wish to gather all of the elements that are within a
distance of 10. Lines 2 through 4 create the two matrices that will be used to calculate the
distances as in line 5. The matrix dist contains distances from each element to the defined
point (50,40). Any pixel that is a distance less than 10 is one that is to be gathered. The
matrix d computed in line 6 has elements that are True if the distance to the central point
is less than 10.

Code 11.28 The average of an area.

1 >>> data = np . random . ranf ( (100 ,100) )


2 >>> a , b = np . indices ( data . shape )
3 >>> a = a-50
4 >>> b = b-40
5 >>> dist = np . sqrt ( a **2 + b **2)
6 >>> d = dist < 10
7 >>> v , h = d . nonzero ()
8 >>> avg = data [v , h ]. mean ()
9 >>> avg
10 0.5120932581356894

Line 7 collects the coordinates of these points and line 8 collects the values of the
pixels and computes the average.
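Lines 6 through 8 can be condensed by using the boolean matrix itself as the index. A sketch of the same computation, using np.random.random (equivalent to the ranf call in the text) and an assumed seed so that it is repeatable:

```python
import numpy as np

np.random.seed(42)                      # assumed seed, only for repeatability
data = np.random.random((100, 100))
a, b = np.indices(data.shape)
dist = np.sqrt((a - 50)**2 + (b - 40)**2)
avg = data[dist < 10].mean()            # mask indexing replaces nonzero()
print(avg)                              # some value between 0 and 1
```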

11.8 Example: Simultaneous Equations

Consider the case of two equations with two unknowns.

3.1x + 2.8y = −1

and
1.2x − 0.9y = 3.

The task is to find the values of x and y that satisfy both equations. This can be
solved by a matrix inverse. The matrix holds the coefficients (the numerical values on the left),

    [ 3.1   2.8 ] [ x ]   [ -1 ]
    [ 1.2  -0.9 ] [ y ] = [  3 ].

Represent the matrix as M and the equation becomes

    M [ x ]   [ -1 ]
      [ y ] = [  3 ].

The unknowns are x and y, and so the task is to isolate them from all other
components. This is accomplished by left-multiplying both sides of the equation by the
inverse of M. Then the equation becomes

    [ x ]        [ -1 ]
    [ y ] = M⁻¹  [  3 ].

Thus, the solution to x and y can be obtained by computing the inverse of the matrix
and then performing a matrix-vector multiply. The result is a vector and those elements
are x and y. The solution is shown in Code 11.29. The values of x and y are 1.22 and
-1.707. Line 9 checks the result using 3.1x + 2.8y = −1.

Code 11.29 Solving simultaneous equations.

1 >>> M = np . array ( ((3.1 ,2.8) ,(1.2 ,-0.9) ) )


2 >>> M
3 array ([[ 3.1 , 2.8] ,
4 [ 1.2 , -0.9]])
5 >>> Minv = np . linalg . inv ( M )
6 >>> c = np . array ((-1. , 3) )
7 >>> Minv . dot ( c )
8 array ([ 1.22 , -1.707])
9 >>> 3.1 * 1.22 + 2.8*(-1.707)
10 -0.9976
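For systems like this, NumPy also offers linalg.solve, which obtains the same answer in one step and is generally preferred numerically over forming an explicit inverse:

```python
import numpy as np

M = np.array([[3.1, 2.8], [1.2, -0.9]])
c = np.array([-1.0, 3.0])
xy = np.linalg.solve(M, c)   # solves M times [x, y] = c
print(np.round(xy, 3))       # approximately [ 1.22  -1.707]
```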

Figure 11.2: The electric circuit.

This has very practical uses. Consider the electronic circuit shown in Figure 11.2.
The problem gives the values for the resistors and the voltages. The task is to solve for
the current that goes through each resistor.
The solution follows Kirchhoff's laws, which produce three equations,

I1 + I2 − I3 = 0

−I1 R1 + I2 R2 = −E1 + E2
and
I2 R2 + I3 R3 = E2 .
Here there are three equations and three unknowns (the currents I). Thus, a 3 × 3 matrix
is constructed from the coefficients.
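The rest of the solution follows the same pattern as Code 11.29. The component values below (R1 = R2 = R3 = 1 Ω, E1 = 2 V, E2 = 3 V) are hypothetical numbers chosen only for illustration and are not taken from Figure 11.2:

```python
import numpy as np

# Hypothetical values for illustration: R1 = R2 = R3 = 1 ohm, E1 = 2 V, E2 = 3 V.
R1, R2, R3, E1, E2 = 1.0, 1.0, 1.0, 2.0, 3.0

M = np.array([[ 1.0,  1.0, -1.0],   # I1 + I2 - I3 = 0
              [ -R1,   R2,  0.0],   # -I1*R1 + I2*R2 = -E1 + E2
              [ 0.0,   R2,   R3]])  # I2*R2 + I3*R3 = E2
c = np.array([0.0, -E1 + E2, E2])

currents = np.linalg.inv(M).dot(c)  # the currents [I1, I2, I3]
print(np.round(currents, 3))        # approximately [0.333 1.333 1.667]
```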

Problems

1. Create a vector of 1000 random numbers. Compute the average of the square of this
vector.

2. Create a 5 × 5 matrix of random numbers. Compute the inverse of the matrix. Show
that the multiplication of the inverse with the original is the identity matrix.

3. Create a vector of 1000 elements ranging from 0 to π. Compute the average of the
cosine of these values. This should be performed in two lines of Python script.

4. Create a 5 × 4 matrix of random values from ranging from -1 to +1. Compute the
sum of the rows.

5. Create a 100 × 100 matrix using a random seed of your choice. Using divmod find
the location of the maximum value. Print the random seed, the location of the max
and the value of the max.

6. Given 1.63x − 0.43y = 0.91 and 0.64x + 0.87y = 0.19, write a Python script that uses
the method of simultaneous equations to determine the values of x and y.

Chapter 12

Python Functions and Modules

A function is used to contain steps that are used repeatedly. Instead of writing each
individual line of code, the user only needs to call the function. Several functions
have already been used, but this chapter will demonstrate how functions and modules are
created.

12.1 Functions

There are several parts to a function:

• Name of the function
• (optional) Arguments (or inputs) to the function
• (optional) Help comments
• Commands
• (optional) Return statement

12.1.1 Basic Function

Code 12.1 shows a bare-bones function. Line 1 uses the def keyword to declare the
creation of a function. The name of the function in this case is MyFunction. It does not
receive any inputs (empty parenthesis) and the declaration is followed by a colon. Line
2 is indented and it is the first command inside of the function. Line 3 is also indented
and therefore it is also a command inside the function. In most editors simply typing two
Returns will end the indentation and thus end the creation of the function.
The function has been created but has not been executed. Line 5 is at the command
prompt in the Python shell and the function is now called. Lines 6 and 7 show that the
commands inside of the function are executed. It is required to have the parenthesis after
the call to the function as shown in Line 5. If these are omitted then Python will return

information about the function but will not run its commands.

Code 12.1 A basic function.


1 def MyFunction () :
2 print ( ' Inside ' )
3 print ( ' the function ' )
4

5 >>> MyFunction ()
6 Inside
7 the function

12.1.2 Local and Global Variables

Consider Code 12.2 which defines a variable inside the function in Line 2. The function
is called in Line 4 and in Line 5 there is an attempt to access the variable ab. However,
an error is created. The variable ab is a local variable since it is declared inside of the
function. It only exists inside of the function and is not accessible outside of the function.

Code 12.2 Attempting to access a local variable outside of the function.

1 def Fun7 () :
2 ab = 9
3

4 >>> Fun7 ()
5 >>> ab
6

7 Traceback ( most recent call last ) :


8 File " < pyshell #89 > " , line 1 , in < module >
9 ab
10 NameError : name ' ab ' is not defined

A global variable is defined in Line 1 of Code 12.3. This is defined outside of the
function and is visible inside of the function (Line 3) as well as in the Python shell.
It is possible to declare a global variable inside of a function as shown in Code 12.4.
Line 2 uses the global function to create the global variable abc. It is available outside of
the function as shown in Line 6. The global function must be the first command inside
of the function.

Code 12.3 Executing a function.

1 >>> b = 9
2 >>> def Fun8 () :
3 print ( 7 + b )
4

5 >>> Fun8 ()
6 16

Code 12.4 Using the global command.

1 def Fun9 () :
2 global abc
3 abc = 10
4

5 >>> Fun9 ()
6 >>> abc
7 10

12.1.3 Arguments

Inputs to a function are called arguments. Code 12.5 shows a new function which receives
a single input, which in this case is the variable ab. The function is called in Line 4 and
this time the user is required to give the function an argument. The variable ab becomes
the integer 5. The function is called again in Line 6 and this time the argument ab is the
string “hi there”.

Code 12.5 Using an argument.

1 def Fun1 ( ab ) :
2 print ( ab )
3

4 >>> Fun1 (5)


5 5
6 >>> Fun1 ( ' hi there ' )
7 hi there

Some languages are strictly typed, which carries several restrictions, including the
declaration of a variable's type when it is used as an argument to a function. When creating
a function like this in a language like C++ or Java, the programmer would be required to declare the data
type for ab. If it is declared as an integer then it would not be possible to pass a string
to the function. Python is loosely typed, and so the variable ab assumes the data type
of the argument that is passed to it. There are advantages and disadvantages to these

philosophies. It is easier to have errors in a loosely typed system, as the language will
allow the passing of a variable other than the one the programmer intended. However, in a
strictly typed system the programmer may need to write more functions to accommodate
the multiple types of arguments that could be passed to a function.
Code 12.6 shows a function that receives two arguments that are separated by a
comma. Line 5 calls this function and as seen there are two arguments sent to the function.
Code 12.7 attempts to call this function with two different arguments. In Line 1 the two
arguments are strings and instead of adding two integers this function now concatenates
two strings. Line 2 in Code 12.6 is the command that is used to concatenate two strings.
See Code 6.33.

Code 12.6 Using two arguments.

1 def Fun2 ( a , b ) :
2 c = a + b
3 print ( c )
4

5 >>> Fun2 ( 5 , 6 )
6 11

Code 12.7 Incorrect use of an argument.

1 >>> Fun2 ( ' hi ' , ' lo ' )


2 hilo
3 >>> Fun2 ( ' hi ' , 5 )
4

5 Traceback ( most recent call last ) :


6 File " < pyshell #22 > " , line 1 , in < module >
7 Fun2 ( ' hi ' , 5 )
8 File " < pyshell #19 > " , line 2 , in Fun2
9 c = a + b
10 TypeError : cannot concatenate ' str ' and ' int ' objects

Line 3 attempts to call the same function and now the arguments are a string and
an integer. Python does not add an integer to a string and so an error is caused. Note
that this error indicates that the problem is in Fun2 and that it occurs on Line 2 in that
function. It even shows the offending line and provides a clue as to what the problem is.

12.1.4 Default Argument

A default argument has a default definition that can be overridden by the user. An example
is shown in Code 12.8. In Line 1 the function has two arguments and the second uses an

equals sign to give the variable a default value. The function is called in Line 4, and the
inputs to the function are a = 9 and b = 5, the default value. Line 6 gives the function
two arguments, and in this case b = −1. Default arguments have already been used. See
Code 7.11 in which the range function is shown with different numbers of arguments.

Code 12.8 A default argument.

1 def Fun5 ( a , b =5 ) :
2 print ( a , b )
3

4 >>> Fun5 ( 9 )
5 9 5
6 >>> Fun5 ( 9 , -1 )
7 9 -1

A default argument must be the last argument in the input stream. It is possible
to have multiple defaults as shown in Code 12.9. Here both b and c have default values.
If a function call has only 2 inputs then the second will be assigned to b. Line 10 shows a
case in which the default value for b is used and the value for c is overridden.

Code 12.9 Multiple default arguments.

1 def Fun6 ( a , b =5 , c =9 ) :
2 print ( a , b , c )
3

4 >>> Fun6 ( 2 )
5 2 5 9
6 >>> Fun6 ( 2, 3 )
7 2 3 9
8 >>> Fun6 ( 2 , 3 , 4)
9 2 3 4
10 >>> Fun6 ( 2 , c =-1)
11 2 5 -1

12.1.5 Help Comments

Figure 12.1 shows an interaction in the IDLE shell. The user has typed in a command
and the left parenthesis. If the user waits then a help balloon appears. This provides terse
information on the arguments that can be used in the function.
The help function provides even more information on a function, as shown in Code
12.10. To create help balloons and descriptions for a user-defined function, these
instructions are placed as the first component of the function, as shown in Code 12.11.
Line 2 starts with three double-quotes. In this example there are three lines of
instructions, and the last line ends with three double-quotes.

Figure 12.1: A help balloon.

Code 12.10 The help function.

1 >>> help ( range )


2 Help on built-in function range in module __builtin__ :
3

4 range (...)
5 range ([ start ,] stop [ , step ]) -> list of integers
6

7 Return a list containing an arithmetic progression of integers .
8 range (i , j ) returns [i , i +1 , i +2 , ... , j-1]; start (!) defaults to 0.
9 When step is given , it specifies the increment ( or decrement ) .
10 For example , range (4) returns [0 , 1 , 2 , 3]. The end point is omitted !
11 These are exactly the valid indices for a list of 4 elements .

Code 12.11 Adding comments.

1 def Fun2 ( a , b ) :
2 """ First line
3 Second line
4 Third line """
5 c = a + b
6 print ( c )

Now, when the function is typed with the first parenthesis, the first help line appears
in the balloon as shown in Figure 12.2. The help function will print out all of the lines.

12.1.6 Return

The return command returns values from the function. This command is usually at the
end of the function, for when it is executed the call to the function ends. An example

Figure 12.2: A help balloon.

Code 12.12 Using help on a user-defined function.

1 >>> help ( Fun2 )


2 Help on function Fun2 in module __main__ :
3

4 Fun2 (a , b )
5 First line
6 Second line
7 Third line

is shown in Code 12.13. Line 3 has the return statement. Line 5 shows the call to the
function and this time the function will return a value which is placed into d.

Code 12.13 Using the return command.

1 def Fun3 ( a ) :
2 c = a + 9
3 return c
4

5 >>> d = Fun3 ( 3 )
6 >>> d
7 12

One of the unusual properties of Python is that it can essentially return multiple
items. Consider Code 12.14, which shows the return statement with two variables in Line
4. This function is called in Line 6 and, as seen, two items are returned. In reality, the
function is only returning one item: a tuple that contains the two values. Line 11
receives only one item, and the following lines show that z is actually a tuple.

12.1.7 Designing a Function

Functions can be designed to perform numerous tasks and creating such a function can be
difficult. The best idea is to plan the function before writing code. An example is shown in
Code 12.15. Here a function is declared followed by several comment statements. These
are the jobs that the function will eventually accomplish. For now, though, these are
merely ideas. The last line uses the pass statement, which does nothing. A function must
Code 12.14 Returning multiple values.

1 def Fun4 ( a , b ) :
2 c = a + b
3 d = a - b
4 return c , d
5

6 >>> m , n = Fun4 ( 5 , 6)
7 >>> m
8 11
9 >>> n
10 -1
11 >>> z = Fun4 ( 5 , 6 )
12 >>> type ( z )
13 <class 'tuple'>
14 >>> z
15 (11 , -1)

have at least one command, and so the pass statement is put here as a placeholder. Once
the real commands are entered, the pass statement can be removed.

Code 12.15 Function outlining.

1 def WordList ( fname ) :


2 # load
3 # convert to lowercase
4 # remove punctuation
5 # split
6 # return
7 pass

Now that the function is planned it is possible to start writing Python commands.
A good practice is to perform one task at a time and then test the code. Code 12.16 shows
this by example. Line 3 is an actual Python command that will load the file. Line 9 is
then called to test the new function. No errors are returned which is one requirement for
correct code.
The commands for each idea are then placed in the function and tested. The final
result is shown in Code 12.17.
Now that the function is created it is easy to apply all of the commands therein to
separate inputs. Consider Code 12.18 which calls the function WordList in line 1. The
argument is the file that contains the text for Romeo and Juliet. It returns 25,640 unique
words. The function is called again in Line 4 and the only difference is the name of the

Code 12.16 Adding a command.

1 def WordList ( fname ) :


2 # load
3 data = open ( fname ) . read ()
4 # convert to lowercase
5 # remove punctuation
6 # split
7 # return
8

9 >>> WordList ( ' data / romeojuliet . txt ' )

Code 12.17 Adding the rest of the commands.

1 def WordList ( fname ) :
2     # load
3     data = open ( fname ) . read ()
4     # convert to lowercase
5     data = data . lower ()
6     # remove punctuation
7     table = str . maketrans ( "!'&=,.;:?[]-" , "XXXXXXXXXXXX" )
8     data2 = data . translate ( table )
9     data2 = data2 . replace ( 'X' , ' ' )
10     # split
11     words = data2 . split ()
12     # return
13     return words

input file. This time 18,092 words are found in Macbeth.

Code 12.18 Example calls of a function.

1 >>> words = WordList ( ' data / romeojuliet . txt ' )


2 >>> len ( words )
3 25640
4

5 >>> words = WordList ( ' data / macbeth . txt ' )


6 >>> len ( words )
7 18092
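The lists returned above contain every word, including repeats. If unique words are wanted, the list can be passed through Python's set type, which discards duplicates. A sketch with a made-up word list:

```python
words = ['the', 'cat', 'and', 'the', 'dog']
unique = set(words)          # duplicates are discarded
print(len(words))            # 5 words in total
print(len(unique))           # 4 unique words
print('the' in unique)       # True
```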

12.2 Modules

A module is a Python file that can be created by the user. This file can contain Python
definitions, commands, declarations and functions. Basically, anything that can be typed
into a Python shell can be placed in a module. The module file is then stored for future
use.
Before modules are created it is prudent to create a proper working directory. An
example is shown in Figure 12.3. At the top it is seen that the file manager is in the
C:/science/ICMsigs directory. Inside of this directory are several subdirectories shown as
icons. This is a standard set of subdirectories for a working directory. For this discussion
the important subdirectory is named pysrc. It is this directory where the researcher
working on the ICMsigs project will place their Python modules.

Figure 12.3: Directory structure.

When Python is started it is necessary to move it to the working directory and then
to include the pysrc subdirectory in the search for modules. The steps are shown in Code
12.19. Line 1 imports two modules. Line 2 moves Python to this researcher’s working

directory. Line 3 includes the pysrc subdirectory in the search path. Now, when the user
employs the import command it will also search the pysrc directory for modules.

Code 12.19 The os and sys modules.

1 >>> import os , sys


2 >>> os . chdir ( ' C :/ science / ICMsigs ' )
3 >>> sys . path . append ( ' pysrc ' )

The IDLE environment does have a code editor and new files can be created by
selecting File:New as shown in Figure 12.4. The new file is blank and ready for editing.

Figure 12.4: Creating a new file in IDLE.

Python commands can be entered into the editor as shown in Figure 12.5. In this
case there is a variable definition, a function definition, and the execution of the function.
This file is stored in the pysrc directory and the extension “.py” is required.
Now, when the import function is called the module residing in the pysrc directory
is loaded as shown in Code 12.20.
The module can be altered as shown in Figure 12.6. If the module is changed, then
in Python 2.7 the reload command loads the new code. In Python 3.x this was modified, and
it is now necessary to import the importlib module and call its reload function (Code 12.21).
An alternative method for loading a module is to use the from ... import command
as shown in Code 12.22. In this case it is not necessary to type first.vara to access the
variable. However, if the module is later altered, this method does not offer a clean way
to reload it.

Figure 12.5: Contents of a module.

Code 12.20 Importing a module.

1 >>> import first


2 hi there
3 >>> first . vara
4 8
5 >>> first . Fun10 ( ' George Mason ' )
6 George Mason

Figure 12.6: Changing the contents of a module.

Code 12.21 Reloading a module.

1 >>> reload ( first ) # py .27


2 >>> import importlib # py 3. x
3 >>> importlib . reload ( first )

Code 12.22 Using the from ... import construct.

1 >>> from first import vara


2 >>> vara
3 8

The final method of loading a module is to execute the file. Python 2.7 offers the
execfile command as shown in line 1 of Code 12.23. This is equivalent to typing all of the
lines in the file myfile.py into the Python shell. This function does not use the search path
and so it is necessary to define the directory location and to use the extension “.py” as
shown. This command is not available in Python 3.x and so the alternative is to read the
file using open...read and then to use the exec function to execute all of the commands
in the file.

Code 12.23 Executing a file.

1 >>> execfile ( ' mydir / myfile . py ' ) # py 2.7


2 >>> exec ( open ( ' mydir / myfile . py ' ) . read () ) # py 3. x

These commands are useful when developing code that needs to be tested constantly
during development. However, once the code is running correctly, the import
statements should be used.

12.3 Problems

1. Create a function named Aper that receives a single argument named indata. This
function should print to the console the string “The input is: ” followed by the value
of indata.

2. Create a function like the previous but it prints the value of indata three times, each
on a separate line.

3. Create a function named Larger which receives two arguments, adata and bdata.
The function should return the larger of the two values.

4. Create a function named Complement that receives a DNA string and returns the
complement of that string.

5. Create a function that has as its argument a default filename (such as the file for
Romeo and Juliet). The function should return the length of the file (number of char-
acters in the file). Run the function again with a different filename which overrides
the default filename.

6. Create two functions. The first is BF which receives a string and converts the letters
to all capitals. The second is BA which receives a string and removes the spaces.
Then it passes that string to BF and receives the result. The function BA should
then return the resultant string which should be all caps and have no spaces.

Chapter 13

Object Oriented Programming

Object-oriented programming is a method that organizes thoughts into classes. Some


languages like Java and C# require that all data and functions belong to an object. The
C++ language provides a tremendous avenue for generating very useful objects but does
not require that objects be used. Objects are very good for large programs that are
constructed from different entities as they are a good way to keep complicated thoughts
organized.
Python also provides ways to construct objects, but the advantage of using objects in
Python is smaller than in other languages, and heavy use of objects can lead to slow
program execution.

13.1 Justification

A class is an entity that can contain data and related functions. The common example is
that of creating an address book. An entry in the address book would contain a person’s
name, address, telephone numbers, birth date and so on. The class would also contain
functions that manipulate this data. These functions could be as simple as entering data
or storing the data on a disk. The functions should be those that operate on a single
address book entry and not on a group of entries.
An object is an instance of a class. In the example of an address book, one entry
is for a person named Matthew Luedecke and another for Aimee Harper. Each of these
persons requires their own instance of an address card. So, in this example there are two
objects of the address book class.
Classes can also be built on other classes. Thus, if a company was putting together
a database of employees and customers then it is possible to use the address book class
as a building block. Both employees and customers have the information of an address
book but they also have information that is unique. Employees could have information on
their pay rate and rank, whereas the customers could have information on their purchase

history. Both, though, would need the information from the address book. In this case, an
address book class is created. Then an employee class is created that inherits the address
book class. The programmer creating the employee class would not need to program the
functions that deal with an address book. These classes are building blocks for a larger
program and are easier to create and much easier to maintain than traditional coding.
There are drawbacks to the use of classes particularly in Python. Objects tend to
run slower than other methods of programming. In a scripting language like Python there
is also the inconvenience of persistence. Consider a case in which a programmer writes
function F1 which produces data for a second function F2. However, after running F2
the user decides that there is an error that needs to be fixed and then F2 needs to be run
a second time. In Python this is easily accomplished without requiring that F1 be run
again. If all of these functions and data were contained within a class then the situation
is different. The function F2 in a class is rewritten but then the entire object will need
to be reloaded which will eliminate any data stored in the previous instance of the class.
That means that F1 would have to rerun to generate the data stored in the new instance
of the object. So, during the code development stage, a Python programmer may find
more convenience in developing the functions without using object-oriented skills. Once
the functions are bug-free then a class can be created.

13.2 Basic Contents

Data and functions are the two basic components of a class. Both should be dedicated to
the purpose of the class. In the example of the address book, both the data and functions
should be dedicated to the contents and manipulation of a single entry in the address book.
Functions that deal with multiple address entries or the use of address information
for analysis using non-address data should exist elsewhere. Adherence to this restriction
is paramount in keeping large programs organized.

13.2.1 Class with a Function

A very simple class is shown in Code 13.1. Line 1 shows the keyword class which indicates
that a class is being defined. This is followed by the user defined name for that class.
Following that are definitions of data and functions. In this case there is only a single
function which starts on line 2. Note that this is indented thus indicating that the function
is a part of the class.
The first argument in every function is a variable named self which is discussed
in the next section. Following that are the input variables. This function does very little
except that it sets a variable named self.a to the value of the input variable ina. Line
5 shows the creation of an instance of the class. The variable v1 is not a float or integer,
but rather it is a MyClass. It contains a single function which is called in line 6. Note
that there is only one argument in this call. The variable self does not receive an input

Code 13.1 A very basic class.

1 >>> class MyClass :


2 def Function ( self , ina ) :
3 self . a = ina
4

5 >>> v1 = MyClass ()
6 >>> v1 . Function (4)
7 >>> v1 . a
8 4
9 >>> v2 = MyClass ()
10 >>> v2 . Function (-1)
11 >>> v2 . a
12 -1

argument from the user. Line 7 shows that v1 now contains a variable named a which
has a value of 4. Starting in line 9 is the creation of a second instance of MyClass.
Actually, this type of usage has been seen before. Code 13.2 shows the string find
function. The string a has data and associated functions such as find.

Code 13.2 A string example.

1 >>> a = ' this is a string '


2 >>> a . find ( ' i ' )
3 2

13.2.2 Self

Perhaps the most confusing aspect of object-oriented programming is the concept of self
(or *this in C++ or this in Java). Since a class may have several instances it is important
to delineate the variables inside of a function. Consider the class shown in Code 13.3 in
which Line 2 defines a variable that belongs to the class. Line 3 defines a function that will
receive a second instance of the class and add their two variables. Consider line 7 which
creates the object m1 and sets the variable to a value of 5. Line 9 creates a second object
and line 10 sets its variable to 9. Line 11 calls the function. This function belongs to m1
and the input to the function is m2. In line 4 the self.a is the variable for m1 because this
call to the function belongs to m1 (from line 11). The variable mc.a in line 4 is associated
with m2. So, in this example, self.a = 5 and mc.a = 9.

Code 13.3 Demonstrating the importance of self.

1 >>> class MyClass :


2 a = 0
3 def Function ( self , mc ) :
4 c = self . a + mc . a
5 return c
6

7 >>> m1 = MyClass ()
8 >>> m1 . a = 5
9 >>> m2 = MyClass ()
10 >>> m2 . a = 9
11 >>> m1 . Function ( m2 )
12 14

13.2.3 Global and Local Variables

A local variable is one that exists only inside of a function and a global variable is one that
can be seen outside of the function. Consider Code 13.4 which has a function inside of
the class. The variable c is a global variable and is accessible to all functions inside of the
class as well as accessible outside of the class. As shown in line 5, the access inside of the
function uses self.c. Line 16 shows access to the variable in the object.

Code 13.4 Distinguishing local and global variables.

1 >>> class MyClass :


2 c = 4
3 def Function ( self , ina ) :
4 self . a = ina
5 b = ina + self . c
6

7 >>> v1 = MyClass ()
8 >>> v1 . Function (4)
9 >>> v1 . a
10 4
11 >>> v1 . b
12 Traceback ( most recent call last ) :
13 File " < pyshell #148 > " , line 1 , in < module >
14 v1 . b
15 AttributeError : ' MyClass ' object has no attribute ' b '
16 >>> v1 . c
17 4

The variable self.a is also a global variable since it has self. in its declaration.
The variable b, on the other hand, is a local variable. It is used inside of the function
and once the program exits the function the variable ceases to exist. As seen in line 11
an attempt to access this variable results in an error because it was destroyed when line
8 finished its execution.

13.2.4 Operator Overloading

There are several predefined operators in Python. For example, the addition of two floats
uses the plus sign which is an operator. Somewhere along the line the computer must have
a definition of what to do when it sees the combination of a float, a plus sign, and a float.
It is possible to define the operator for a class. Consider the case of the address book
entries. One entry was for Aimee and another for Matthew. Theoretical code (code that
does not exist) is shown in Code 13.5. The address book entries for Aimee and Matthew
are created and in line 6 they are added. The programmer would have to define what is
meant by the addition of two addresses. Perhaps the function will create a new person
taking the first name from one person and the last name from the other. In fact, line 7 is
an overload of the string function which is used by print.

Code 13.5 Theoretical code showing implementation of a new definition for the addition
operator.

1 # theoretical code
2 >>> person1 = AddressBookEntry ( )
3 >>> person1 . SetName ( ' Aimee ' , ' Harper ' )
4 >>> person2 = AddressBookEntry ( )
5 >>> person2 . SetName ( ' Matthew ' , ' Luedecke ' )
6 >>> clone = person1 + person2
7 >>> print ( clone )
8 Aimee Luedecke

A simple example is shown in Code 13.6 with the function __add__, which has
two underscores before and two after the name add. This function will define the addition
operator for the class. This operator must receive one argument besides self which is the
data from the right side of the plus sign. Line 5 creates the class and line 6 changes the
value of the variable. Line 7 uses the plus sign. The value of 6 is to the right of the plus
sign and so d = 6 in line 4. Since v1 is to the left of the plus sign, the self.a will be
v1.a.
There are many different operators that can be overloaded. Table 13.1 shows a
subset of the possibilities.
Code 13.7 shows four more overloaded functions that are not in the above table.
The first one is __init__, which is the constructor function. This function is automatically

Code 13.6 Overloading the addition operator.

1 >>> class MyClass :


2 a = 0
3 def __add__ ( self , d ) :
4 return self . a + d
5 >>> v1 = MyClass ()
6 >>> v1 . a = 7
7 >>> v1 + 6
8 13

Table 13.1: Operators than can be overloaded.

Name Symbol Function


Addition p1 + p2 p1.__add__(p2)
Subtraction p1 - p2 p1.__sub__(p2)
Multiplication p1 * p2 p1.__mul__(p2)
Power p1 ** p2 p1.__pow__(p2)
Division p1 / p2 p1.__truediv__(p2)
Remainder (modulo) p1 % p2 p1.__mod__(p2)
Bitwise AND p1 & p2 p1.__and__(p2)
Bitwise OR p1 | p2 p1.__or__(p2)
Bitwise XOR p1 ^ p2 p1.__xor__(p2)
Bitwise NOT ~p1 p1.__invert__()
Less than p1 < p2 p1.__lt__(p2)
Less than or equal to p1 <= p2 p1.__le__(p2)
Equal to p1 == p2 p1.__eq__(p2)
Not equal to p1 != p2 p1.__ne__(p2)
Greater than p1 > p2 p1.__gt__(p2)
Greater than or equal to p1 >= p2 p1.__ge__(p2)

called when an object is created. Line 14 creates an object and that line will also execute
line 3 which creates a list with N entries that are all 0.

Code 13.7 Examples for overloading slicing and string conversion.

1 >>> class MyVector :


2 def __init__ ( self , N ) :
3 self . vec = N *[0]
4 def __setitem__ ( self , n , val ) :
5 self . vec [ n ] = val
6 def __getitem__ ( self , n ) :
7 return self . vec [ n ]
8 def __str__ ( self ) :
9 tempstr = ' Values : '
10 for i in self . vec :
11 tempstr += ' : ' + str ( i )
12 return tempstr
13

14 >>> v1 = MyVector (5 )
15 >>> v1 [1] = 9
16 >>> v1 [1]
17 9
18 >>> print ( v1 )
19 Values : : 0 : 9 : 0 : 0 : 0

Line 4 overloads the __setitem__ function. This function is used to set the value of
an item in a list, tuple or array. Line 15 calls this function. Line 6 defines the __getitem__
function which retrieves the value of an element in a tuple, list or array. Line 16 calls this
function. Finally, line 8 defines the __str__ function which creates the string that the print
function displays. This function must return a string (line 12). The contents of that string
are defined by the programmer. A call to this function occurs with line 18.
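The comparison operators in Table 13.1 are overloaded in the same manner. The following sketch (the class and attribute names are invented for this example) defines __lt__ so that two instances can be compared with the < symbol:

```python
class Sample:
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        # Called when this object is on the left side of <.
        return self.value < other.value

s1 = Sample(3)
s2 = Sample(7)
print(s1 < s2)   # True
print(s2 < s1)   # False
```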

13.2.5 Inheritance

Inheritance is the ability of one class to adopt the data and attributes of other classes.
An example is shown in Code 13.8. Lines 1 through 6 create a class named Human. This
has a first and last name and the ability to nicely print that information as shown in lines 7
through 11. Line 12 starts the definition of a new class named Soldier which has Human
in parenthesis. This means that all of the data and functions defined in Human are also
in Soldier. Basically, a Soldier is a Human. The programmer needs only to write code in
Soldier for those variables and functions that are unique to a soldier. In this case, only
the rank variable is added. Lines 14 through 16 declare a new soldier and line 17 calls the
function defined in line 4.

Code 13.8 An example of inheritance.

1 >>> class Human :


2 firstname = ' '
3 lastname = ' '
4 def __str__ ( self ) :
5 tstr = ' My Name is : ' + self . firstname + ' ' + self . lastname
6 return tstr
7 >>> p = Human ()
8 >>> p . firstname = ' Howard '
9 >>> p . lastname = ' Jones '
10 >>> print ( p )
11 My Name is : Howard Jones
12 >>> class Soldier ( Human ) :
13 rank = ' Private '
14 >>> s = Soldier ()
15 >>> s . firstname = ' John '
16 >>> s . lastname = ' Smith '
17 >>> print ( s )
18 My Name is : John Smith

Inherited classes are particularly useful for very complex programs. Each class is a
building block, and the ability to inherit allows building blocks to be stacked on top of
others.
A class may inherit multiple classes by separating them with commas in the declara-
tion. For example class NewClass( Class1, Class2) would be used to allow NewClass
to be built from both Class1 and Class2.
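A minimal sketch of this form of multiple inheritance (the class names here are invented and not part of the address book example):

```python
class Address:
    city = ''

class Employment:
    rank = ''

# Employee inherits the data and functions of both parent classes.
class Employee(Address, Employment):
    pass

e = Employee()
e.city = 'Fairfax'
e.rank = 'Analyst'
print(e.city, e.rank)   # Fairfax Analyst
```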

13.2.6 Actively Adding a Variable

One of the features of Python is that it has the ability to add new variables to a class once
the instance has been created. This is shown in Code 13.9 which continues from Code
13.8. Line 1 creates a new variable ssn and attaches it to the current instance of Soldier.
Lines 2 and 3 confirm that this was acceptable. Line 5 creates a new instance of Soldier
and as seen by the error generated from line 6, this instance does not have ssn.
The good news is that new variables can be attached to objects after the object has
been created. The bad news is exactly the same. In languages like C++ a variable
must be declared inside of the class before an instance is created. Thus, the coding shown
in Code 13.9 is not possible there, which is a good way to catch typos during coding.
In the case of Python there is no such safeguard. For example, the variable lastname already
exists, and the case may arise that after marriage a person needs to change their last
name in the database. The programmer could write s.lastname = 'Kershaw' but they

Code 13.9 Creating new variables after the creation of an object.

1 >>> s . ssn = 123456789


2 >>> s . ssn
3 123456789
4

5 >>> t = Soldier ()
6 >>> t . ssn
7 Traceback ( most recent call last ) :
8 File " < pyshell #265 > " , line 1 , in < module >
9 t . ssn
10 AttributeError : ' Soldier ' object has no attribute ' ssn '

could also make a mistake and write s.lasname = 'Kershaw'. In Python a new variable
is created and the old variable is not changed. The typo does not generate an error as it
would in other languages.

Chapter 14

Random Numbers

Random numbers are just what their name implies: numbers generated by a program
that are independent of one another. The second random number has nothing to do with
the first random number.
While the concept is easy, the interesting question is how a computational engine
generates random numbers. There is an entire field of study dedicated to the computational
process of generating purely random numbers. This chapter will review uses of
random numbers and the features of some of the Python functions.

14.1 Simple Random Numbers

The numpy module provides a package of random number generations. The random.rand
and random.ranf functions create random numbers that are equally distributed between
0 and 1. Code 14.1 shows the generation of a single random number.

Code 14.1 A random number.


1 >>> import numpy as np
2 >>> np . random . rand ()
3 0.8368200919472326

This same function can be used to generate a long vector of random numbers as
shown in Line 1 of Code 14.2. This generates 100,000 random numbers. Since they are
evenly distributed between 0 and 1 then the average should be very close to 0.5. This is
shown to be the case in Lines 2 and 3.

Code 14.2 Many random numbers.

1 >>> v = np . random . rand ( 100000 )


2 >>> v . mean ()
3 0.49983470602735963
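These generators are in fact pseudo-random: setting a seed makes the sequence reproducible, which is convenient when a script must be re-run on the same "random" data. A short sketch:

```python
import numpy as np

np.random.seed(42)
a = np.random.rand(5)

np.random.seed(42)          # resetting the seed restarts the sequence
b = np.random.rand(5)

print(np.allclose(a, b))    # True
```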

14.2 Randomness

Are these numbers truly random?


First, if the numbers are evenly distributed then the histogram of the values, shown in
Figure 14.1, should be close to flat. Considering the range of values on the y axis, it can
be seen that this histogram is nearly flat.

Figure 14.1: Histogram of random numbers.

That is not a sufficient test to determine if a set of numbers is truly random. It is
possible that a function can generate a set of random numbers but the generator becomes
repetitive as shown in Figure 14.2 where the pattern repeats after x = 1024. The average
of these numbers is still 0.5 and the histogram is flat, but the generator is not really
generating random numbers.
One way of determining if a function is repetitive is to perform an auto-correlation.
This function computes the inner product for all possible shifts of a function. If a function
is not repetitive (and it is zero-sum) then the auto-correlation will have a single simple
spike because there is only one possible shift of a function with itself in which the values
are self-similar. The auto-correlation is shown in Figure 14.3.
The scipy module offers a correlate function in the signal package. This is shown in
Code 14.3. Line 2 creates a vector of zero-sum random numbers and Line 3 makes a new

Figure 14.2: A repeating function.

Figure 14.3: The auto-correlation of zero-sum random numbers.

vector that has this original vector repeating 10 times. Thus, this is a vector of random
numbers with a repeating sequence. Line 4 performs the auto-correlation.

Code 14.3 A correlation.


1 >>> import scipy . signal as sg
2 >>> a = 2 * np . random . rand ( 15 ) - 1
3 >>> vec = np . array ( 10 * list ( a ) )
4 >>> cr = sg . correlate ( vec , vec )

If the sequence is repeating then there are several shifts of the data that aligns with
the original data. Thus there are several spikes in the auto-correlation as shown in Figure
14.4.

Figure 14.4: The auto-correlation of a repeating sequence.

14.3 Gaussian Distributions

There are other types of random distributions but the only one that is reviewed here is
the Gaussian distribution. These are not evenly distributed between 0 and 1, but are
distributed according to a bell curve function.

14.3.1 Gaussian Function

The Gaussian function is,


y = A e^{−(x−µ)²/(2σ²)} ,    (14.1)

where A is the amplitude, µ is the average and σ is the standard deviation. The average
of a set of numbers is computed by,
µ = (1/N) ∑_{i=1}^{N} x_i ,    (14.2)

where N is the number of samples and the x_i are the samples. The standard deviation is
computed by,

σ = √( (1/N) ∑_{i=1}^{N} (x_i − µ)² ) .    (14.3)

The standard bell curve is shown in Figure 14.5. The amplitude is the height of the
function, the average is the horizontal location, and the standard deviation is related to
the half-width at half-maximum.

Figure 14.5: The Gaussian distribution.[Kernler, 2014]

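Equations (14.2) and (14.3) can be checked directly against NumPy's built-in mean and std functions. A brief sketch with a small hand-picked sample:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

mu = x.sum() / N                            # Equation (14.2)
sigma = np.sqrt(((x - mu)**2).sum() / N)    # Equation (14.3)

print(mu, sigma)              # 5.0 2.0
print(x.mean(), x.std())      # the built-ins agree
```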
14.3.2 Gaussian Distributions in Excel

Figure 14.6 shows the function that computes the Gaussian values in Excel. This is plotted
as shown in Figure 14.7.
Excel requires some inputs from the user to generate a histogram. The procedure
begins in Figure 14.8. On the left is the original data. On the right the user manually
enters the bins for the histogram. The next step is to select Data Analysis as shown in
Figure 14.9. This selection will produce the popup menu shown in Figure 14.10. The
user selects Histogram.
The selection Histogram computes values placed on a new sheet as shown in Figure
14.11. These are the bins and frequencies of the histogram. The plot of these values is
shown in Figure 14.12.

Figure 14.6: The Gaussian distribution in Excel.

Figure 14.7: The Gaussian distribution in Excel.

Figure 14.8: The Gaussian distribution in Excel.

Figure 14.9: Selecting Data Analysis.

Figure 14.10: The popup menu.

Figure 14.11: The results.

Figure 14.12: The plot of the results.

14.3.3 Histogram in Python

Figure 14.13 shows the Python command to compute a histogram. This process is paused
to show the help balloon to assist the user in providing the correct information. Code 14.4
shows the command and the returned results. The x values are the bin values and the y
values are the frequencies.

Figure 14.13: The help balloon.

Code 14.4 A histogram in Python.


1 >>> y , x = np . histogram ( v , 4 , (1 , 5) )
2 >>> x
3 array ([ 1. , 2. , 3. , 4. , 5.])
4 >>> y
5 array ([ 2 , 2 , 2 , 1 ])

14.3.4 Random Gaussian Numbers

The numpy.random package offers the normal function which generates random numbers
based on a Gaussian distribution instead of a flat distribution. The call to the function is
shown in Code 14.5.

Code 14.5 Help on a normal distribution.

1 >>> import numpy as np


2 >>> help ( np . random . normal )
3 Help on built-in function normal :
4

5 normal (...)
6 normal ( loc =0.0 , scale =1.0 , size = None )

Code 14.6 shows the call with three arguments. The first is the location or mean,
the second is the scale or the standard deviation, and the third is the number of random
numbers to be generated. Thus, this call produces 2 random numbers that are based on
the distribution of µ = 2.0 and σ = 1.3.

Code 14.6 A normal distribution in Python.

1 >>> pts = np . random . normal ( 2.0 , 1.3 , 2 )


2 >>> pts
3 array ([ 2.38311333 , 2.25896209])

Code 14.7 is the same call except that it generates 10,000,000 numbers in this dis-
tribution. This is such a large sample that the average and standard deviation of this
sample should match the input parameters. This is the case, as shown in lines 2 through 5.

Code 14.7 A larger distribution.

1 >>> pts = np . random . normal ( 2.0 , 1.3 , 10000000 )


2 >>> pts . mean ()
3 2.0000910856554683
4 >>> pts . std ()
5 1.2995982603835059

14.4 Multivariate Function

In many cases there is more than one input variable. Consider a case where the
investigation concerns human health. The output is the probability of contracting a specific
disease but the input is a list of factors such as:

• Cigarettes,
• Drinking, and
• Exercise.

There is need for a distribution function that has several inputs. As these are difficult to
draw with more than two inputs a simple case is considered. A Gaussian distribution with
two input parameters is shown in Figure 14.14. The two horizontal axes are the inputs
and the vertical axis is the output.

Figure 14.14: A Gaussian distribution in 2D.[Gnu, 2016]

The projection of a 2D Gaussian function is shown in Figure 14.15. Each projection
shows that the complicated function is actually a Gaussian function of each input variable.

Figure 14.15: A Gaussian distribution in 2D.[Bscan, 2013]

The multivariate function is,


y(x⃗) = A e^{ −(1/2) (x⃗−µ⃗)ᵀ Σ⁻¹ (x⃗−µ⃗) } .    (14.4)

This equation is actually similar to Equation (14.1). Both equations have an amplitude
A, and in the exponent both have a factor of −1/2. The (x − µ)² of (14.1) is replaced by the
vector form (x⃗−µ⃗)ᵀ Σ⁻¹ (x⃗−µ⃗), and the 1/σ² is replaced by Σ⁻¹, where Σ is the covariance
matrix. The diagonal elements of Σ are related to the variances of the individual
components. So, Σ_{1,1} is related to the variance of the first variable.
The off-diagonal elements are related to the covariance. So, Σi,j is the covariance
between the i-th and j-th variable. This value is positive if the two variables are linked.
So, if xj goes up when xi goes up then there is a positive covariance. If xj goes down when
xi goes up then there is a negative covariance. If the two variables have nothing to do with
each other then they are independent and their covariance value is 0. The vector µ⃗ controls

the location of the center of the distribution and Σ controls the shape and orientation of
the distribution.
NumPy offers the multivariate_normal function which generates random vectors
based on a multivariate distribution. This is shown in Code 14.8.

Code 14.8 A multivariate distribution in Python.

1 >>> v = np . array ( (0.4 , 0.3) )


2 >>> mat = np . array ( (( 1. , 0) , (0. , 1) ) )
3 >>> pts = np . random . multivariate_normal ( v , mat , 20 )
4 >>> pts . shape
5 (20 , 2)

Code 14.9 displays a small test. The first two lines generate the location and covari-
ance matrix of the distribution. Line 3 generates 100,000 random vectors based on this
distribution. Line 5 computes the covariance matrix based on the generated data which
is similar to the matrix that created the data (Line 2). Likewise, Line 7 computes the
average of the vectors and this matches the generating vector of Line 1.

Code 14.9 Computing the statistics of a large multivariate distribution.

1 >>> v = np . array ( (0.4 , 0.3) )


2 >>> mat = np . array ( (( 1. , 0.2) , (0.2 , 0.5) ) )
3 >>> pts = np . random . multivariate_normal ( v , mat , 100000 )
4 >>> np . cov ( pts . transpose () )
5 array ([[ 0.99951825 , 0.19643719] ,
6 [ 0.19643719 , 0.49675993]])
7 >>> pts . mean (0)
8 array ([ 0.40082911 , 0.3025799 ])

14.5 Examples

This section has several examples that use random number generators.

14.5.1 Dice

Code 14.10 shows a script for simulating rolling a single die. There are six sides each with
an equal chance of being on the up side. So, Line 2 creates the six choices and Line 3
makes a single choice simulating a single roll of the die.
The random.choice function will select one item at random from a list. A
second optional argument is the number of selections to be made. Thus, Line 1 in Code

Code 14.10 Random dice rolls.
1 >>> import numpy as np
2 >>> dice = [1 ,2 ,3 ,4 ,5 ,6]
3 >>> np . random . choice ( dice )
4 2

14.11 simulates the rolling of two dice. Two more examples are shown in the following
lines.
Code 14.11 Random dice rolls.
1 >>> np . random . choice ( dice ,2 )
2 array ([1 , 2])
3 >>> np . random . choice ( dice ,2 )
4 array ([1 , 4])
5 >>> np . random . choice ( dice ,2 )
6 array ([3 , 6])

Code 14.12 rolls two dice 1000 times and captures the sum of each pair of dice.
The histogram of these rolls is shown in Figure 14.16. As seen, it is far more
common to roll a 7 than it is to roll a 2.
Code 14.12 Distribution of a large number of rolls.

1 >>> a = np . zeros ( 1000 )


2 >>> for i in range ( 1000 ) :
3 a [ i ] = np . random . choice ( dice ,2) . sum ()
4 >>> y , x = np . histogram (a ,12 ,[1 ,13] )
5 >>> y
6 array ([ 0 , 27 , 50 , 93 , 109 , 137 , 162 , 141 , 104 , 84 ,
...])
7 >>> x
8 array ([ 1. , 2. , 3. , 4. , 5. , 6. , 7. , 8. , ...])
9 >>> import gnu
10 >>> mat = np . zeros ( (12 ,2) )
11 >>> mat [: ,0] = x [:-1]
12 >>> mat [: ,1] = y
13 >>> gnu . Save ( ' dud . txt ' , mat )

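The simulated counts can be compared with the exact distribution, obtained by enumerating the 36 equally likely ordered pairs of faces. A short sketch:

```python
from itertools import product

# Count how many of the 36 ordered pairs produce each sum.
counts = {}
for d1, d2 in product(range(1, 7), repeat=2):
    s = d1 + d2
    counts[s] = counts.get(s, 0) + 1

for s in sorted(counts):
    print(s, counts[s] / 36.0)
# A sum of 7 occurs with probability 6/36, while 2 and 12 occur with only 1/36.
```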
14.5.2 Cards

This section shows how to create a deck of cards and to shuffle them. Line 1 in Code
14.13 creates a list of the face values of the cards and Line 2 creates a list of the suits.

Figure 14.16: Histogram of rolling 2 dice.

The for loops started in line 4 create the full deck of cards some of which are printed to
the console.

Code 14.13 Random cards.


1 >>> nos = [ ' A ' , ' 2 ' , ' 3 ' , ' 4 ' , ' 5 ' , ' 6 ' , ' 7 ' , ' 8 ' , ' 9 ' , ' 10 ' , ' J ' , ' Q ' , '
K ']
2 >>> suits = [ ' spades ' , ' diamonds ' , ' clubs ' , ' hearts ' ]
3 >>> cards = []
4 >>> for i in nos :
5 for j in suits :
6 cards . append ( i + ' ' + j )
7 >>> cards [:10]
8 [ ' A spades ' , ' A diamonds ' , ' A clubs ' , ' A hearts ' ,
9 ' 2 spades ' , ' 2 diamonds ' , ' 2 clubs ' , ' 2 hearts ' ,
10 ' 3 spades ' , ' 3 diamonds ' ]

The random.shuffle function rearranges the items in the list, which in this case is
equivalent to shuffling the deck. The result is shown in Code 14.14.

14.5.3 Random DNA

This section creates a random string from a finite alphabet. The example is to create a
DNA string and so the alphabet is merely four letters, A, C, G and T.
Code 14.15 shows a method by which this can be done. Line 1 establishes the
alphabet. Line 2 creates 100 random numbers which will determine the 100 random

Code 14.14 Shuffled cards.
1 >>> np . random . shuffle ( cards )
2 >>> cards [:10]
3 [ ' 9 diamonds ' , ' 4 spades ' , ' 8 hearts ' , ' 9 spades ' ,
4 ' 6 spades ' , ' 9 clubs ' , ' 7 clubs ' , ' Q diamonds ' ,
5 ' K hearts ' , ' 3 hearts ' ]

characters. Line 3 converts the random numbers to random integers from 0 to 3. Line
4 extracts from the alphabet the letters according to the positions listed in r. In this case
the first few values in r are [0,2,1,3,1,2,...] and so the first few letters in the string
are AGCTCG.... Line 5 converts the list to a single string.

Code 14.15 Random DNA.


1 >>> abet = list ( ' ACGT ' )
2 >>> r = np . random . rand ( 100 )
3 >>> r = (4* r ) . astype ( int )
4 >>> s = np . take ( abet , r )
5 >>> s = ' ' . join ( s )
6 >>> s
7 ' AGCTCGCTCCACCTGGCATTTCGTGAACCTGCACTCATAGACAT
8 ATATGATTAGGGTTACCTTTTCAAACGGAGTCGCCTGATGACTAC
9 TAGACTCCACC '

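Since np.random.choice was introduced in the dice example, the same random DNA string can be generated more directly. A brief sketch of that alternative:

```python
import numpy as np

abet = list('ACGT')
s = ''.join(np.random.choice(abet, 100))   # 100 letters drawn uniformly
print(len(s))    # 100
print(s[:20])
```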
Problems

1. Compute the average of sets of random numbers. The number of samples in the sets
should be 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 and 4096. Plot the average
of the random values in each set versus the number of samples.

2. Compute the average of 10,000 samples of x² where x represents random numbers.

3. Compute the average of 10,000 samples of √x where x represents random numbers.
Is the result the same as √0.5?

4. Plot the histogram of 10,000 samples from a normal distribution with µ = 0.5 and
σ = 0.3.

5. Plot the histograms of two normal distributions. The first has 10,000 samples with
µ = 0.5 and σ = 0.4. The second has 9,000 samples with µ = 0.3 and σ = 0.2. What
is the value of x where the two distributions cross over?

6. Create a random DNA string with 1000 letters, but the probability of having an ’A’
is twice as much as the other three letters.

7. Create a random amino acid string with 1000 letters.

Chapter 15

Gene Expression Arrays: Python

Chapter 4 demonstrated a method of normalizing gene expression arrays in Excel. Some


of the steps were automated and some of the steps, such as sorting, required user
intervention. Each file required the user to perform several steps and the process was not
fully automated.
A programming language is more versatile and can therefore fully automate this
same process. This chapter will perform the same steps as Chapter 4 but it will do so
using Python scripts. In the end, the user will need to merely provide the file names and
the programming script will perform all of the steps.

15.1 Protocol

This section will repeat the same computations as in Chapter 4 with Python scripts. These
steps are:

1. Load the data,


2. Subtract the background,
3. Compute R/G and I,
4. Compute M and A,
5. LOESS normalization,
6. Plot these values,
7. Normalize,
8. Repeat for all files, and
9. Answer a question.

15.2 A Single File

Code 15.1 displays the LoadExcel function that uses the xlrd module to load directly
from the spreadsheet. There is only one sheet in this workbook and it is named ‘Export’.
This data is collected in line 4. Lines 6 through 10 find the row with the string “Begin
Data” which signifies where the data rows begin.

Code 15.1 The LoadExcel function.


1 # mapython . py
2 def LoadExcel ( fname ) :
3 wb = xlrd . open_workbook ( fname )
4 sheet = wb . sheet_by_name ( ' Export ' )
5 start = -1
6 for i in range ( sheet . nrows ) :
7 row = sheet . row ( i )
8 if ' Begin Data ' == row [0]. value :
9 start = i
10 break
11 ldata = []
12 for i in range ( 1651 ,1651+1600) :
13 row = sheet . row ( i )
14 t = []
15 for j in (0 ,5 ,8 ,9 ,20 ,21) :
16 t . append ( row [ j ]. value )
17 ldata . append ( t )
18 return ldata

The actual reading of the data begins in line 12. Each row is collected and only the
pertinent columns of data are stored which is performed in lines 15 and 16. The result is
a list and each item in this list is a list of six items. These are the gene number, name, channel 1 intensity, channel 1 background, channel 2 intensity and channel 2 background.
There are 1600 rows of data and so efficiency in processing can be gained by putting the last four columns into matrices. The function Ldata2Array shown in Code 15.2
creates two matrices intens and backg. The first has two columns and 1600 rows which
are the measured intensities of the two channels. The matrix backg is the same size and
is the measured background intensities of the two channels.
The next step is to subtract the background from the intensity. However, there are
a few spots that have issues either in construction or detection in which the intensity level
is less than the background. These need to be removed. This process is started in the
function MA shown in Code 15.3. Line 3 performs the subtraction. Line 4 creates the
variable mask which contains binary values. These are 1 for the cases in which the subtraction
produces a positive value and 0 for those few cases in which there is a negative value. Line

200
Code 15.2 The Ldata2Array function.

1 # mapython . py
2 def Ldata2Array ( ldata ) :
3 N = len ( ldata )
4 intens = np . zeros ( (N ,2) )
5 backg = np . zeros ( (N ,2) )
6 for i in range ( N ) :
7 intens [i ,0] = ldata [ i ][2]
8 intens [i ,1] = ldata [ i ][4]
9 backg [i ,0] = ldata [ i ][3]
10 backg [i ,1] = ldata [ i ][5]
11 return intens , backg

5 keeps those values that are positive and replaces the negative values with the value of 1.

Code 15.3 The MA function.


1 # mapython . py
2 def MA ( intens , backg ) :
3 vals = intens - backg
4 mask = vals > 0
5 vals = mask * vals + (1-mask ) *1
6 rg = vals [: ,0]/ vals [: ,1]
7 inte = ( vals [: ,0] + vals [: ,1]) /2
8 M = np . log2 ( rg )
9 A = np . log2 ( inte )
10 return M , A

The process then replicates that in Chapter 4. The next step is to calculate the
ratio R/G and the average I. The log2 of these values creates the values M and A. This
function returns two vectors which are the M and A values for a single file.
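As a quick numeric check of these formulas, the snippet below applies them to three made-up spots; the intensity values are hypothetical, with the background already subtracted:

```python
import numpy as np

# Hypothetical background-subtracted intensities: column 0 is channel 1 (R),
# column 1 is channel 2 (G).
vals = np.array([[800.0, 200.0],
                 [500.0, 500.0],
                 [100.0, 400.0]])

rg = vals[:, 0] / vals[:, 1]           # ratio R/G
inte = (vals[:, 0] + vals[:, 1]) / 2   # average intensity I
M = np.log2(rg)
A = np.log2(inte)
print(M)  # [ 2.  0. -2.]
```

A spot with equal intensities in both channels lands at M = 0, while a four-fold difference in either direction moves it to M = ±2.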
The gnu module provides a function to save the data for a plotting program and
this is called in the Plot function shown in Code 15.4. A matrix named temp is created
to hold the data and this is sent to the Save function for plotting. The result is the same
as shown in Figure 4.8.
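The gnu module is a helper supplied with the book and is not shown here; its Save function evidently writes the matrix as a plain text file that a plotting program such as gnuplot can read. For readers without that module, a minimal stand-in (the name save_for_plot is invented for this sketch) might be:

```python
import numpy as np

def save_for_plot(outname, mat):
    # Write a whitespace-delimited text file, one row per data point.
    np.savetxt(outname, mat)

# Tiny hypothetical A and M vectors in place of real array data.
A = np.array([7.5, 8.0, 8.5])
M = np.array([0.1, -0.2, 0.3])
temp = np.zeros((len(M), 2))
temp[:, 0] = A
temp[:, 1] = M
save_for_plot('plot.txt', temp)
```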
LOESS normalization is performed in the LOESS function shown in Code 15.5.
This follows the process described in Section 4.4 where the first step is to sort the data
according to the values of A. The sort order is obtained in line 5 and the gene numbers
are created in line 4 and sorted in line 6. The values of M are sorted in line 7.
The for loop begins the normalization process. Lines 10 through 15 set up the limits
for local averages with alterations for those cases where the data point is near either the

201
Code 15.4 The Plot function.
1 # mapython . py
2 def Plot ( M , A , outname ) :
3 N = len ( M )
4 temp = np . zeros ( (N ,2) )
5 temp [: ,0] = A
6 temp [: ,1] = M
7 gnu . Save ( outname , temp )

Code 15.5 The LOESS function.


1 # mapython . py
2 def LOESS ( M , A ) :
3 N = len ( M )
4 nmbs = np . arange ( N )
5 ag = A . argsort ()
6 nmbs = nmbs [ ag ]
7 Msort = M [ ag ]
8 Mloess = np . zeros ( M . shape )
9 for i in range ( N ) :
10 before = i-25
11 if before < 0:
12 before = 0
13 after = i + 26
14 if after >= N :
15 after = N-1
16 avg = Msort [ before : after ]. mean ()
17 Mloess [ i ] = Msort [ i ]-avg
18 ag = nmbs . argsort ( )
19 Mloess = Mloess [ ag ]
20 Mloess -= Mloess . mean ()
21 Mloess /= Mloess . std ()
22 return Mloess

202
beginning of the list or the ending of the list. Once the limits are established the average
is computed in line 16 and is subtracted from the value in line 17.
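The sort-and-restore trick used in lines 4 through 7 and lines 18 and 19 of Code 15.5 is worth isolating. With made-up vectors:

```python
import numpy as np

A = np.array([0.3, 0.1, 0.4, 0.2])
M = np.array([10.0, 20.0, 30.0, 40.0])

ag = A.argsort()                  # order that sorts A
nmbs = np.arange(len(A))[ag]      # remember where each value came from
Msort = M[ag]                     # M rearranged into A's sorted order

# ... the local averaging would operate on Msort here ...

restored = Msort[nmbs.argsort()]  # argsort of the permutation inverts it
print(restored)  # [10. 20. 30. 40.]
```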
The final two lines prepare these values for comparison with other files by subtracting
the average and dividing by the standard deviation. This follows the process from Section
4.5. Code 15.6 shows the steps for processing a single file.

Code 15.6 Processing a single file

1 >>> import mapython as mpy


2 >>> fname = ' marray / GSM151667 . xls '
3 >>> ldata = mpy . LoadExcel ( fname )
4 >>> intens , backg = mpy . Ldata2Array ( ldata )
5 >>> M , A = mpy . MA ( intens , backg )
6 >>> Mloess = mpy . LOESS ( M , A )
7 >>> mpy . Plot ( Mloess , A , ' plot . txt ' )

15.3 Multiple Files

With the ability to process a single file to the normalized LOESS values it is possible to
compare values from different files. The first step is to collect the names of the files to
be used. It is assumed that the files are all in a subdirectory and that there are no other
Excel files in that subdirectory. The GetNames function shown in Code 15.7 gathers all of the names in a directory and places those that are Excel files into a list named names. These names come complete with the directory string. The output is a list of
Excel file names.

Code 15.7 The GetNames function.


1 # mapython . py
2 def GetNames ( indir ) :
3 a = os . listdir ( indir )
4 names = []
5 for i in a :
6 if ' . xls ' in i :
7 names . append ( indir + ' / ' + i )
8 return names

The data from all of the files can now be collected. This is performed in the AllFiles
function shown in Code 15.8. The input is the list of names. Inside of the for loop it
loads the data, converts the data to matrices, performs the calculations and places the
M values in a column of the matrix mat. This matrix has 1600 rows and the number of

203
columns is the same as the number of file names in names. The output is a matrix with
all of the normalized values.

Code 15.8 The AllFiles function.


1 # mapython . py
2 def AllFiles ( names ) :
3 N = len ( names )
4 mat = np . zeros ( (1600 , N ) )
5 for i in range ( N ) :
6 print ( names [ i ])
7 ldata = LoadExcel ( names [ i ] )
8 intens , backg = Ldata2Array ( ldata )
9 M , A = MA ( intens , backg )
10 mat [: , i ] = LOESS ( M , A )
11 return mat

Now, the data from all of the files is collected and the user can ask questions of the
data. One example is to collect the genes that are expressed for males but not females.
Only three of the 10 files in the example data set pertain to this question, so it is necessary
to define a variable to designate which columns in mat will be used. This is the list cols
which simply lists the column numbers. It is also necessary to designate if the expressed
value is expected to be greater than 1 or less than -1. Binary values in the list sels provide
this information.
The call for this function is shown in the last line of Code 15.9. The input data is
the output from AllFiles. The second argument indicates that only columns 0, 4 and 8
will be used. The third argument indicates that in the first column the search is for values
of 1 or more, the second column is searched for values of -1 or less, and the last column is
searched for values of 1 or more.
The Select function extracts the needed columns in lines 7 and 8. The loop started
on line 9 begins the search for the desired values. In this case, the matrix temp has three
columns and 1600 rows. The values are 1 if the gene is expressed and 0 otherwise. The
vector tot sums temp horizontally. Line 15 searches for those values that are 2 or higher,
indicating which rows had at least two files that expressed the gene.
Finally, the Isolate in Code 15.10 finds the genes of interest. There are at least two
samples of each gene in the data file and so values are collected according to gene name.
Line 3 creates a new dictionary and the key will be the gene name. The input hits is
the data from Select and is the gene number of those genes that are expressed. The loop
started in line 4 considers each of these. If the gene has been seen before then line 7 is
used to append the gene number to the list in the dictionary entry. If the gene had not
been seen before then line 9 is used to create the dictionary entry.
The search is for cases where the gene is expressed in at least two files for both

204
Code 15.9 The Select function.
1 # mapython . py
2 def Select ( mat , cols , sels ) :
3 answ = []
4 N = len ( mat )
5 M = len ( cols )
6 temp = np . zeros ( (N , M ) )
7 for i in range ( M ) :
8 temp [: , i ] = mat [: , cols [ i ]]
9 for i in range ( M ) :
10 if sels [ i ] == 1:
11 temp [: , i ] = temp [: , i ] > 1
12 else :
13 temp [: , i ] = temp [: , i ] < -1
14 tot = temp . sum (1)
15 hits = ( tot >=2) . nonzero () [0]
16 return hits
17

18 >>> hits = mpy . Select ( data , [0 ,4 ,8] ,[1 ,0 ,1] )

samples of the gene. This is performed in lines 14 through 16. The few results are listed
and these match those from Chapter 4.

205
Code 15.10 The Isolate function.
1 # mapython . py
2 def Isolate ( hits , ldata ) :
3 genes = {}
4 for i in hits :
5 gene = ldata [ i ][1]
6 if gene in genes :
7 genes [ gene ]. append ( i )
8 else :
9 genes [ gene ] = [ i ]
10 return genes
11

12 >>> genes = mpy . Isolate ( hits , ldata )


13 >>> k = genes . keys ()
14 >>> for i in k :
15 if len ( genes [ i ]) >=2:
16 print ( i )
17 protein phosphatase 5 , catalytic subunit
18 ESTs
19 DMSO
20 intercellular adhesion molecule 1 ( CD54 ) , human rhinovirus
receptor
21 phospholipase C , beta 2

206
Part III

Computational Applications

207
Chapter 16

DNA as Data

This chapter reviews some of the basic ideas of DNA and then proceeds to consider programs to read in the standard files. The chapter concludes with a couple of applications.

16.1 DNA

Each cell in any animal or plant contains a vast amount of DNA (deoxyribonucleic acid).
A typical cell contains a nucleus surrounded by cytoplasm as depicted in Figure 16.1.

Figure 16.1: A simple depiction of a cell with a nucleus, cytoplasm, nuclear DNA and mitochondrial
DNA.

Within the nucleus of a human cell reside 22 pairs of chromosomes plus either an XX or an XY pair, depending on the sex. These chromosomes contain strands of DNA which are coiled as a double helix as shown in Figure 16.2. This helix, though, is precisely folded several times so that a large amount of DNA can fit into a tiny cell.
Connecting the two helices are nucleotides of which there are four different variations:

209
Figure 16.2: A caricature of the double helix nature of DNA.

• A: adenine,
• T: thymine,
• C: cytosine, and
• G: guanine.

Each of these are commonly represented by their first letter. Thus, a long strand of DNA
is represented by a long string of letters from a four letter alphabet.
The opposing helix contains the complementary strand. Wherever the first strand
has an A the complement has a T. Likewise, the complement of T is A. The C and G are
also complements. Thus, if the DNA sequence in one strand is known then the sequence
in the complementary strand is also known.
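This pairing rule is easy to express in code. The sketch below builds a reverse complement with str.maketrans, using lowercase letters as in the data files used later; the Complement function in the book's genbank module presumably performs a similar operation:

```python
def revcomp(dna):
    # Swap each nucleotide for its partner (a<->t, c<->g).
    table = str.maketrans('acgt', 'tgca')
    # Reverse so the result reads in the conventional 5' to 3' direction.
    return dna.translate(table)[::-1]

print(revcomp('atggcc'))  # ggccat
```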
Within a single human cell the nucleus contains over 3 billion nucleotides. If this
DNA were unfolded and connected into a single strand it would be about two meters long.
So, complicated folding is absolutely required.
Not all of the DNA is located in the nucleus. Mitochondrial DNA is located in the
cytoplasm. These are short rings of DNA that are inherited from the biological mother.
Bacteria also store their genomes as DNA, as do many (though not all) viruses.
Segments of DNA contain the information needed to create proteins which are long
strands of amino acids. However, vast regions of the DNA are not used for this purpose.
The process of creating proteins begins with the DNA unfolding to expose segments of the
helix. These segments are transcribed, creating short strands of mRNA (messenger RNA)
which escape from the nucleus into the cytoplasm. During this process, thymine is converted
to uracil and so in the string representation T is replaced with U.
Once in the cytoplasm a ribosome attaches to the mRNA and then traverses the strand,
building a protein as depicted in Figure 16.3. In this process, three nucleotides are
translated into an amino acid, and when completed this chain is the protein. The group of
three nucleotides is called a codon. The translation table from codon to amino acid is
shown in Figure 16.4. So, in the image the first codon ACC is used to create T, GAC is
used to create D, etc.
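The translation table is naturally represented as a Python dictionary mapping codons to one-letter amino acid codes. A fragment with only the two codons mentioned above (the complete table has 64 entries):

```python
# A small excerpt of the codon table; the full dictionary has 64 entries.
codons = {'ACC': 'T', 'GAC': 'D'}

seq = 'ACCGAC'
# Translate three nucleotides at a time.
protein = ''.join(codons[seq[i:i+3]] for i in range(0, len(seq), 3))
print(protein)  # TD
```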

210
Figure 16.3: The ribosome travels along the mRNA using codon information to create a chain of
amino acids.

Of course, the process is not at all as simple as this. There are several complications
some of which require intense study to comprehend. One of the major complications is
that the gene may be encoded as splices in the DNA. To create a single gene, several
locations in the DNA are used. Figure 16.5 shows the case where four different splices
(labeled A, B, C and D) are used to create a long strand of mRNA which is then translated
into a single protein. The coding regions are named exons and the intermediate regions
are called introns. Splicing can be even more complicated. It is possible that a gene is
created from exons A, B and D while a different gene is created from exons A, C and D.
Genes can exist on either strand of the helix. Detecting these genes is a science in
itself. Commonly, the beginning of a gene has ATG as the first codon. This is named
the start codon. However, this combination of nucleotides exists throughout the genome
and its presence is most often not a start codon. This codon also codes
for the amino acid methionine. It is also possible that the three nucleotides can exist in
this pattern by fortune. For example one codon may be TAT and the next GCC. This
combination also has the consecutive nucleotides ATG. Finally, this combination can also
exist in a non-coding region. There are other start codons that are possible: GTG and
TTG. Even rarer are ATT and CTG. There are three codons that are considered to be stop
codons: TAG, TGA, and TAA. However, not all coding regions end with a stop codon.
For a contiguous coding region the number of nucleotides between a start and stop
codon should be a multiple of three since there are three nucleotides in a codon. However,
if the gene is constructed from splices then there are intron regions without any restriction
on the number of nucleotides.
The non-coding regions between the genes are not necessarily random either. There
are many regions in which the DNA sequence repeats. The number of nucleotides that
compose a repeating segment varies, the number of repeats vary, and the pattern of the
repeat can also vary. Since these regions are not used in creating genes, mutations are
not devastating to the host. So, these regions are less conserved through evolution. A
mutation occurs when a nucleotide in a child’s DNA has been changed from the parent’s.

211
Figure 16.4: Codon to Amino Acid Conversion

212
Figure 16.5: Spliced segments of the DNA are used to create a single protein.

16.2 Application: Checking Genes

As stated, the length of a non-spliced coding region should be a multiple of three and
this coding region should begin with a start codon and end with a stop codon. Since
bacteria rarely have spliced genes such a genome can be examined. The goal of this
application is to inspect every gene in a genome and capture those that do not have a
length that is a multiple of three or the correct start and stop codons.
The file used is from the Genbank repository and is identified uniquely by an acces-
sion number. These are detailed in Chapter 18. For this application the data has been
extracted from the Genbank file and stored in two files:

• data/AE017199dna.txt contains the DNA string.

• data/AE017199bounds.txt contains the start and stop locations of the genes.

16.2.1 Reading the DNA File

The first file contains the DNA string which has over 490,000 characters. The second file
is a tab delimited file with three columns. This file can be imported by a spreadsheet to
view. The first column is a start location, the second column is the stop location and the
third column is a complement flag. The first row in this file has three values: 883, 2691
and 0. The last value is either 0 or 1 and in this case the 0 indicates that this is not a
complement string. The beginning of the string is at location 883. However, there will be
a discrepancy since the Genbank file starts counting at 1 and Python starts counting at
0. So, after Python has read the string from the file the starting location will actually be
882.
Reading the DNA file is simple as shown in Code 16.1 which contains the LoadDNA
function. It simply reads the text file and returns the contents.
Code 16.2 shows the call to this function in line 2. This is a very long string as
confirmed in line 4. Therefore, the whole string should never be printed to the console.
Users of the IDLE interface will quickly learn that attempting to print such long strings

213
Code 16.1 The LoadDNA function.
1 # simpledna . py
2 def LoadDNA ( dnaname ) :
3 dna = open ( dnaname ) . read ()
4 return dna

will bring the interface to a crawl. Line 5 shows that the loading of the file can be confirmed
by printing out a much smaller portion.

Code 16.2 Using the LoadDNA function.

1 >>> import simpledna


2 >>> dna = simpledna . LoadDNA ( ' data / AE017199dna . txt ' )
3 >>> len ( dna )
4 490885
5 >>> dna [:100]
6 ' tctcgcagagttcttttttgtattaacaaacccaaaacccatagaatttaatga
7 acccaaaccgcaatcgtacaaaaatttgtaaaattctctttcttct '

16.2.2 Reading the Bounds File

Reading the second file requires a bit more programming as it is more than just a single
string of data. Rather this process is that of reading a tab delimited spreadsheet as shown
in Section 8.5.2. Code 16.3 shows the LoadBounds function which reads in the entire file
as a string in line 4. The outer loop started in line 8 considers each row of data and the
inner loop started in line 9 considers each of the three entries in that row. These entries
are converted to integers and appended as a tuple to the list bounds.
The function is called in line 17. To ensure that the read was successful the length
of the list and the first item in that list are printed. So, now both the DNA string and the
information about the locations of the genes have been loaded.

16.2.3 Examining the Data

Line 21 of Code 16.3 shows that the coding region starts at location 883 in the Genbank
data. Since Python starts indexing at 0 instead of 1 the location of the start of the gene
in the DNA string is actually 882. Line 1 in Code 16.4 computes the length of the gene
which is 1809. Line 4 shows that this is divisible by 3 which passes one of the tests for a
gene.
The first codon is printed in line 6 and the last codon is printed in line 8. These
do qualify as a start and stop codon respectively. So, this has the three qualities that are

214
Code 16.3 The LoadBounds function.
1 # simpledna . py
2 def LoadBounds ( boundsname ) :
3 fp = open ( boundsname )
4 rawb = fp . read ()
5 fp . close ()
6 bounds = []
7 bdata = rawb . split ( ' \ n ' )
8 for i in range ( len ( bdata ) ) :
9 if len ( bdata [ i ] ) > 1:
10 start , stop , cflag = bdata [ i ]. split ( ' \ t ' )
11 start = int ( start )
12 stop = int ( stop )
13 cflag = int ( cflag )
14 bounds . append ( ( start , stop , cflag ) )
15 return bounds
16

17 >>> bounds = simpledna . LoadBounds ( ' data / AE017199bounds . txt ' )


18 >>> len ( bounds )
19 535
20 >>> bounds [0]
21 (883 , 2691 , 0)

Code 16.4 Length of a gene.

1 >>> 2691-882
2 1809
3 >>> 1809 % 3
4 0
5 >>> dna [882:885]
6 ' atg '
7 >>> dna [2688:2691]
8 ' taa '

215
sought.
Some of the genes are complements and so it is necessary to convert them to the
complementary string before analysis can be executed. The genbank module does have
a Complement function that can perform this conversion. This function is detailed in
Chapter 18. Code 16.5 imports this module in line 1.

Code 16.5 Considering a complement string.

1 >>> import genbank as gb


2 >>> bounds [1]
3 (2668 , 3189 , 1)
4 >>> cut = dna [2667:3189]
5 >>> comp = gb . Complement ( cut )
6 >>> comp [:3]
7 ' atg '
8 >>> comp [-3:]
9 ' ttt '

The second gene in the data is a complement. Line 3 in Code 16.5 shows that the
last item is a 1 which indicates that this is a complement. The coding portion for this
gene is extracted to a string named cut in line 4. The complement is computed in line 5.
As seen the first codon of comp is a start codon and the last codon is a stop codon.
The function CheckForStartsStops in Code 16.6 performs the three checks on
all genes. The inputs are the DNA string and the list of bounds. Line 4 creates the list
named bad which will capture information about any gene that does not pass the tests.
Information about each gene is obtained in line 6 and the string cut is the coding region
for a single gene. If the complement flag is 1 then line 9 will be used which computes the
complement of the gene. Line 10 determines if the string length is a multiple of three.
This computes the modulus and if m3 is 0 then the length is a multiple of 3.
The start and stop codons are extracted in lines 11 and 12. Line 13 begins a long if
statement. The backslashes at the end of lines 13 and 14 indicate that the line continues
to the next line. This complicated structure determines if the gene does not have a start
codon, stop codon or the length is not a multiple of 3. If any condition fails then line 18
is used and the list bad gets an entry.
Code 16.7 calls CheckForStartsStops and returns a list that contains all genes
that failed the tests. As seen this list has 0 entries and therefore all genes in this bacterium
have a length that is a multiple of 3 and a proper start and stop codon.

216
Code 16.6 The CheckForStartsStops function.

1 # simpledna . py
2 def C he c k Fo r S ta r t sS t o ps ( dna , bounds ) :
3 N = len ( bounds )
4 bad = []
5 for i in range ( N ) :
6 start , stop , cflag = bounds [ i ]
7 cut = dna [ start-1: stop ]
8 if cflag :
9 cut = gb . Complement ( cut )
10 m3 = ( stop-( start-1) ) % 3
11 startcodon = cut [:3]
12 stopcodon = cut [-3:]
13 if m3 ==0 and ( startcodon == ' atg ' or startcodon == ' gtg ' \
14 or startcodon == ' ttg ' ) and ( stopcodon == ' tag ' or \
15 stopcodon == ' taa ' or stopcodon == ' tga ' ) :
16 pass
17 else :
18 bad . append ( ( start , stop , cflag ) )
19 return bad

Code 16.7 The final test.


1 >>> bad = simpledna . C he c k Fo r S ta r t sS t o ps ( dna , bounds )
2 >>> len ( bad )
3 0

217
Problems

1. In the file provided, write a Python script to load the DNA and then count the
number of ATG’s that exist in the data.

2. In the file provided, write a Python script that gathers in a list the location of all of
the ATG’s.

3. In the file provided, write a Python script to gather all of the codons that immediately
precede the ATG’s.

4. Using the spreadsheet, find the longest gene.

5. Write a Python script to load the spreadsheet data and find the shortest gene.

6. Create a dictionary in which the keys are the codons and the values are the associated
amino acid. Write a Python script to convert the first gene from the list of DNA to
a list of amino acids using this dictionary.

218
Chapter 17

Application in GC Content

Some regions in the DNA are rich in cytosine and guanine. These are called GC rich
regions. This chapter will explore methods to locate these regions.

17.1 Theory

The GC content is measured as the number of G's and C's over a finite window length, W,

ρ = (N_C + N_G) / (N_A + N_C + N_G + N_T),    (17.1)

where N_k is the count of nucleotide k over this window. In most cases the denominator
will also be the window size, but there are cases where a nucleotide is known to exist at
a location but the identification of that nucleotide has been difficult to achieve. In those
cases the denominator may be smaller than the window size.
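Equation 17.1 for a single window can be computed directly with string counting; the eight-letter window below is made up for illustration:

```python
window = 'gattacag'  # a hypothetical window of width 8
counts = {n: window.count(n) for n in 'acgt'}
rho = (counts['c'] + counts['g']) / sum(counts.values())
print(rho)  # 0.375
```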
The computation is performed over a window that slides along the DNA sequence
as shown in Figure 17.1. In this example the window width is 8 and so it includes 8
nucleotides. In the first window there are 3 G’s and 2 C’s, and so the value of the GC
content is ρ = 5/8. The step value is 4 and in the next time step the window is moved 4
places to the right and the computation is repeated and it also produces a value of ρ = 5/8.
In the third time step ρ = 4/8, and in the last time step ρ = 5/8. In a real application both
the window and the step sizes are much larger.

17.2 Python Script

The concept is easy to implement in Python code as shown in Code 17.1. The function
GCcontent receives a string of DNA named instr. There are also two additional
arguments that control the window size and the step size. Line 5 considers a substring

219
Figure 17.1: A sliding window with a width of 8 and a step of 4.

of the DNA which is named cut. This is converted to lowercase for further processing.
The next four lines count the number of occurrences of each nucleotide and the ratio is
computed in line 10, appended to a list named answ, and returned to the user.

Code 17.1 The GCcontent function.


1 # gccontent . py
2 def GCcontent ( instr , window =128 , step =32 ) :
3 answ = []
4 for i in range ( 0 , len ( instr ) , step ) :
5 cut = instr [ i : i + window ]. lower ()
6 a = cut . count ( ' a ' )
7 g = cut . count ( ' g ' )
8 c = cut . count ( ' c ' )
9 t = cut . count ( ' t ' )
10 ratio = float ( c + g ) /( a + c + g + t )
11 answ . append ( ratio )
12 return answ

A very simple example is shown in Code 17.2. Line 2 creates a string and line 3 calls
GCcontent. In this case the window size is 8 and the step size is 4. The string does have
a GC rich region towards the beginning but ends in a GC poor finale. These attributes
are reflected in the values returned by the function. As the window passes through the
GC rich region the value of ρ becomes much larger than 0.5, and as the window passes
through the GC poor region the value falls much lower than 0.5.

Code 17.2 Using the GCcontent function.

1 >>> import gccontent as gc


2 >>> data = ' gatactcgactgcgcgcgtagcatgattcgatatatatat '
3 >>> gc . GCcontent ( data ,8 ,4)
4 [0.5 , 0.625 , 0.75 , 0.75 , 0.5 , 0.375 , 0.375 , 0.25 , 0.0 , 0.0]

220
17.3 Application

There are three regions of interest in this application. These are:

ˆ Large non-coding regions,


ˆ Coding regions, and
ˆ Non-coding regions that precede a gene.

The bacterium Mycobacterium tuberculosis has GC rich genes and therefore is a good
genome to use in this process. Two files accompany this experiment. The first is data/nc000962.txt
which contains the DNA for the entire genome, and the second is data/nc000962bounds.txt
which contains the start location, stop location and the complement flag. For this study
there are sufficient genes that the complements will not need to be considered.
Functions for reading these two types of files have already been used elsewhere.
See Codes 16.1 and 16.3. Line 2 in Code 17.3 loads the DNA string. This string has
over 4 million characters and so there should be no attempt to print the entire string
to the console. Line 5 loads the bounds data. This indicates that there are 3906 genes.
Each one has a start location, stop location and a binary value indicating if the gene is a
complement.

Code 17.3 Loading data for mycobacterium tuberculosis.

1 >>> import simpledna


2 >>> dna = simpledna . LoadDNA ( ' data / nc000962 . txt ' )
3 >>> len ( dna )
4 4411532
5 >>> bounds = simpledna . LoadBounds ( ' data / nc000962bounds . txt ' )
6 >>> len ( bounds )
7 3906
8 >>> bounds [0]
9 (1 , 1524 , 0)

In this application the following steps are considered:

1. Find locations of long non-coding regions.


2. Collect GC content values over these regions and compute the average and standard
deviation over these values.
3. Collect GC content over the non-complement genes. A more thorough study would
also use the complements, but as there are almost 4000 genes this is not required
here.
4. Compute the average and standard deviation over the GC content values in these
coding regions.
5. Collect GC content data for the 50 nucleotides that precede the coding region.

221
6. Compute the average and standard deviation over these values.
7. Compare the statistics for these three designated regions.

17.3.1 Non-Coding Regions

In this part of the application the goal is to obtain the GC content over large non-coding
regions. These regions are defined as beginning at the end of one gene to the beginning of
the subsequent gene. There are two caveats. The first is that the 50 bases preceding a gene
will not be considered since they will be considered in the third part of this application.
The second is that the regions must have a minimum length which is arbitrarily set to
128.
Function Noncoding shown in Code 17.4 receives the input DNA string and the
bounds data. Line 5 gets the end of one gene and line 6 gets the beginning of the next
gene. This distance needs to be at least 178 bases since the 50 bases in front of a gene
are to be excluded and the remainder needs to be at least 128 bases. The cut is the
non-coding DNA between these two regions and line 9 retrieves the GC content values
over a sliding window and puts these in a growing list.

Code 17.4 The Noncoding function.

1 # gccontent . py
2 def Noncoding ( indna , bounds ) :
3 answ = []
4 for k in range ( len ( bounds )-1) :
5 stop = bounds [ k ][1] # stop of first gene
6 start = bounds [ k +1][0] # start of next gene
7 if start-stop > 178:
8 cut = indna [ stop : start-50]
9 answ . extend ( GCcontent ( cut ) )
10 return answ

Gathering the average and standard deviation of these values is easily done as shown
in the function StatsOf shown in Code 17.5. The list of values is converted to a vector
and then the statistics are returned.

Code 17.5 The StatsOf function.


1 # gccontent . py
2 def StatsOf ( inlist ) :
3 vec = np . array ( inlist )
4 return vec . mean () , vec . std ()

Code 17.6 shows the operation and results.

222
Line 1 gathers the GC content information over the non-coding regions and line 2 returns the average and standard deviation.
As seen the GC content in the non-coding regions is actually quite a bit higher than 0.5.

Code 17.6 The statistics from the non-coding regions.

1 >>> a = gc . Noncoding ( dna , bounds )


2 >>> gc . StatsOf ( a )
3 (0.62839984195030407 , 0 . 0 8 0 3 4 1 9 7 7 7 2 2 9 3 5 5 1 2 )

17.3.2 Coding Regions

The second part of the application is to compute the same statistics over the coding
regions. Since there are plenty of genes the complements will not be considered.
Function Coding shown in Code 17.7 extracts the GC content values from sliding
windows over coding regions. Line 6 ensures that this coding region has a sufficient length
and is not a complement. Code 17.8 shows that there are over 60,000 such values extracted
and that the average is 0.656.

Code 17.7 The Coding function.

1 # gccontent . py
2 def Coding ( indna , bounds ) :
3 answ = []
4 for k in range ( len ( bounds ) ) :
5 start , stop , cflag = bounds [ k ]
6 if cflag == 0 and stop-start > 128:
7 answ . extend ( GCcontent ( indna [ start : stop ] ) )
8 return answ

Code 17.8 The statistics from the coding regions.

1 >>> a = gc . Coding ( dna , bounds )


2 >>> len ( a )
3 61356
4 >>> gc . StatsOf ( a )
5 (0.65595538924800023 , 0 . 0 6 2 5 4 0 4 9 5 2 0 8 7 4 8 8 9 5 )

17.3.3 Preceding Region

This part of the application is to consider the regions just in front of the coding regions.
Again the complements will not be considered since there is plenty of data.

Code 17.9 shows the Precoding function that extracts the GC content factors from
sufficiently long regions in front of non-complement genes. Line 8 ensures that the region
has at least 50 bases and is not a complement. Code 17.10 runs this test and extracts the
average and standard deviation.

Code 17.9 The Precoding function.

1 # gccontent . py
2 def Precoding ( indna , bounds ) :
3 answ = []
4 for k in range ( len ( bounds )-1) :
5 stop = bounds [ k ][1] # stop of first gene
6 start = bounds [ k +1][0] # start of next gene
7 cflag = bounds [ k +1][2]
8 if start-stop >50 and cflag ==0:
9 cut = indna [ start-50: start ]
10 answ . extend ( GCcontent ( cut ) )
11 return answ

Code 17.10 The statistics from the pre-coding regions.

1 >>> a = gc . Precoding ( dna , bounds )


2 >>> gc . StatsOf ( a )
3 (0.59666666666666512 , 0 . 0 9 4 8 6 9 9 7 1 4 5 5 5 7 9 6 9 6 )

17.3.4 Comparison

The final step is to compare the distributions of GC contents from the different regions.
This is accomplished by plotting the Gaussian distributions for the three cases. These are
shown in Figure 17.2.
The distributions are relatively close which means that there is no drastic difference
between the regions. The smallest average corresponded to the precoding region and the
largest average corresponded to the coding region. In the search for coding regions in a
large genome the GC content fluctuation could be an indicator.
It should also be noted that in this genome all averages are above 0.5. That means
that the entire genome is GC rich. This is not the case in other genomes. GC content is
another metric that can be used to compare contents of differing genomes as well.
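For readers who wish to reproduce Figure 17.2, the curves can be generated directly from the reported statistics. The sketch below is illustrative (the name Gauss is not from the book's modules); it evaluates a Gaussian for each of the three (mean, std) pairs printed in Codes 17.6, 17.8 and 17.10, and uncommenting the matplotlib lines produces the plot.

```python
import numpy as np

def Gauss(x, mu, sigma):
    # Gaussian probability density evaluated over the vector x
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(0.3, 1.0, 200)
curves = {'non-coding': Gauss(x, 0.628, 0.080),
          'coding':     Gauss(x, 0.656, 0.063),
          'precoding':  Gauss(x, 0.597, 0.095)}

# import matplotlib.pyplot as plt
# for lbl, y in curves.items():
#     plt.plot(x, y, label=lbl)
# plt.xlabel('GC content'); plt.legend(); plt.show()
```

Note that the coding-region curve has the tallest, narrowest peak since it has the smallest standard deviation.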

Figure 17.2: Gaussian distributions of the three cases.

Problems

1. Does the size of the sliding window affect the gathered statistics? Repeat the GC content
measures for all three regions but use a sliding window that is half the size of the original.
Answer the question by comparing your results to those printed in Section 17.3.4.
2. Does the step size affect the gathered statistics? Repeat the GC content measures for all
three regions with a step size that is half of the original. Compare your results to those in
Section 17.3.4 to answer the question.
3. The previous chapter used data for AE017199. Compute the GC content over the three
regions for this genome and compare to the data in Section 17.3.4.
4. In the coding regions did the G or C dominate? For each gene compute the ratio G/C to
answer the question.
5. The coding regions consist of codons which are three nucleotides. Is the distribution of G’s
and C’s the same for all codon positions? To answer this question count the G’s and C’s for
each of the three positions in the codons for all of the genes.
6. Do the complement genes have a different distribution of GC content values? Compute the
GC content over the complement genes. Compare the average and standard deviation of
these values to the non-complement genes.

Chapter 18

DNA File Formats

Large databases of DNA information are being collected by several institutes. In the
US the large repository is Genbank hosted by the National Institutes of Health (http://
www.ncbi.nlm.nih.gov/Genbank/index.html). The concern of this chapter is to develop
programs capable of reading the files that are stored in three of the most popular formats:
FASTA, Genbank, and ASN.1.

18.1 FASTA Files

The FASTA format is extremely simple but it contains very little information aside from
the sequence. A typical FASTA format is shown in Figure 18.1. The first line contains
a small header that may vary in content. In this case the accession number and name of
species and chromosome number are given. Some files may have comment lines after the
first line that begin with a semicolon. The rest of the file is simply the DNA data.
Code 18.1 shows the commands needed to read in this file. The first version shown
opens the ‘NC 006046.fasta.txt’ (retrieved from [NC0, 2011]), reads the data, and closes
the file. The second version performs all three in a single command. The readlines
function will read all of the data and return a list. Each item in the list is a line of text
ending in a newline character. In the FASTA file there is a newline character at the end
of the header and one at the end of each line of DNA.

Figure 18.1: FASTA file example.

Code 18.1 Reading a file.

1 # version 1
2 >>> fp = open ( ' data / nc_006046 . fasta . txt ' )
3 >>> a = fp . readlines ()
4 >>> fp . close ()
5 # version 2
6 >>> a = open ( ' data / nc_006046 . fasta . txt ' ) . readlines ()

Code 18.2 shows the first few elements in the list. Lines 1-3 show the header
information. The rest of the items in list a are the lines of DNA characters.

Code 18.2 Displaying the contents.

1 >>> a [0]
2 ' > gi |50428312| ref | NC_006046 .1| Debaryomyces hansenii CBS767
chromosome D ,
3 complete sequence \ n '
4 >>> a [1]
5 ' CCTCTCCTCTCGCGCCGCCAGTGTGCTGGTTAGTATTTCCCCAAACTTTCTTCGAAT
6 GATACAACAATCA \ n '
7 >>> a [2]
8 ' CACATGACGTCTACATAGGAGCCCCGGAAGCTGCATGCATTGGCGGCTGATGCGTCA
9 GTGCCAGTGCTCA \ n '

As can be seen each line ends with the newline character \n. So, the only tasks
remaining are to combine all of the DNA lines into a long string and to remove the
newline characters. Combining strings in a list is performed by the join function (see
Code 6.38). The join function combines all but the first line of data, and the empty quotes
indicate that there are no characters placed between each line of DNA. Code 18.3 joins the
strings and removes the newline characters.

Code 18.3 Creating a long string.

1 >>> dna = ' ' . join ( a [1:] )


2 >>> dna = dna . replace ( ' \ n ' , ' ' )

In this case the DNA string is 1,602,771 characters long. Basically, it takes only
three lines of Python code to read a FASTA file and extract the DNA. In actuality it
could take only one line, as shown in Code 18.4. However, such code does not increase the
speed of the program and is much more difficult to read, so it should actually be avoided.

Code 18.4 Performing all in a single command.

1 >>> dna = ( ' ' . join ( open ( ' data / nc_006046 . fasta ' ) .
2 readlines () [1:]) ) . replace ( ' \ n ' , ' ' )

Figure 18.2: Genbank file example.

18.2 Genbank Files

Genbank files are text-based files that contain considerably more information than
FASTA. Genbank files contain information about the source of the data, the researchers
that created the file, the publication where it was presented, the DNA, the proteins, repeat
regions, and more. However, some of these items are optional and not every file contains
every possible type of data.
Genbank files are text files and can be viewed with text editors, word processors, or
even the IDLE editor. It is worth the time to load a file and examine its contents.

18.2.1 File Overview

Figure 18.2 shows the first few lines of a Genbank file (accession NC 006046). The first four
lines display the locus identification, the definition of the file, the accession number and
the version. As can be seen the capitalized keywords are followed by the data and each
entry ends with a newline.
As there are many items in this file this chapter will not develop code to extract all
of them. Instead code will be developed to extract the most important items which will
demonstrate how the rest of the items can be extracted. While it is possible to develop code
to completely automate the entire reading process, a different approach is adopted here. It
is quite possible that a user only wants a small part of the file (just the DNA information
for example), and so functions will be built to extract the individual components. These
functions can be called individually or the user could easily build a driver program to call
the desired functions.
The ReadFile function is shown in Code 18.5. Line 3 opens a file from the given
file name and Line 4 reads the data. Line 6 returns the contents of the file as a single long
string. Line 8 shows an example call to the function.

Code 18.5 The ReadFile function.
1 # genbank . py
2 def ReadFile ( filename ) :
3 fp = open ( filename )
4 data = fp . read ()
5 fp . close ()
6 return data
7

8 >>> data = gb . ReadFile ( ' data / AB001339 . gb . txt ' )

Figure 18.3: The DNA section of a Genbank file.

18.2.2 Parsing the DNA String

The DNA information is the last entry in the file although it consumes more than half of
the file. In this example the DNA information starts around line 15,394 of this file which
contains 42,110 lines of text. The first four lines at the beginning of the DNA section and
the final four lines are shown in Figure 18.3.
The word ‘ORIGIN’ begins the DNA section and each line contains six sections of
10 bases. The last line may be incomplete and the final line of the file is two slashes. In
order to extract the DNA several steps are necessary. First, this information needs to be
taken from the file. Second the line numbers need to be removed. Third, the groups of 10
bases need to be combined into a long string.
There are many functions in the genbank module and so they are not reprinted in
this chapter. Only the calls to the functions are shown. However, readers should feel free
to examine the codes at their leisure.
The function ParseDNA extracts the DNA string from the file and removes the
first column and blank spaces. It returns a long string of just the DNA characters as seen
in Code 18.6. Usually, these strings are very long, including this example which is over 3
million characters. These should not be printed to the console in their entirety. However,
it is possible to print just a portion.

Code 18.6 Calling the ParseDNA function.

1 >>> dna = gb . ParseDNA ( data )


2 >>> len ( dna )
3 3573470
4 >>> dna [:10]
5 ' ggcgcgccat '
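ParseDNA itself resides in the genbank module; a simplified sketch of the three steps described above (locate the section, strip the line numbers, join the groups of bases) might look like the following. This is an illustration, not the module's exact code:

```python
def ParseDNA(data):
    # isolate the block between the ORIGIN keyword and the closing slashes
    start = data.find('ORIGIN')
    end = data.find('//', start)
    dna = ''
    for line in data[start:end].split('\n')[1:]:
        # drop the leading base-count column and join the groups of bases
        dna += ''.join(line.split()[1:])
    return dna
```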

18.2.3 Keywords

Consider the data in this file starting at line 60 shown in Figure 18.4. It indicates that
there is a gene which begins at location 2657 and ends at location 3115. This particular
gene is on the opposing strand of the double helix and so the data in this file is the
complement of the gene. This is actually an mRNA and other annotations are provided.
This is not the complete list of information that is available. Some files will list genes
and their protein translations for example. This optional information will be explored in
a later section.

Figure 18.4: Information on an individual gene.

This section is concerned with the ability to identify the location of the gene in-
formation in the file. Obviously the information begins with the keyword gene and so it
should be identified. In this file the keyword mRNA is used, but in other files there are
different keywords depending on the type of data. Some files indicate repeating regions, gaps,
etc. Thus, it is necessary to find any type of keyword and then extract the information
following it. Words used as keywords may also be used elsewhere in the file. For example
‘gene’ is commonly found in other locations. The keywords in the file are preceded by
five space characters and followed by several space characters, depending on the length of the
keyword. When the word ‘gene’ is used elsewhere in the file it is not preceded and followed
by multiple spaces. The default keyword should be ‘ CDS ’ or ‘ gene ’ including the spaces
before and after the characters.

The function FindKeywordLocs finds all of the locations of the keyword in the
data stream. The function can receive a second argument if the user wishes to change
the keyword. It returns a list of integers that are the locations of the keyword in the long
string named data. As seen in Code 18.7 there are 3169 such locations indicating that
this file has 3169 genes. Line 6 prints out the first 100 characters from the first location.
As seen it starts with spaces and CDS.

Code 18.7 Using the FindKeywordLocs function.

1 >>> keylocs = gb . FindKeywordLocs ( data )


2 >>> len ( keylocs )
3 3169
4 >>> keylocs [:10]
5 [2534 , 3235 , 3814 , 4382 , 5124 , 5977 , 6818 , 7759 , 8687 , 10033]
6 >>> data [2534:2634]
7 ' CDS <1..772\ n / gene =" ispB
"\ n / note =" ORF_ID : slr0 '

18.2.4 Extracting Genes

A gene is a coding region in the DNA. It has at least one starting location and an ending
location. The data may be in its complement form or the coding DNA may be composed
of splices. The code developed in this section will extract the locations of the coding
sequences and an indication if it is a complement. There is a small difference between
Python and Genbank indexing. The first DNA base in the Genbank file is at location 1
as shown in Code 18.6. Python, however, uses the index 0 for the first location. So, the
locations of the coding splices will differ from the Python strings by a value of 1.
The line that follows the keyword ‘ CDS ’ has a few different forms as shown in
Figure 18.5. The first example is merely a start and end location of the coding sequence.
The second example is a complemented string. The third demonstrates a splice in which
the coding sequence is found in two sections. The fourth is a complemented splice. The
fifth example shows multiple splice locations. The final example shows a ‘>’ symbol which
indicates that the exact location is not known. There is no limit on the number of splices
that a coding sequence can have. Thus, a function that is to extract the locations of the
coding region(s) for a single gene needs to be able to handle all of these situations.
For the purpose of extracting gene locations the complement flag only needs to be
noted as the actual complement operation is performed later. The beginning and ending of
a splice location are two numbers separated by two periods. When there is a complement
or a join the first and last splice will have a parenthesis.
In regions where there are splices the entry will start with the word ’join’ and then
a parenthesis. Inside of the parenthesis each splice is denoted with two numbers separated
by two periods and the splices are separated by commas. The number of splices is the
number of two-period combinations.

Figure 18.5: Indications of complements and joins.
Several functions are needed to dissect the information. These are not shown here
but only reviewed. The first is EasyStartEnd which is called if the particular gene has
no splices. The Splices function is called if the gene has splices in it. Both functions
read the information just after the keyword and extract the locations of the splices. It is
necessary that both functions return the same format for their information. Thus, a gene
without splices must return the start and stop location as a single splice.
The output of both functions is a list for a single gene. This list has two items. The
second is a binary flag for a complement. If the gene is a complement then this flag is
True. The first item is a list of splices. Each splice is a tuple of start and stop locations. A
gene with no splices is still encoded as a list with a single tuple inside.
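While EasyStartEnd and Splices are left in the module, the same dissection can be sketched in one function with a regular expression. ParseLocation below is an illustrative alternative, not the module's code; it returns the (splices, flag) format just described:

```python
import re

def ParseLocation(line):
    # return ([(start, stop), ...], complement_flag) for one location entry
    compf = 'complement' in line
    # each splice is two integers separated by two periods,
    # possibly preceded by the '<' or '>' uncertainty symbols
    pairs = re.findall(r'[<>]?(\d+)\.\.[<>]?(\d+)', line)
    return [(int(a), int(b)) for a, b in pairs], compf
```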
The function that the user calls is GeneLocs . This will receive the output from
FindKeywordLocs and the output is a list of lists. The call is shown in line 1 of Code
18.8. As seen the number of items in the list is the number of genes. The first item is
shown and it is not a complement. It has a single start and stop location as well. Had it
been a splice then there would have been other tuples within the inner list. This file is a
bacteria and has no spliced genes.

Code 18.8 Results from GeneLocs.


1 >>> genelocs = gb . GeneLocs ( data , keylocs )
2 >>> len ( genelocs )
3 3169
4 >>> genelocs [0]
5 ([(1 , 772) ] , False )

18.2.5 Coding DNA

Another keyword that is used is ‘complement’. If the flag compf is True then the comple-
ment of the DNA string needs to be computed. This is accomplished by swapping ‘T’ and
‘A’ as well as swapping ‘C’ and ‘G’. Finally, the string needs to be reversed. Code 18.9
shows the Complement function which swaps the letters and reverses the string.

Code 18.9 The Complement function.

1 # genbank . py
2 def Complement ( st ) :
3 table = st . maketrans ( ' acgt ' , ' tgca ' )
4 st1 = st . translate ( table )
5 st1 = st1 [::-1]
6 return st1

The coding DNA is extracted by the GetCodingDNA function which is used in


Code 18.10. This receives the DNA string and one of the items from the list genelocs. In
this example the first item is used. It returns a string that is the DNA just from the
coding region. If it is a complement then the complement is returned. If there are splices then
the splices are joined together into a single string.

Code 18.10 Calling the GetCodingDNA function.

1 >>> cdna = gb . GetCodingDNA ( dna , genelocs [0] )


2 >>> len ( cdna )
3 772
4 >>> cdna [:10]
5 ' ggcgcgccat '
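A sketch of GetCodingDNA that is consistent with this behavior is shown below. It is an illustration rather than the module's exact code, and it reuses the Complement function of Code 18.9:

```python
def Complement(st):
    # repeated from Code 18.9
    table = st.maketrans('acgt', 'tgca')
    return st.translate(table)[::-1]

def GetCodingDNA(dna, geneloc):
    # geneloc is one item from genelocs: (list of splices, complement flag)
    splices, compf = geneloc
    cdna = ''
    for start, stop in splices:
        # Genbank counts bases from 1 while Python indexes from 0
        cdna += dna[start - 1:stop]
    if compf:
        cdna = Complement(cdna)
    return cdna
```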

18.2.6 Extracting Translations

The Translation function retrieves the amino acid sequence for a given gene. Just after
the keyword are several entries of which one is designated as translation. Following this
is the amino acid sequence.
This function is shown in Code 18.11. It receives the data string and a single keyword
location. Line 3 searches for the word translation and then the following lines find the
beginning and ending of the amino acid strings. It removes newline characters in lines 6
and 7. Unix systems use a single character for a newline but MS-Windows systems use
two and so two lines of code are needed to remove these. Blank spaces are removed in line
8 and the return is a string with the amino acid sequence. The word translation has 11
characters and is preceded by a forward slash. There is also an equals sign and a quote that
follows. Thus, there are 14 characters from the slash to the beginning of the data.
Lines 4 and 5 reflect this distance with the numerals in the code.

Code 18.11 Using the Translation function.

1 # genbank . py
2 def Translation ( data , loc ) :
3 trans = data . find ( ' / translation ' , loc )
4 quot = data . find ( ' " ' , trans + 15 )
5 prot = data [ trans +14: quot ]
6 prot = prot . replace ( ' \ n ' , ' ' )
7 prot = prot . replace ( ' \ r ' , ' ' )
8 prot = prot . replace ( ' ' , ' ' )
9 return prot
10

11 >>> aaseq = gb . Translation ( data , keylocs [0] )


12 >>> len ( aaseq )
13 256
14 >>> 256*3
15 768
16 >>> aaseq [:10]
17 ' ARHRRLAEIT '

The information after a keyword can have several entries, but not all files have the
same type of data. These can be read by writing a function similar to Translation
with modifications to lines 3, 4 and 5.

18.3 ASN.1

The ASN.1 format is another format that contains several different types of information
about the data. In this file information is encapsulated within curly braces. The first part
of a file is shown in Code 18.12. The sequence data starts with a ‘{’ and ends with a ‘}’
(which is not shown in the code). Inside of this set of braces are other items. Shown in
the code is the id data, and inside of it is other and general. There are several different
types of entries in this file but only the data will be examined here.
In this particular file the DNA data starts as shown in Code 18.13. The actual DNA
string starts just after ncbi2na, but the data is compressed to reduce file size.
For DNA there are only four different letters that are used (A,C,G,T) and thus it
is inefficient to store each letter as a single byte of information. Since there are only four
items it only takes two bits to encode them. Table 18.1 shows the encoding used in ASN.1

Code 18.12 The opening lines of an ASN.1 file.

1 Seq-entry ::= seq {


2 id {
3 other {
4 accession " NC_006046 " ,
5 version 1} ,
6 general {
7 db " NCBI_GENOMES " ,
8 tag
9 id 435 } ,
10 gi 50428312 } ,
11 ...

Code 18.13 The DNA section in an ASN.1 file.


1 inst {
2 repr delta ,
3 mol dna ,
4 length 1606296 ,
5 ext
6 delta {
7 literal {
8 length 131072 ,
9 seq-data
10 ncbi2na ' 0369 F F D 5 5 D 5 7 F F F D 6 4 3 F E A A A A A A A A A A 0 0 2 A 4 0 0 0 1
11 5556 A 8 3 F C 0 5 5 5 D F 9 5 4 F D 2 A E E A 4 1 E B E A 8 2 A A 8 3 A A A A A 9 F F D B C F
12 C5505AA802AA3F9F42AA3C2FA80652ABFF52D4A6FE0018694
13 9501 F 8 3 9 D 8 A A 5 9 5 2 E E 7 8 0 2 0 4 A 8 1 2 E 3 C E 1 2 B 1 F 7 0 E B E 6 B A 0 D 8 E
14 8231 F 0 0 8 E A 8 5 F B 2 2 B C 7 3 E 3 4 0 D A 0 4 7 7 7 4 C 2 1 C 2 F D F C 0 C F 0 3 F 2 0
15 E3CFCFC834C3F30839376FB7FCC0BB02E77EF7F0C3312C417
16 7 FCFF3B44E22C2E28CC2008E72FD7C3001F38833F0FCEF3EC
17 B16978B8C2D3714AFE105E8D0642001D44CC514FD43D84C8B
18 A1F62D0FC1AC3FF8E8CCB18541D4F5B24190858E0824C48FF
19 D7C7FE2DF7C5D07FE31F6CEE7E70FF4C36D683EBF85840DF8
20 5 F87CF4F77412F9093E37A31E7DDCDB4DE4CFD83148

files. Thus, a string such as ATTG would be encoded as 00111110. This binary string is
converted to a standard hexadecimal string according to Table 18.2. The binary string
00111110 would then be converted to 3E.
Table 18.1: Binary representation of nucleotides.

DNA Code
A 00
C 01
G 10
T 11

Decoding the ASN.1 format is just the reverse of this process. The lookup table is
created as a Python dictionary as in function DecoderDict shown in Code 18.14. The
codes in this section are stored in the file asn1.py.

Table 18.2: Binary to hexadecimal.

Binary Hexadecimal Binary Hexadecimal


0000 0 1000 8
0001 1 1001 9
0010 2 1010 A
0011 3 1011 B
0100 4 1100 C
0101 5 1101 D
0110 6 1110 E
0111 7 1111 F

Code 18.14 The DecoderDict function.


1 # create a decoding dictionary
2 def DecoderDict ( ) :
3 ddct = {}
4 ddct [ ' 0 ' ] = ' AA ' ; ddct [ ' 1 ' ] = ' AC ' ; ddct [ ' 2 ' ] = ' AG '
5 ddct [ ' 3 ' ] = ' AT ' ; ddct [ ' 4 ' ] = ' CA ' ; ddct [ ' 5 ' ] = ' CC '
6 ddct [ ' 6 ' ] = ' CG ' ; ddct [ ' 7 ' ] = ' CT ' ; ddct [ ' 8 ' ] = ' GA '
7 ddct [ ' 9 ' ] = ' GC ' ; ddct [ ' A ' ] = ' GG ' ; ddct [ ' B ' ] = ' GT '
8 ddct [ ' C ' ] = ' TA ' ; ddct [ ' D ' ] = ' TC ' ; ddct [ ' E ' ] = ' TG '
9 ddct [ ' F ' ] = ' TT '
10 return ddct
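The encoding direction can be sketched as well. The function below (EncodeNcbi2na is an illustrative name, not part of asn1.py) applies Tables 18.1 and 18.2 directly, and its output is the inverse of the DecoderDict mapping:

```python
def EncodeNcbi2na(dna):
    # two bits per base (Table 18.1), then each group of four bits
    # becomes one hexadecimal digit (Table 18.2)
    code = {'A': '00', 'C': '01', 'G': '10', 'T': '11'}
    bits = ''.join(code[b] for b in dna.upper())
    return ''.join('%X' % int(bits[i:i + 4], 2)
                   for i in range(0, len(bits), 4))
```

For example, EncodeNcbi2na('ATTG') returns '3E', matching the worked example above.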

Code 18.15 shows the reader function DNAFromASN1 and the call to it. In Line
7 the function finds ‘ncbi2na’ and then the single quotes that follow it. These quotes
surround the compressed DNA string. The string is extracted in Line 10 and the newlines

are removed in Line 11. Line 15 considers each letter in the string and uses the dictionary
to look up the conversion.

Code 18.15 The DNAFromASN1 function.


1 def DNAFromASN1 ( filename , ddct ) :
2 # read in data
3 fp = open ( filename )
4 a = fp . read ()
5 fp . close ()
6 # extract DNA
7 loc = a . find ( ' ncbi2na ' )
8 start = a . find ( " ' " , loc ) +1
9 end = a . find ( " ' " , start +2)
10 cpdna = a [ start : end ] # compressed dna
11 cpdna = cpdna . replace ( ' \ n ' , ' ' )
12 # decode
13 dna = ' '
14 for i in range ( len ( cpdna ) ) :
15 dna += ddct [ cpdna [ i ] ]
16 return dna
17

18 >>> dna = DNAFromASN1 ( ' c20 / nc_006046 . asn1 ' , ddct )


19 >>> dna [:100]
20 ' CCTCTCCTCTCGCGCCGCCAGTGTGCTGGTTAGTATTTCCCCAAACTTTCTTCGAAT
21 GATACAACAATCACACATGACGTCTACATAGGAGCCCCGGAAG '

The ASN.1 format also contains the locations of coding regions. One example is
shown in Code 18.16. This is also very easy to extract. By simply finding keywords such
as ‘location’, ‘from’, and ‘to’ the beginning and end of a coding region can be extracted.
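A sketch of such an extraction is shown below. The function name CodingLocations is illustrative; it simply collects every from/to pair with a regular expression:

```python
import re

def CodingLocations(text):
    # gather (from, to) integer pairs from the ASN.1 location entries
    pairs = re.findall(r'from\s+(\d+)\s*,\s*to\s+(\d+)', text)
    return [(int(a), int(b)) for a, b in pairs]
```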

18.4 Summary

DNA information is stored in several formats. Two of the most popular are FASTA and
Genbank. The FASTA files are very easy to read and this takes only a few lines of code.
The Genbank files are considerably more involved and store significantly more information
beyond the DNA sequence. They can store identifying information, publication and author
information, proteins, identified repeats and much more. Thus, reading these files requires
a bit more programming. These programs, however, are not complicated.

Code 18.16 DNA locations within an ASN.1 file.
1 ...
2 comment " tRNA Asp ( GTC ) cove score =60.37 " ,
3 location
4 int {
5 from 177641 ,
6 to 177712 ,
7 strand plus ,
8 id
9 gi 294657026 } ,
10 ...

Problems

1. Write a Python script that can extract all of the sequences from the file Synechocystis.fasta.txt.
The output of the function should be a list and each item in the list is a
string (without header information) for a single gene.

2. Write a Python function that can extract the protein id information from a Gen-
bank file.

Chapter 19

Principal Component Analysis

Data generated from experiments may contain several dimensions and be quite compli-
cated. However, the dimensionality of the data may far exceed the complexity of the
data. A reduction in dimensionality often allows simpler algorithms to effectively ana-
lyze the data. The most common method of data reduction in bioinformatics is principal
component analysis.

19.1 The Purpose of PCA

Principal component analysis (PCA) is an often used tool that reduces the dimensionality
of a problem. Consider the following three vectors,
x_1 = {2, 1, 2}
x_2 = {3, 4, 3}          (19.1)
x_3 = {5, 6, 5}.
Each vector is in three dimensions, R3 , and therefore a three-dimensional graph would be
needed to plot the data. However, the first and third elements are the same in each vector.
The third element does not have any new information, in that if the first element is known
then the third element is exactly known. Even though the data is in three dimensions the
information contained in the data is in, at most, two dimensions.
Of course, this can be expanded to larger problems. Quite often a single biological
experiment can produce a lot of data, but due to time and costs, only a small number of
experiments can be run. So there are few data vectors that have a lot of elements. The
dimensionality of the data is large, but the dimensionality of the information is not. So,
PCA is a very useful tool that reduces the dimensionality of the data without damaging
the dimensionality of the information.
Conceptually, PCA is not a difficult task as it merely rotates and shifts the coordi-
nates to provide an optimal view of the data. Consider the two dimensional data shown in

Figure 19.1(a). In this example, there are five vectors each with a dimension of two. The
PCA algorithm will shift the data so that the average is located at the center of the coor-
dinate system and then rotate the coordinate system to minimize the covariance between
data in different coordinates. This is explained in more detail subsequently. Figure
19.1(b) shows the old coordinate system (the lines at an angle) and the new coordinates
system. Figure 19.1(c) shows the data after the transformation.

Figure 19.1: Rotating data to remove one of the dimensions. (a) The original data in R²;
(b) rotating the coordinate system; (c) the same data in a new coordinate system.

The property of the data is that it is centered in the coordinate system and the
covariance is minimized. In this case, that minimization found a rotation in which one of
the axes is no longer important. All of the data has the same y value and therefore only
the x axis is important. The two dimensional data has been reduced to one dimension
without loss of information, as the points still have the same relative position to each
other.
The dimensionality can be reduced when one coordinate is very much like another or
a linear combination of others. This type of redundancy becomes evident in the covariance
matrix which has the ability to indicate which dimensions are dependent on each other.

19.2 Covariance Matrix

PCA minimizes the covariance within a data set, and this information is contained within
a covariance matrix. This matrix contains information about the relationships of the
different elements in the input vectors which is also information about the proper view of
the data.

19.2.1 Introduction to the Covariance Matrix

Consider the data in Figure 19.2 which consists of four data vectors each with five elements.
The variance (σ²) and standard deviation (σ) of each column are shown. The variance
indicates the spread of the data from the mean value.

Figure 19.2: A small data set.

The variance, however, only provides information for the elements individually as
the variance in the first columns is not influenced by the data in the other columns. The
purpose of the covariance is that it relates one column to another. Basically, if the data
in two columns are positively correlated (when one goes up in value so does the other)
then the covariance is positive. If the data in the two columns are negatively correlated
then the covariance is negative. If the data in the two columns are independent then the
covariance should be zero.
The covariance is defined as,

    c_ij = (x_i − μ) · (x_j − μ) ,          (19.2)

where μ is the mean of all of the data vectors and the elements c_ij define the covariance
matrix C. The covariance value c_1,3 links the data in column 1 with the data in column
3 as shown in Figure 19.3.

Figure 19.3: Linking data in two columns.
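This definition can be checked directly against numpy's cov function. The helper below is a sketch that averages Equation (19.2) over the samples with the same N − 1 normalization that np.cov uses by default:

```python
import numpy as np

def Covariance(data):
    # data holds one vector per row; the columns are the elements
    mu = data.mean(axis=0)
    d = data - mu
    # divide by N-1 to match np.cov's default normalization
    return np.dot(d.T, d) / (len(data) - 1)
```

Applied to the three vectors of Equation (19.1), the result matches np.cov of the transposed data.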

Consider a case of 1000 random valued vectors of length 5. Since the data is random
there are no links between the different elements. Thus, the covariance values should be
close to 0. Code 19.1 shows the creation of this data in line 1. Line 2 uses the cov function
to compute the covariance matrix. The diagonal elements relate to the variances of the
individual elements, whereas the off-diagonal elements relate to the covariances. As seen

the off-diagonal elements are much closer to 0 than are the diagonal elements. Such data
is considered to be independent as activity in one element is not related to activity in the
other elements.

Code 19.1 The covariance matrix of random data.


1 >>> import numpy as np
2 >>> a = np . random . ranf ( (1000 ,5) ) *10
3 >>> np . cov ( a . transpose () )
4 array ([[ 8.477 , -0.105 , 0.322 , -0.061 , -0.251] ,
5 [-0.105 , 8.224 , 0.074 , 0.256 , 0.165] ,
6 [ 0.322 , 0.074 , 8.174 , 0.11 , -0.486] ,
7 [-0.061 , 0.256 , 0.11 , 8.002 , 0.229] ,
8 [-0.251 , 0.165 , -0.486 , 0.229 , 8.811]])

A second example is shown in Code 19.2. In this case, the third column is somewhat
related to the first column from the code in line 1. This is slightly different than the data
in Equation (19.1) in that the two columns are not exactly the same but they are related.
The covariance matrix is computed and as seen the off-diagonal elements for C1,3 and
C3,1 are much larger than the other off-diagonal elements indicating that there is a strong
relation between the first and third elements of a vector. In fact, these values rival the
magnitude of the diagonal elements, which indicates that this relationship is quite strong.
The fact that the elements are positive indicates that the first and third elements rise and
fall in value in unison.

Code 19.2 The covariance matrix of modified data.


1 >>> a [: ,2]= a [: ,0]+0.25* np . random . rand (1000)-.125
2 >>> np . cov ( a . transpose () )
3 array ([[ 8.477 , -0.105 , 8.474 , -0.061 , -0.251] ,
4 [-0.105 , 8.224 , -0.097 , 0.256 , 0.165] ,
5 [ 8.474 , -0.097 , 8.477 , -0.061 , -0.258] ,
6 [-0.061 , 0.256 , -0.061 , 8.002 , 0.229] ,
7 [-0.251 , 0.165 , -0.258 , 0.229 , 8.811]])

19.2.2 An Example

The covariance matrix of actual data provides insight into the inherent first-order rela-
tionships. Consider the case of a covariance matrix of the codon frequencies of a genome.
When creating a gene the DNA is considered in groups of three which are named codons.
Since there are four letters in the DNA string, there are 64 different combinations of three
letters. Thus, there are 64 different codons. A codon frequency vector is the frequency of
each codon in a single gene.

In this example, all of the genes of sufficient length from the genome of Ureaplasma
parvum serovar (accession AF222894) are converted to codon frequency vectors. Genes
needed to be of sufficient length in order for the codon frequency vector to have meaning.
After this culling there were 560 codon frequency vectors, and from these the covariance
matrix was computed. This created a 64 × 64 matrix which is too large to display as
numerical values. Instead the values are converted to pixel intensities and displayed in
Figure 19.4. This is a 64 × 64 image in which the brighter pixels indicate larger values.

Figure 19.4: Pictorial representation of the covariance matrix with white pixels representing the
largest values.

Regions that have bright pixels indicate that there is a positive covariance value. Of
course, there are positive values along the diagonal since those represent the variances of
the individual codons. The darker values indicate negative covariances, where the popularity
of one codon opposes that of another. Each column (or row) in this image is associated with
a codon. Those columns with gray values indicate that the frequency of the associated
codon is independent of the other codons. Those columns with many bright or dark regions
indicate that the associated codon has a frequency relationship with the other codons.

19.3 Eigenvectors

The PCA computation will compute the eigenvectors and eigenvalues of the covariance
matrix and so this section reviews the theory of eigenvectors. The standard eigenvector-
eigenvalue equation is,
A~vi = µi~vi , (19.3)

where A is a square, symmetric matrix, ~vi is a set of eigenvectors and µi is a set of
eigenvalues where i = 1, ..., N and the matrix A is N × N . On the left hand side there is a
matrix times a vector and the result of that is a vector. On the right hand side is a scalar
times a vector which also produces a vector and, of course, the computations from both
sides must be equal. Thus, if the eigenvectors and values are known then the computation
on the right hand side is an easy way of finding the solution to the left hand side. This
equation produces a set of eigenvectors and eigenvalues. So, this equation is true for N
vectors and their associated values.
The numpy package provides an eigenvector solution engine. Code 19.3 creates a
matrix A that is square and symmetric (which emulates the type of matrices that will be
used in the PCA analysis). Line 9 calls the eig function to compute both the eigenvalues
and eigenvectors. Since A is 3 × 3 there are three values and vectors. The eigenvectors
are returned as columns in a matrix. Lines 16 through 19 show that Equation (19.3) holds
for the first eigenvalue-eigenvector pair, and similar tests would reveal that it also holds
for the other two pairs.

Code 19.3 Testing the eigenvector engine.

1 >>> import numpy as np
2 >>> np.set_printoptions( precision=3 )
3 >>> d = np.random.ranf( (3,3) )
4 >>> A = np.dot( d, d.transpose() )
5 >>> A
6 array([[ 0.796,  0.582,  0.622],
7        [ 0.582,  0.456,  0.506],
8        [ 0.622,  0.506,  0.588]])
9 >>> evl, evc = np.linalg.eig( A )
10 >>> evl
11 array([ 1.774,  0.062,  0.004])
12 >>> evc
13 array([[ 0.656,  0.698,  0.284],
14        [ 0.505, -0.127, -0.853],
15        [ 0.560, -0.704,  0.436]])
16 >>> np.dot( evc[:,0], A )
17 array([ 1.165,  0.896,  0.993])
18 >>> evl[0]*evc[:,0]
19 array([ 1.165,  0.896,  0.993])

If the matrix A is real-valued, square and symmetric then the eigenvectors are
orthonormal. This means that each vector has a length of 1 (normal) and that each vector
is perpendicular to all of the other vectors (orthogonal). This, in fact, is the definition of
a coordinate system. Code 19.4 shows a couple of examples. Line 1 computes the dot
product of an eigenvector with itself; the value of 1 indicates that the length is 1. Line
3 computes the dot product of two different eigenvectors, and since that value is 0 the two
vectors are known to be orthogonal to each other.

Code 19.4 Proving that the eigenvectors are orthonormal.

1 >>> np.dot( evc[:,0], evc[:,0] )
2 1.0
3 >>> np.dot( evc[:,0], evc[:,1] )
4 0.0
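The orthonormality of all three eigenvectors can be checked at once: for a real symmetric matrix, the matrix of eigenvectors multiplied by its own transpose should be the identity. A minimal sketch, not from the book's scripts:

```python
import numpy as np

np.random.seed(0)
d = np.random.rand(3, 3)
A = np.dot(d, d.transpose())        # real, square, symmetric

evl, evc = np.linalg.eig(A)

# Columns of evc are the eigenvectors; if they are orthonormal
# then evc.T @ evc is the 3x3 identity matrix.
gram = np.dot(evc.transpose(), evc)
print(np.allclose(gram, np.eye(3)))
```

This single comparison performs all of the pairwise dot products of Code 19.4 in one step.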

19.4 Principal Component Analysis

The logic of PCA (principal component analysis) is to diagonalize the covariance matrix.
In doing so, the elements of the data become uncorrelated. If there are first-order relationships
within the data then this new representation will often display these relationships
more clearly than the original representation. Diagonalization of the covariance matrix is
achieved by mapping the data through a new coordinate system.
The protocol for PCA is,

1. Compute the covariance of the data.


2. Compute the eigenvectors and eigenvalues of the covariance matrix.
3. Determine which eigenvectors to keep.
4. Project the data points into the new space defined by the eigenvectors.

Consider the data set which consists of 1000 vectors in R3 . The distribution of data
is along a diagonal line passing through (0,0,0) and (1,1,1) with a Gaussian distribution
about this line centered at (0.5, 0.5, 0.5) with standard deviations of (0.25, 0.05, 0.25) in
the respective dimensions. Two views of the data are shown in Figure 19.5.

(a) y vs x. (b) z vs x.

Figure 19.5: Two views of the data set.
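A data set like the one just described can be synthesized directly. The construction below is an assumption about how such data might be generated (it is not the book's own script): points are placed along the diagonal and axis-aligned Gaussian noise is added.

```python
import numpy as np

np.random.seed(1)
N = 1000
# Points along the line from (0,0,0) to (1,1,1).
t = np.random.rand(N)
line = np.outer(t, np.ones(3))
# Gaussian spread with the stated per-axis standard deviations.
noise = np.random.normal(0.0, (0.25, 0.05, 0.25), (N, 3))
data = line + noise
print(data.mean(0))   # roughly (0.5, 0.5, 0.5)
```

The mean of the result sits near (0.5, 0.5, 0.5), matching the description of the data set.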

The first eigenvector is (-0.501, -0.508, -0.700) which defines a line that follows the

long axis of the data. This is shown in Figure 19.6. This is the axis that has the minimal
covariance, quite similar to the example shown in Figure 19.1.

Figure 19.6: First principal component.

Removing this component from the data is equivalent to viewing the data along the
barrel of that axis, which is shown in Figure 19.7. Now the second and third axes can be
determined. Both are perpendicular to the first and to each other. The second axis will
be along the longest distribution of this data and the third axis must be perpendicular to
it. Each axis attempts to accomplish the feat shown in Figure 19.1.

Figure 19.7: Second and third principal components.

PCA uses eigenvectors to find the axes along the data distributions and in doing
so tends to diagonalize the covariance matrix. It should be noted that these axes depend
solely on first-order information. Higher-order information is not detected, as discussed
in Section 19.6.

19.4.1 Selection

The computation will produce N eigenvectors where N is the original number of dimensions.
So, at this stage there is no reduction in the number of dimensions. The eigenvalues
indicate which eigenvectors are the most important, and the eigenvectors are usually listed
in order of eigenvalue magnitude. A typical plot of eigenvalues is shown in Figure 19.8 where
the y axis is the magnitude of the eigenvalues. Those eigenvalues that are small are related to
eigenvectors that are less important, and it is these eigenvectors that can be discarded.
The choice of how many eigenvectors to keep is up to the user and is based on how
sharply the curve bends and how much error the user can allow.

Figure 19.8: The first 20 eigenvalues.

Some computational systems, like Matlab, return the eigenvalues in order of magnitude.
This is not necessarily so in Python. The computation tends to produce
the eigenvectors and eigenvalues in that order, but in some cases it does not. So, it is
important to look at the values of the eigenvalues before selecting which
eigenvectors to keep.
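Sorting explicitly is a one-liner with argsort. This sketch (not the book's code) orders both the eigenvalues and their eigenvectors by descending magnitude before any are discarded:

```python
import numpy as np

np.random.seed(2)
d = np.random.rand(5, 5)
A = np.dot(d, d.transpose())
evl, evc = np.linalg.eig(A)

# Indices that sort |eigenvalue| in descending order.
order = np.abs(evl).argsort()[::-1]
evl = evl[order]
evc = evc[:, order]          # keep columns matched with their values

# Now it is safe to keep only the first D eigenvectors.
D = 2
kept = evc[:, :D]
```

Keeping the columns of evc matched with the reordered values is the important step; sorting the eigenvalues alone would break the pairing.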

19.4.2 Projection

The new coordinate system is defined as the eigenvectors that are kept. Once the new
coordinate system is defined it is necessary to map the data to the new system. Since
dot products are also projections they are used to perform the mapping. For a single
data vector the location in the new coordinate system is the dot products with all of the
eigenvectors,
zi = ~vi · ~x, ∀i. (19.4)

Here the i-th eigenvector is ~vi and ~x represents one of the data vectors. The output is a
vector ~z which is the location of the data point in the new space. This equation can be
applied to the data used in creating the covariance matrix as well as to other data. So, once
the coordinate system is defined it is quite possible to place non-training data in the new
space.
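Equation (19.4) applied to every data vector at once is simply a matrix product. The sketch below (an illustration, not the book's dimredux code) projects both the training data and a new, unseen vector with the same eigenvectors:

```python
import numpy as np

np.random.seed(3)
data = np.random.rand(100, 5)
cv = np.cov((data - data.mean(0)).transpose())
evl, evc = np.linalg.eig(cv)

# Each row of z holds the dot products of one data vector
# with all of the eigenvectors: the location in the new space.
z = np.dot(data, evc)

# A vector not used to build the covariance matrix can be
# placed in the same space with the same eigenvectors.
newvec = np.random.rand(5)
znew = np.dot(newvec, evc)
```

The single call np.dot(data, evc) replaces the loop over individual dot products.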

19.4.3 Python Codes

All of the parts are in place and so the next step is to create a cohesive program. The PCA
function shown in Code 19.5 receives the data matrix and the number of dimensions to
keep. The data matrix, mat, contains the original data in its rows. The covariance matrix
is computed in line 4 and the eigenvectors are computed in line 5. The coefficients, cffs,
are the locations of the data points in the new space. The input D is the number of
dimensions that the user wishes to keep. The eigenvectors that are associated with the
D largest eigenvalues are kept in a matrix named vecs. These are used to compute the
location of the data points in line 14. This function returns the cffs matrix, in which each
row is the new location of a data point, and vecs, the eigenvectors that were used.

Code 19.5 The PCA function.

1 # dimredux.py
2 def PCA( mat, D=2 ):
3     a = mat - mat.mean(0)
4     cv = np.cov( a.transpose() )
5     evl, evc = np.linalg.eig( cv )
6     V, H = mat.shape
7     cffs = np.zeros( (V,D) )
8     ag = abs(evl).argsort()
9     ag = ag[::-1]
10     me = ag[:D]
11     for i in range( V ):
12         k = 0
13         for j in me:
14             cffs[i,k] = ( mat[i] * evc[:,j] ).sum()
15             k += 1
16     vecs = evc[:,me].transpose()
17     return cffs, vecs

The PCA function determines the new location of the data that was used in training.
However, data not used in creating the PCA space can also be projected into this new
space. This projection is similar to the projection of the training data as shown in line
14 of Code 19.5. However, the computation of the eigenvectors is not required, as the
eigenvectors from PCA will be used.
Code 19.6 shows the Project function which maps vectors into the new space. The inputs
are the eigenvectors returned from PCA and the new data vectors which
are stored in datavecs. This variable can be a tuple, list or matrix in which the data
is contained in the rows. The output is a new matrix named cffs which contains the
locations of only the data vectors that were in datavecs.

Code 19.6 The Project function.

1 # dimredux.py
2 def Project( evecs, datavecs ):
3     ND = len( datavecs )
4     NE = len( evecs )
5     cffs = np.zeros( (ND,NE) )
6     for i in range( ND ):
7         a = datavecs[i] * evecs
8         cffs[i] = a.sum(1)
9     return cffs

19.4.4 Distance Tests

The projection of the points into a new space should not rearrange the points. The only
change is that the viewer is looking at the data from a different angle. Thus, the distances
between pairs of points should not change. This idea then makes a good test to determine
if the projection has changed the relationship among the data points. To demonstrate this
point the function AllDistances is used. This is shown in Code 19.7 and measures the
Euclidean distance between all pairs of vectors. In the case where there are N vectors, the
number of pairs is (N² − N)/2.

Code 19.7 The AllDistances function.

1 # dimredux.py
2 def AllDistances( data ):
3     answ = []
4     for i in range( len(data) ):
5         for j in range( i ):
6             answ.append( np.sqrt( ((data[i]-data[j])**2).sum() ) )
7     answ = np.array( answ )
8     return answ

Given a set of data with 20 vectors, each with 10 elements that have an average
value of 89, the data is mapped into PCA space as shown in line 1 of Code 19.8. Line 2
computes the distances between all pairs of points in the original data and line 3 computes
the same for all pairs of points in the PCA space. Since the PCA projection is merely
a shift and rotation, none of the distances should change. Lines 4 and 5 show that the
maximum difference is a very small number that is below the precision of the computation.
This shows that none of the distances between any pair of data points changed in the
projection.

Code 19.8 The distance test.

1 >>> cffs, vecs = dmr.PCA( data, 10 )
2 >>> a = dmr.AllDistances( data )
3 >>> b = dmr.AllDistances( cffs )
4 >>> abs(a-b).max()
5 3.7400651962116171e-06

19.4.5 Organization in PCA

Consider the image shown in Figure 19.9, which is from the Brodatz image set that has
been used as a library for texture recognition engines. Each row in this image is treated
as a vector input to the PCA process. Line 2 of Code 19.9 loads this image as a
matrix. Lines 3 through 5 complete the PCA process using only the first two eigenvectors.

Figure 19.9: Image D72 from the Brodatz image set.

The original image is 640 × 640 thus producing 640 vectors in a 640 dimensional
space. The matrix ndata is the projection of that data to a two dimensional space. These
points are plotted in Figure 19.10. Each point represents one of the rows in the original
image. The top row is associated with the point at (-584, -66).
The original image had the quality that consecutive rows had considerable similarity.
This is evident in the PCA plot as consecutive points are nearest neighbors. The
line connecting the points shows the progression from the top row of the image to the
bottom. The clump of points to the far left is associated with the bright knothole in the
image. This feature of similarity leads to an example that demonstrates that most of the
information is contained within the first few dimensions of the PCA space.

Code 19.9 The first two dimensions in PCA space.

1 >>> fname = 'data/D72.png'
2 >>> data = sm.imread( fname, flatten=True )
3 >>> cv = np.cov( data.transpose() )
4 >>> evl, evc = np.linalg.eig( cv )
5 >>> ndata = np.dot( data, evc[:,:2] )

Figure 19.10: The points projected into R2 space.

In this example, the rows of the original image will be shuffled into a random
order. These shuffled vectors are then projected into PCA space. The next step will
find the nearest neighbors in the PCA space and use that information to reconstruct the
original image.
The image rows are shuffled by the ScrambleImage function shown in Code 19.10.
Line 5 creates a vector of random values, and the argsort in line 6 converts it into a randomly
ordered list of row indexes. This creates the random order in which the rows will be
arranged. The variable seedrow is the new location of the first row of the image. This
will be used to start the reassembly process. The scrambled image is shown in Figure 19.11.

Code 19.10 The ScrambleImage function.

1 # pca.py
2 def ScrambleImage( fname ):
3     mgdata = sm.imread( fname, flatten=True )
4     V, H = mgdata.shape
5     r = np.random.rand( V )
6     ag = r.argsort()
7     sdata = mgdata[ag]
8     seedrow = list(ag).index(0)
9     return sdata, seedrow

Figure 19.11: The scrambled image.

Each row is considered as a vector and the PCA process is used to remap these
vectors into a new data space. Line 1 in Code 19.11 scrambles the rows and line 2 projects
this scrambled data into a PCA space. The data points in this PCA space are in the same
locations as in Figure 19.10, but the lines cannot be drawn between the points.

Code 19.11 The process of unscrambling the rows.

1 >>> sdata, seedrow = pca.ScrambleImage( fname )
2 >>> ndata = pca.Project( sdata )

Code 19.12 shows the function Unscramble which performs the reconstruction of
the image. The inputs are the scrambled data, sdata, the location of the first row of the
image in the scrambled data, seedrow, and the projected data, ndata. Currently, all 640
dimensions are contained in ndata but these will be restricted in subsequent examples.
The variable udata will become the unscrambled image, and its first row is placed in line
5. The list unused maintains the rows that have not yet been placed in udata, so the
seed row is removed from it in line 7. The variable k tracks which row is selected to be placed
into udata.

Code 19.12 The Unscramble function.

1 # pca.py
2 def Unscramble( sdata, seedrow, ndata ):
3     V, H = sdata.shape
4     udata = np.zeros( (V,H) )
5     udata[0] = sdata[seedrow] + 0
6     unused = list( range(V) )
7     unused.remove( seedrow )
8     nndata = ndata + 0
9     k = seedrow
10     for i in range( 1, V ):
11         dist = np.sqrt( ((nndata[k]-nndata[unused])**2).sum(1) )
12         ag = dist.argsort()
13         k = unused[ag[0]]
14         udata[i] = sdata[k]
15         unused.remove( k )
16     return udata

Line 11 computes the Euclidean distance from a specified row to all of the other
unused rows. In the first iteration k is seedrow, so this computes the distance from the
first image row to all of the other rows. However, this is using the projected data. Basically,
it is finding the closest point in the PCA space shown in Figure 19.10. However, the plot
shown in the figure displays only 2 dimensions out of 640.
Line 12 finds the smallest distance and thus finds the vector that is closest to the k
row. The corresponding row of data is then placed in the next available row in udata and
the vector in PCA space is removed from further consideration in line 15.
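The greedy nearest-neighbour chaining at the heart of Unscramble can be seen on a toy example (hypothetical 2-D points, not the image rows):

```python
import numpy as np

# Four points on a line, listed out of spatial order.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [2.0, 0.0]])

order = [0]              # start from a seed point
unused = [1, 2, 3]
while unused:
    # Distance from the last chained point to every unused point.
    dist = np.sqrt(((pts[order[-1]] - pts[unused]) ** 2).sum(1))
    k = unused[dist.argmin()]
    order.append(k)
    unused.remove(k)

print(order)   # [0, 2, 3, 1] -- the points in spatial order
```

Each step appends the nearest remaining point, which is exactly what lines 11 through 15 of Code 19.12 do with the projected image rows.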

In the first example, all 640 dimensions of the projected space are used. Thus, there
should be absolutely no loss of information. The call to the function is shown in line 1 of
Code 19.13. The output udata is an exact replicate of the original image.

Code 19.13 Various calls to the Unscramble function.

1 >>> udata = pca.Unscramble( sdata, seedrow, ndata )
2 >>> udata = pca.Unscramble( sdata, seedrow, ndata[:,:7] )
3 >>> udata = pca.Unscramble( sdata, seedrow, ndata[:,:2] )

However, not all of the dimensions in the PCA space are required. Consider the plot
of the first 20 eigenvalues shown in Figure 19.8. When the data is organized, the eigenvalues
fall off rapidly, indicating the relative importance of each eigenvector. Commonly, the number of
eigenvectors to use is determined by the location of the elbow in this curve.
Line 2 in Code 19.13 reconstructs the image using only 7 of the 640 eigenvectors.
The result is nearly perfect reconstruction with only a few rogue lines at the bottom of
the image. The result is shown in Figure 19.12. This used the data points projected into
an R7 space and then computed the Euclidean distances between the projected points in
that space. The few rows at the bottom were probably rows that were skipped during
reconstruction.

Figure 19.12: Reconstruction using only 7 dimensions.

Line 3 in Code 19.13 reconstructs the image using only 2 of the 640 eigenvectors. The
result is shown in Figure 19.13. As seen there are a few more errors in the reconstruction,
but most of the reconstruction is intact. This is not a surprise since more than two
eigenvalues had significant magnitude in Figure 19.8. However, even with this extreme
reduction in dimensionality most of the image could be reconstructed. This indicates that
even in the reduction from 640 dimensions to 2 a significant amount
of information was preserved. In some applications of PCA this loss of information
is not significant to the analysis that is being performed.

Figure 19.13: Reconstruction using only 2 dimensions.

19.4.6 RGB Example

Data for this example starts with the image in Figure 19.14. This image is 480 × 640 and
each pixel is represented by 3 values (RGB). The data set is thus 307,200 vectors of length
3. The task in this example is to isolate the blue pixels. At first this sounds like a simple
task which it is for humans. However, since the blue pixels have a wide range of intensities
performing this task with RGB data is not as simple.

Figure 19.14: An input image.

It is possible to contrive an equation that will attempt this isolation,

m = 1 if b/(g+1) > 1.5 and b/(r+1) > 1.5, and m = 0 otherwise. (19.5)
The pixels isolated by this process are shown in Figure 19.15. The LoadRGBchannels
function in Code 19.14 loads the image and returns three matrices representing the red,
green, and blue components. The IsoBlue function performs the attempt at isolating the
blue pixels from Equation (19.5).

Figure 19.15: An attempt at pixel isolation.

Code 19.14 The LoadRGBchannels and IsoBlue functions.

1 # pca.py
2 def LoadRGBchannels( fname ):
3     data = sm.imread( fname )
4     r = data[:,:,0]
5     g = data[:,:,1]
6     b = data[:,:,2]
7     return r, g, b
8
9 def IsoBlue( r, g, b ):
10     ag = b/(g+1.0) > 1.5
11     ab = b/(r+1.0) > 1.5
12     isoblue = ag * ab
13     return isoblue

The data is in R3 and this is shown in Figure 19.16(a), where the green points are
those isolated in Figure 19.15 and the red points are the other pixels. Figure 19.16(b)
shows the first two axes of this plot. As seen there is a separation of the isolated pixels
from the others and thus finding a discrimination surface is possible. It should also be
noted that the green points in the plot are those in Figure 19.15, which is not solely the
set of blue pixels.

(a) Map in R3 . (b) Map in R2 .

Figure 19.16: Displaying the original data. The green points are those denoted in Figure 19.15.

The first two axes of the PCA projection of this data are shown in Figure 19.17. As
seen, the plane that divides the two sets of data is almost horizontal. Recall, however, that
the green points are not actually the set of blue pixels but an estimate.

Figure 19.17: First two axes in PCA space.

The horizontal plane is around x2 = 0.45 and so the next step is to just gather those
points in the new space in which x2 ≥ 0.45 (where x2 is the data along the second axis).
Figure 19.18 shows the qualified points and clearly this is a better isolation of the blue
pixels.
The mapping to PCA space did not drastically change the data. It did, however,
represent the data in a manner such that a simple threshold (only one axis) could isolate
the desired data.
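With synthetic stand-in data (the real image pixels are not reproduced here), the projection-then-threshold step looks like the sketch below; the 0.45 cutoff is the value read from Figure 19.17 and both it and the random data are assumptions for illustration.

```python
import numpy as np

np.random.seed(5)
rgb = np.random.rand(1000, 3)        # stand-in for the 307,200 RGB vectors

cv = np.cov((rgb - rgb.mean(0)).transpose())
evl, evc = np.linalg.eig(cv)
order = np.abs(evl).argsort()[::-1]
coords = np.dot(rgb, evc[:, order])  # PCA coordinates of every pixel

# Keep only the points whose second PCA coordinate passes the threshold.
mask = coords[:, 1] >= 0.45
selected = rgb[mask]
```

The entire isolation step reduces to one boolean comparison along a single PCA axis, which is the point of the example.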

Figure 19.18: Points isolated from a simple threshold after mapping the data to PCA space.

19.5 Describing Systems with Eigenvectors

Consider a system that contains a state vector that is altered in time through some sort of
process, such as the changes in protein populations within a cell. Each element of the state
vector describes the population of a single protein at a particular time. As time progresses
the populations change, which is described as changes in the state vector.
Eigenvectors are a useful tool for describing the progression of a state vector in an
easy to read format. In this case the state vector is v[t] and the machine that changes the
state vector is a simple matrix M. In reality the machine that changes the state vector
can be far more complicated than a linear transformation described by a single matrix.
The progress of the system is then expressed as,

~v [t + 1] = ~v [t] + M~v [t]. (19.6)

Code 19.15 runs the system for 20 iterations, storing each state vector as a row in
data. Line 2 subtracts the column means from M, forcing each column to sum to zero
so that the matrix does not induce energy into the system.

Code 19.15 Running a system for 20 iterations.

1 >>> M = np.random.ranf( (5,5) )
2 >>> M = M - M.mean(0)
3 >>> data = np.zeros( (20,5), float )
4 >>> data[0] = 2*np.random.rand(5) - 1
5 >>> for i in range( 1, 20 ):
6         data[i] = data[i-1] + np.dot( M, data[i-1] )

This system contains 20 vectors and it is not easy to display all of the information.
The plot in Figure 19.19 shows just a few of the data vectors. The first element
increases in value as time progresses. Some of the others increase and some decrease.
Certainly, if the system contained hundreds of vectors and the relationships were complicated
then it would be difficult to use such a plot to understand the system.

Figure 19.19: The values of the variables in the system.

In this case the first two eigenvalues are used in computing the PCA space. The
resultant data is plotted in Figure 19.20. The 20 data points represent the state of the
system at the 20 time intervals. The first point, cffs[0], is close to (0,0), and in this case
the system is seen to create an outward spiral.

Figure 19.20: The evolution of the system.

The outward spiral indicates that the values in the system are increasing in magni-
tude. If this were to continue then the values of the system would approach infinity. This
is an unstable system. An inward spiral would indicate that the system is tending towards
a steady state in which the state vector stops changing.
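Whether the spiral grows or shrinks can also be predicted directly from the eigenvalues of the update matrix: the iteration v[t+1] = (I + M)v[t] expands whenever some eigenvalue of I + M has magnitude greater than 1. A quick check (a sketch, not from the book's scripts):

```python
import numpy as np

np.random.seed(6)
M = np.random.rand(5, 5)
M = M - M.mean(0)                      # zero column sums, as in Code 19.15

# Eigenvalues of the linear update matrix I + M.
lam = np.linalg.eigvals(np.eye(5) + M)
print(np.abs(lam).max())               # > 1 indicates an outward spiral
```

The eigenvalues of a non-symmetric matrix may be complex, so the magnitude (np.abs) is the quantity to examine.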
The most interesting cases are those where the spiral neither expands outward nor moves
inward. The system draws overlapping circles (or other types of enclosed geometries).
This indicates that the system has attained a stable oscillation. If the path exactly
repeats then the oscillations are exactly repeated. If the path stays within a finite
orbit then it describes a limited cycling of the system.
Code 19.16 generates a system in which values are not allowed to exceed a magnitude
of 1; the result is plotted in Figure 19.21. It starts in the middle and quickly begins looping
to the left. This system was run for 1000 time steps and it settles into an oscillation. There
is a regular cycle that repeats about every 20 time steps. The hard corners appear because
the system forces values to be no greater than 1 and this is a nonlinear operation. The
corners occur when one of the elements of the state vector drastically exceeds 1 and the
nonlinear restriction is employed.

Code 19.16 Computing data for a limit cycle.

1 >>> data = np.zeros( (1000,5) )
2 >>> for i in range( 1, 1000 ):
3         data[i] = data[i-1] + np.dot( M, data[i-1] )
4         mask = ( abs(data[i]) > 1 ).astype( int )
5         data[i] = (1-mask)*data[i] + mask*np.sign( data[i] )

Figure 19.21: A system caught in a limit cycle.

In a sensitivity analysis the cffs are just five data points in R2 space. The plot in
Figure 19.22 shows a set of +'s which represent the five dimensions of the first system.
The *'s represent the same five dimensions in the second system. The two data points
that moved apart the most are located around x = 5, y = −8. Printing the cffs reveals
which variable this is. The conclusion to draw is that the change in the system affected the
second variable more than the others. Likewise, the change in the system barely affected
the first and fourth variables.
Figure 19.22: Sensitivity analysis of the data.

19.6 First Order Nature of PCA

Consider a case in which the data consists of several images of a single face at different
poses. In this case, the face is a computer generated face and so there are no factors such
as deviations in expression, hair, glasses, etc. Figure 19.23 shows data mapped to a PCA
space.
As seen the pose of the face gradually changes from left to right. A possible conclu-
sion is that the PCA was capable of mapping the pose of the face.
This would be an incorrect conclusion. PCA can only capture first-order data. In
other words, it can compare pixel (i, j) of the images to each other but not pixel (i, j)
with pixel (k, l). The reason that the faces sorted as they did in this example is more
a function of the location of the bright pixels in the images. The idea of "pose" is not
captured; it is only because the face was illuminated from a single source that there was
a correlation between the pose and the regions of illumination.

19.7 Summary

The principal components are a new set of coordinates in which the data can be repre-
sented. These components are orthonormal vectors and are basically a rotation of the
original coordinate system. However, the rotation minimizes the covariance matrix of the
data and thus some of the coordinates may become unimportant. In this situation these
coordinates can be discarded and thus PCA space uses fewer dimensions to represent the
data than the original coordinate system.
Principal components can be computed using an eigenvector engine or singular value
decomposition. The NumPy package offers both and the interface is quite easy.

Figure 19.23: PCA map of face pose images.

Eigenvectors are also used to explore the progression of a system. Limit cycle plots using
eigenvectors indicate if the system is shrinking, expanding, or caught in some sort of
oscillation.

Problems

1. Given a set of N vectors, the eigenvalues of its covariance matrix turn out to be
1, 0, 0, 0, ... What does this imply?

2. Given a set of N vectors, the eigenvalues of its covariance matrix turn out to be 1, 1,
1, 1, ... What does this imply?

3. Given a set of purely random vectors, describe what you expect the eigenvalues to
be. Confirm your prediction.

4. Given a set of N random vectors of length D. Compute the covariance matrix. Com-
pute the eigenvectors. Compute the covariance matrix of the eigenvectors. Explain
the results.

5. Repeat the work to obtain Figure 19.21, but add ±5% noise to each iteration. Ex-
plain the new system plot.

Chapter 20

Codon Frequencies in Genomes

Codons are sets of three nucleotides that are used by the cell to determine which amino acid
to attach to a chain as the protein is created. There are 64 different codons but only 20
different amino acids, which means that many amino acids have multiple associated codons.
It is therefore possible that some genomes favor one codon over another in the DNA when
producing a gene. If this is true then it is possible to classify genomes according to their
codon frequencies. This chapter will explore this concept and show that for bacteria this
classification is achievable.

20.1 Codon Frequency Vectors

Figure 16.4 shows the conversion from codons to amino acids. Each codon is a set of three
nucleotides and the DNA for a gene should be a length that is divisible by three.
To compute the codon frequencies, the number of occurrences of each codon is
obtained and these counts are divided by the total number of codons. So for a single codon,
the frequency is,

fi = ci / N, (20.1)

where ci is the number of times that codon i was seen and N is the total number of codons.

20.1.1 Codon Table

The first step in counting the codons is to create a list of all of the possible codons. Once
set the order should not be changed. The function CodonTable shown in Code 20.1
creates a list of strings which are the 64 codons.
The function is called in line 13, and it returns a list of 64 strings of which the first
4 are printed to the console. This is the complete list of codons.

Code 20.1 The CodonTable function.

1 # codonfreq.py
2 def CodonTable():
3     abet = 'acgt'
4     answ = []
5     for i in range( 4 ):
6         for j in range( 4 ):
7             for k in range( 4 ):
8                 codon = abet[i] + abet[j] + abet[k]
9                 answ.append( codon )
10     return answ
11
12 >>> import codonfreq as cf
13 >>> codons = cf.CodonTable()
14 >>> len( codons )
15 64
16 >>> codons[:4]
17 ['aaa', 'aac', 'aag', 'aat']
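The same 64-entry table can be written more compactly with itertools.product. This is an alternative to CodonTable, not the book's version, and it yields the codons in the same order as the nested loops:

```python
from itertools import product

# All ordered triples over the DNA alphabet, in lexicographic order.
codons = [''.join(p) for p in product('acgt', repeat=3)]
print(len(codons), codons[:4])   # 64 ['aaa', 'aac', 'aag', 'aat']
```

Either construction is fine; the essential requirement is that the order is fixed once and never changed.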

20.1.2 Codon Counts

The next step is to count the number of codons in a string. This is performed in the
function CountCodons shown in Code 20.2. The inputs are a DNA string for a gene and
the codon list created by CodonTable. Line 3 gets the length of the input string and
line 4 creates a vector with 64 elements, all initially set to 0. This will hold the counts
of the codons. Line 5 starts the loop which goes from the beginning to the end of the
string, stepping three bases at a time. Thus the index i is always at the beginning of a
codon. Line 6 extracts a single codon and line 8 finds the position of this codon in
the codons list. The variable ndx is an integer that corresponds to the location of the
codon in codons. The first codon in the list is 'aaa', and so if cut were also 'aaa' then ndx
would be 0. Line 9 then increments the value in the vector for that position. In this way,
the vector accumulates the number of times each codon appears. Line 12 calls
this function, and the variable cts is a vector of 64 elements that are the counts of each
codon in the string dna.
Line 7 may seem unnecessary at first. However, there can be letters in a DNA
string other than A, C, G, or T. These letters indicate that a nucleotide does exist at the
position but it is not known what it is. So, line 7 makes sure that the codon consists
of only the four letters before it is counted.
It is also possible to create counts as a list instead of a vector. However, to compute
the frequencies the counts will all be divided by a single value. Therefore, a vector is
a better choice for containing the data. The most important point is that the order of
Code 20.2 The CountCodons function.
1 # codonfreq.py
2 def CountCodons(dna, codons):
3     N = len(dna)
4     counts = np.zeros(64)
5     for i in range(0, N, 3):
6         cut = dna[i:i+3]
7         if cut in codons:
8             ndx = codons.index(cut)
9             counts[ndx] += 1
10    return counts
11

12 >>> cts = cf.CountCodons(dna, codons)

codons cannot be changed in later processing or the counts will no longer correspond to
the correct codons.
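Because the counts vector is positional, the codon ordering must stay fixed. A dictionary avoids that dependence entirely; the following sketch (the function name count_codons_dict is hypothetical and not part of the codonfreq module) counts codons with a Counter keyed by the codon string itself:

```python
from collections import Counter

def count_codons_dict(dna):
    # Keys are the codons themselves, so no fixed ordering is required.
    counts = Counter()
    for i in range(0, len(dna) - 2, 3):
        cut = dna[i:i+3]
        # Skip codons containing ambiguous letters such as 'n'.
        if set(cut) <= set('acgt'):
            counts[cut] += 1
    return counts

print(count_codons_dict('aaacgtaaa'))
```

Converting such counts to frequencies still requires dividing every value by the total, and the Counter must be mapped back onto a fixed codon order before the vector mathematics used later in the chapter.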

20.1.3 Codon Frequencies

Computing the codon frequencies is easily performed by dividing the counts by the total number of counts. Code 20.3 shows the division of the vector by the sum of the vector in line 1. The result is the codon frequency vector; one property of a frequency vector is that its sum is 1.0, which line 3 confirms.

Code 20.3 Computing the codon frequencies.

1 >>> freqs = cts / cts.sum()
2 >>> freqs.sum()
3 1.0

Since this set of commands will be called multiple times it is prudent to create a
driver function. Code 20.4 shows the function CodonFreqs which does just that. It
creates the codon table, counts the codons and then computes the frequencies.

20.1.4 Frequencies of a Genome

A genome has several genes and the codon frequencies can be computed for each gene that has a sufficient length. Short genes are not used because the frequency vector is meaningless if there are only a few codons. For example, if the gene has fewer than 64 codons then it is impossible for every codon to appear even once. So for this section the minimum gene length is 3 × 64 = 192 bases.

Code 20.4 The CodonFreqs function.

1 # codonfreq.py
2 def CodonFreqs(dna):
3     codons = CodonTable()
4     cts = CountCodons(dna, codons)
5     freqs = cts / cts.sum()
6     return freqs

The frequency vectors for an entire genome are obtained by the GenomeCodon-
Freqs function shown in Code 20.5. The input is the Genbank file name. Line 3 reads
the data, line 4 obtains the entire DNA string, line 5 gets the keyword locations and line 6
obtains the location of all of the genes. Now the genome is read and ready to be analyzed.
In the for loop the variable g is one of the elements in the list glocs. Line 9 gets the coding DNA for a single gene. If this DNA is at least 192 bases long then the codon frequencies are computed and stored in a list named frqs. The call to this function is shown.

Code 20.5 The GenomeCodonFreqs function.

1 # codonfreq.py
2 def GenomeCodonFreqs(fname):
3     data = gb.ReadFile(fname)
4     dna = gb.ParseDNA(data)
5     klocs = gb.FindKeywordLocs(data)
6     glocs = gb.GeneLocs(data, klocs)
7     frqs = []
8     for g in glocs:
9         cdna = gb.GetCodingDNA(dna, g)
10        if len(cdna) >= 192:
11            f = CodonFreqs(cdna)
12            frqs.append(f)
13    return frqs
14

15 >>> import genbank as gb
16 >>> fname = 'Genbank/ae002161.gb.txt'
17 >>> frqs = cf.GenomeCodonFreqs(fname)

The number of frequency vectors must be the same as or less than the number of genes. In this case, there are 1110 genes and 1019 of them are of sufficient length; the remaining 91 genes were too short to be used.

20.2 Genome Comparison

This section will compare the codon frequency distribution for two genomes.

20.2.1 Single Genome

A single genome has many genes and so there is a distribution of values for each codon
frequency. The function Candlesticks creates a file that will display the statistics of the
codon frequencies for an entire genome. The call to this function is shown in Code 20.6.
This function receives the list of frequency vectors and a name used to write the data to
a file. The third argument is 0 which is the amount of horizontal shift used in the plot.
This is used when plotting more than one genome on the same plot as seen in the next
Section. This file can be read by GnuPlot or spreadsheets which can then create the plots.

Code 20.6 Calling the Candlesticks function.

1 >>> cf.Candlesticks(frqs, outname, 0)

Figure 20.1: The statistics for an entire genome.

The results are shown in Figure 20.1, which contains 64 bars. Each bar has a box and whiskers. The extent of the box is the average plus and minus one standard deviation. About 70% of the frequency values fit within the range of the box. The extent of the whiskers shows the highest and lowest frequency values. The short bars correspond to the codons that are very infrequent in this genome.
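The Candlesticks function itself is not listed here. A minimal sketch of the statistics it would need to write, assuming the box spans the mean plus and minus one standard deviation and the whiskers span the extremes (the function name codon_stats is hypothetical):

```python
import numpy as np

def codon_stats(freqs):
    # freqs: list of 64-element frequency vectors, one per gene.
    data = np.array(freqs)             # shape (ngenes, 64)
    mu = data.mean(axis=0)
    sd = data.std(axis=0)
    # box bottom, box top, whisker bottom, whisker top for each codon
    return mu - sd, mu + sd, data.min(axis=0), data.max(axis=0)
```

Writing the four arrays as columns next to a codon index yields a file that GnuPlot's candlesticks style or a spreadsheet can plot directly.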

20.2.2 Two Genomes

Now that the procedure has been established, comparing two genomes is straightforward.
Code 20.7 shows the process of comparing two genomes. Line 2 gathers the frequency
vectors for the first genome and line 3 creates the files suitable for plotting. Lines 5 and
6 repeat the process for a second genome. The last argument in line 6 is 0.3 which shifts
the plots of the second genome 0.3 units to the right. In this manner the two plots do not
overlap but are side-by-side. The result is shown in Figure 20.2. Only the first 20 of the
64 codon frequencies are shown for clarity. Otherwise, the plot becomes too dense to see
the details.
Code 20.7 Creating plots for two genomes.

1 >>> fname = 'Genbank/ae002161.gb.txt'
2 >>> frqs1 = cf.GenomeCodonFreqs(fname)
3 >>> cf.Candlesticks(frqs1, 'g1.txt')
4 >>> fname = 'Genbank/nc_000961.gb.txt'
5 >>> frqs2 = cf.GenomeCodonFreqs(fname)
6 >>> cf.Candlesticks(frqs2, 'g2.txt', 0.3)

Figure 20.2: The statistics for the first 20 codons for two genomes.

The third codon shows that the two boxes have a very small overlap. This indicates that the frequency of this codon is very different for the two genomes. Two other codons in this view also have little or no overlap. This plot shows less than 1/3 of the total number of codons. Thus, given a codon frequency vector randomly selected from one of the two genomes, it is possible to determine which genome it came from by examining the frequencies of a few

decision codons.

20.3 Comparing Multiple Genomes

The plot in Figure 20.2 shows only a part of the comparison of just two genomes. Comparing multiple genomes requires a different analysis technique. For this task, PCA will
be used as proposed by Kanaya et al.[Kanaya et al., 2001].
The protocol for this experiment is:

1. Gather the names of several bacterial genomes.


2. Compute the codon frequency lists for each genome.
3. Apply PCA to this collection of data.
4. Color code the data points in PCA space for each genome.
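Steps 2 and 3 can be sketched with numpy alone. The following is not the book's PCA code; it is a minimal SVD-based projection (the function name pca_project is hypothetical) that maps the stacked 64-element frequency vectors onto the first two principal components:

```python
import numpy as np

def pca_project(freqs, ncomp=2):
    # freqs: array-like of shape (ngenes, 64)
    X = np.array(freqs, float)
    X = X - X.mean(axis=0)                     # center each codon frequency
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:ncomp].T                    # coordinates in PCA space

# Stack the frequency lists of several genomes, project them together,
# and color each point by the genome it came from to build a map
# in the style of Figure 20.3.
```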

The result is shown in Figure 20.3. Each genome is assigned a different color.
The data was sufficiently organized in the PCA representation that each genome has its
own isolated territory. This indicates that codon frequencies are sufficient for classifying
bacterial genomes.

Figure 20.3: PCA mapping for several bacterial genomes.

Some genomes do overlap in this view. However, this is only the first two PCA axes, and it is always possible that groups which appear to overlap in this view are actually separated, which can be seen in other views.
In this particular case, some of the clouds do overlap and no particular view will contradict this point. This means that those genomes are quite similar with respect to their codon frequencies. This too is important information.

Projects

1. This chapter applies PCA analysis to the codon frequencies of bacteria. In this
project repeat the process for another set of genomes. For example, a project may
compare the codon frequencies for mammalian genomes.

2. This chapter reviewed the process for the coding regions of the bacterial genomes. Genomes have evolved over time and so the non-coding regions are also related to those of their ancestors. Determine whether genomes can be separated by the PCA process applied to codon frequencies in the non-coding regions of the bacterial genomes.

Chapter 21

Sequence Alignment

DNA sequences are complicated structures that have been difficult to decode. A strand of
DNA contains coding regions which produce genes and contains non-coding regions which
may or may not have functionality. As organisms evolved, genes were passed on, sometimes with small alterations or relocations. Since the non-coding regions are less important in many respects, they were often passed on with more alterations. These similarities allow us to infer the functionality of a gene by relating it to other genes with known function.
The main computational technique for accomplishing this comparison is to align
sequences. The purpose of alignment is to demonstrate the similarity of two (or more)
sequences. At first this sounds like an easy job. Each sequence has only four bases and it
should not be too hard to determine if the sequences are similar.
Like in most real-world problems, it is not that easy. Two sequences can differ
because of base differences. Two sequences can differ by having extra or missing bases.
Computationally, this becomes a more difficult problem to solve since smaller chunks of
the sequences will need to align differently. Another problem is that in a DNA strand the
beginning and ending of coding regions may not be known. Thus, between two strands
the important parts may be similar and the unimportant parts could be dissimilar which
is perfectly acceptable and should not deteriorate the score of the alignment of coding
regions. Another problem is that parts of the coding regions can be located in different
regions in a strand. For example a gene may be constructed from two different subsections
of the strand. There is no guarantee that these two sections will be located in the same
regions of the two strands. Still, a computer program will need to find similarities among
these strings.
This chapter will consider simple alignment algorithms and review the highly used
dynamic programming approach. Other, more complicated, approaches will be discussed
but not replicated.

21.1 Simple Alignment

This section will begin the presentation of alignment techniques with a simple alignment.
Its main use is to define terms and to show why simple alignments are of limited practical use.

21.1.1 An Alphabet

The algorithms contained here can be applied to strings from any source. Before they are
presented it is necessary to provide a few definitions. A string is an array of characters from
an alphabet. For the case of DNA the alphabet has only four characters (ACGT). Protein
sequences are made from an alphabet of 20 characters (ARNDCQEGHILKMFPSTWYV).
Certainly, strings from English text (26 letter alphabet) can be considered or from any
other language. Usually, the alphabet is represented with Σ, and for DNA the alphabet
is,
Σ = {A, C, G, T }. (21.1)

21.1.2 Considerations of Matching Sequences

The first step that needs to be considered is how to assign a score to an alignment. When
two letters align they should contribute positively to the score and mismatches should
contribute negatively.
A perfect match occurs when aligned letters from two strings are the same. Even in this simple setting, questions of measuring the quality of the match need to be considered.
In the following case two simple sequences are perfectly aligned. Should the measure of
alignment treat all of the letters equally? Is an alignment of a sequence AATT with itself
more important than the alignment of ACGT with itself?
These questions are further complicated by considering the function of the DNA. Multiple codons code for the same amino acid, so should the alignment of CGA with CGG (which encode the same amino acid) be scored differently than CGA with GGA?

21.1.3 Insertions and Deletions

Insertions and deletions (indels) are cases in which a base is added or removed from a
sequence. When identified these are denoted by a dash as in the following case.

ACGT
AC-T

The indels can arise from biological causes in which an offspring actually removes or
inserts a base. Other times indels can be caused by difficulties arising from the sequencing
process. It is possible that a base was not called correctly or that the signal was too

weak/noisy/imperfect to call the base. In any case the alignment process needs to consider
the possibility of indels. In the previous case the computer program would receive two
strings ACGT and ACT and would have to figure out that the deletion of a G has occurred.
This is a serious matter. If the sequences are very long (perhaps thousands of
bases) then there are thousands of locations where the indel can occur. Furthermore, the
sequences may have several indels and at one location multiple indels may need to be
considered. For sequences of significant length it is not possible to consider all possible
indels in a brute force computing fashion.
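The growth can be made concrete with a little counting. Treating the slots around a length-n sequence as places to distribute g indistinguishable gaps gives C(n + g, g) placements; this simplification ignores shifts and gaps in the other string, so the true search space is even larger:

```python
from math import comb

# Ways to distribute g gaps among the n + 1 slots of a length-n sequence,
# counted with repetition: C(n + g, g).
n, g = 300, 5
ways = comb(n + g, g)
print(ways)   # over 20 billion placements for just five gaps
```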

21.1.3.1 Rearrangements

Genes are encoded within DNA strands but a single gene may be coded in more than one
region of the strand. Thus, coding regions can contain non-coding regions within their
boundaries. Consider an example which has a strand consisting of xNxMx where x is a
non-coding region and the N and M are coding regions. It is possible that the distance
between N and M can change in another sequence. It is also possible that the new
sequence could be of the form xMxNx. Thus, during the alignment it may be necessary
to identify non-coding regions and lower their importance.
Given two sequences and the task of global alignment it is still necessary to be
concerned with the beginning and ending of the sequences. The sequencing technology
tends to have problems calling the very beginning and very end of sequences. Thus, the
actual sequence may be longer than necessary. For the case of xNx the leading and trailing
x part of the sequence may be any length and thus during the global alignment it may
still be necessary to exclude leading and trailing portions of a sequence.

21.1.3.2 Sequence Length

Another complicating factor is sequence length. Often the alignment algorithms are based
on the number of matches. Consider a case in which the sequences are 100 elements long
and 90 of them align. Consider a second case in which the sequences are 1000 elements
long and 800 of them align. In the second case the score can be higher since many more
elements aligned, but the percentage of alignment is greater in the first case. So, some
algorithms consider the sequence length when producing an alignment score.
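The two cases can be contrasted numerically; the values below simply restate the example:

```python
matches1, length1 = 90, 100     # first case
matches2, length2 = 800, 1000   # second case

raw_favors_longer = matches2 > matches1            # raw counts favor case 2
pct1, pct2 = matches1 / length1, matches2 / length2
print(raw_favors_longer, pct1, pct2)               # but 0.9 beats 0.8
```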

21.1.4 Simple Alignments

Aligning two strings sounds like a simple process, but simple alignment has long been abandoned in the majority of bioinformatics applications. This section explores simple alignment algorithms and the reasons that more sophisticated engines are needed.

21.1.4.1 Direct Alignment

This is an extremely simple concept. Given two sequences a score is computed by adding
a positive number for each match and a negative number is added for a mismatch. A
simple example with a total score of 4 is:

RNDKPKFSTARN
RNQKPKWWTATN
++-+++--++-+

An alignment between two different letters in this case counts as -1. An alignment
of a letter with a gap is also a mismatch but perhaps should be counted as a bigger
penalty, for example -2. In this fashion alignment with gaps is more discouraged than just
mismatched letters. Code 21.1 displays the function SimpleScore which performs this
comparison. In lines 3 and 4 the strings are converted to arrays in which each element is a single letter. Line 5 counts the number of matching characters and line 6 subtracts the number of mismatching characters. Lines 7 and 8 apply an additional penalty of -1 for each gap character, so a letter aligned with a gap costs -2 rather than the -1 of an ordinary mismatch. Some examples are shown.

Code 21.1 The SimpleScore function.

1 # simplealign.py
2 def SimpleScore(s1, s2):
3     a1 = np.array(list(s1))
4     a2 = np.array(list(s2))
5     score = (a1 == a2).astype(int).sum()
6     score = score - (a1 != a2).astype(int).sum()
7     ngaps = s1.count('-') + s2.count('-')
8     score = score - ngaps
9     return score
10

11 >>> import simplealign as sal
12 >>> sal.SimpleScore('AGTCGATCGATT', 'AGTCGATCGATT')
13 12
14 >>> sal.SimpleScore('AGTCGATCGATT', 'AGTCGATCGAAT')
15 10
16 >>> sal.SimpleScore('AGTCGATCGATT', 'AGTCGATCGA-T')
17 9

21.2 Statistical Alignment

In reality the mismatched characters in aligning amino acid sequences are not counted
equally. Through evolution some amino acid changes are more frequent than others. This

Table 21.1: The BLOSUM50 matrix.

      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A     5 -2 -1 -2 -1 -1 -1  0 -2 -1 -2 -1 -1 -3 -1  1  0 -3 -2  0
R    -2  7 -1 -2 -4  1  0 -3  0 -4 -3  3 -2 -3 -3 -1 -1 -3 -1 -3
N    -1 -1  7  2 -2  0  0  0  1 -3 -4  0 -2 -4 -2  1  0 -4 -2 -3
D    -2 -2  2  8 -4  0  2 -1 -1 -4 -4 -1 -4 -5 -1  0 -1 -5 -3 -4
C    -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
Q    -1  1  0  0 -3  7  2 -2  1 -3 -2  2  0 -4 -1  0 -1 -1 -1 -3
E    -1  0  0  2 -3  2  6 -3  0 -4 -3  1 -2 -3 -1 -1 -1 -3 -2 -3
G     0 -3  0 -1 -3 -2 -3  8 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4
H    -2  0  1 -1 -3  1  0 -2 10 -4 -3  0 -1 -1 -2 -1 -2 -3  2 -4
I    -1 -4 -3 -4 -2 -3 -4 -4 -4  5  2 -3  2  0 -3 -3 -1 -3 -1  4
L    -2 -3 -4 -4 -2 -2 -3 -4 -3  2  5 -3  3  1 -4 -3 -1 -2 -1  1
K    -1  3  0 -1 -3  2  1 -2  0 -3 -3  6 -2 -4 -1  0 -1 -3 -2 -3
M    -1 -2 -2 -4 -2  0 -2 -3 -1  2  3 -2  7  0 -3 -2 -1 -1  0  1
F    -3 -3 -4 -5 -2 -4 -3 -4 -1  0  1 -4  0  8 -4 -3 -2  1  4 -1
P    -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3
S     1 -1  1  0 -1  0 -1  0 -1 -3 -3  0 -2 -3 -1  5  2 -4 -2 -2
T     0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
W    -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1  1 -4 -4 -3 15  2 -3
Y    -2 -1 -2 -3 -3 -1 -2 -3  2 -1 -1 -2  0  4 -3 -2 -2  2  8 -1
V     0 -3 -3 -4 -1 -3 -3 -4 -4  4  1 -3  1 -1 -3 -2  0 -3 -1  5

is also complicated by the fact that some amino acids are more commonly seen than others.
Analysis of these changes have led to the creation of substitution matrices. There are
several versions depending on the mathematical methods employed and the evolutionary
time step. The most popular matrices are either PAM or BLOSUM and the number that
follows the name indicates the evolutionary time step. The matrices have some differences
but are similar enough that only one matrix will be used here.

21.2.1 Substitution Matrix

Since there are 20 amino acids the substitution matrix is 20×20. This matrix indicates the
log-odds of a substitution. The matrices may be presented in different arrangements and
so it is necessary to first define the alphabet that is associated with the use of a matrix.
In this case the alphabet is: ’ARNDCQEGHILKMFPSTWYV’.
The BLOSUM50 matrix is shown in Table 21.1. Each row and column is associated
with a letter. So the log-odds of an ‘A’ changing to an ‘R’ is -2.
The odds of an event occurring are based on the probability of the event normalized by the probabilities of its constituents. The range of the values for the odds can be large, and that is one of the reasons that log-odds are used. Negative values indicate that

the odds were less than 1 and positive values indicate that the odds were greater than 1. Thus, positive values in the table are events that are likely to occur, with larger values indicating a larger chance. As seen, all of the events in which an item remains unchanged (values down the diagonal of the matrix) are positive and are the largest values.
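The arithmetic behind a single log-odds entry can be sketched with hypothetical numbers (the frequencies below are invented; real substitution matrices also scale and round the log-odds before storing integers):

```python
import math

# p_ab: how often residues a and b are observed aligned in trusted blocks;
# q_a, q_b: the background frequencies of a and b.  All values hypothetical.
p_ab = 0.0050
q_a, q_b = 0.06, 0.05

odds = p_ab / (q_a * q_b)      # > 1: the pair occurs more often than chance
log_odds = math.log2(odds)     # positive when odds > 1, negative when < 1
```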
The score for an alignment of two amino acids comes from this table. For example,
a ‘D’ aligned with a ‘K’ has a score of -1. An example alignment is

R N D K P K F S T A R N
R N Q K P K W W T A T N
7 7 0 6 10 6 1 -4 5 5 -1 7

The blosum module contains both the BLOSUM50 matrix and the associated alpha-
bet. Code 21.2 shows a small part of the matrix and the entire alphabet.

Code 21.2 Accessing the BLOSUM50 matrix and its associated alphabet.

1 >>> import blosum
2 >>> blosum.BLOSUM50[:4,:4]
3 array([[ 5, -2, -1, -2],
4        [-2,  7, -1, -2],
5        [-1, -1,  7,  2],
6        [-2, -2,  2,  8]])
7 >>> blosum.PBET
8 'ARNDCQEGHILKMFPSTWYV'

21.2.2 Accessing Values

The next task is to obtain the correct value from the substitution matrix for a given
alignment. Consider the case in which the task is to compute the alignment score for
‘RNDKPKFSTARN’ with ‘RNQKPKWWTATN’. The first two letters are ‘R’ and ‘R’
and the task is to get the correct value from the BLOSUM matrix for this alignment.
The location of the target letter (in this case ‘R’) in the alphabet is shown in Code
21.3. Line 1 returns the location of the target letter and in this case this step is for both
strings. Line 3 then retrieves the value from the matrix for the alignment of an ‘R’ with
an ‘R’.

Code 21.3 Accessing an element in the matrix.

1 >>> blosum.PBET.index('R')
2 1
3 >>> blosum.BLOSUM50[1,1]
4 7

Table 21.2: Possible alignments.

abc..   abc.   abc   .abc   ..abc
..bcd   .bcd   bcd   bcd.   bcd..

Now consider the third letter in each string. The task is to align a ‘D’ with a
‘Q’. The process is shown in Code 21.4. The substitution matrix is symmetric and so,
blosum.BLOSUM50[3,5] = blosum.BLOSUM50[5,3].

Code 21.4 Accessing an element in the matrix.

1 >>> blosum.PBET.index('D')
2 3
3 >>> blosum.PBET.index('Q')
4 5
5 >>> blosum.BLOSUM50[3,5]
6 0

Now, the process is reduced to repeating these steps for each character position in
the strings. This is accomplished with the BlosumScore function shown in Code 21.5.
The inputs are the substitution matrix with its alphabet, the two sequences, and a gap
penalty. Since the strings could be different lengths it is necessary to find the length of the shorter string, which is done in line 4. Then the loop begins and situations with gaps
are first considered. The alignment score for letters starts in line 11. The indexes of the
two letters are retrieved and they are used to get the value from the BLOSUM matrix.
The score is accumulated in the variable sc and returned at the end of the function. An
example is shown.

21.3 Brute Force Alignment

Commonly, the two sequences are not aligned, but rather the alignment needs to be determined. The most simplistic and costliest method is to consider all possible alignments.
Consider the alignment of two small sequences abc and bcd. Table 21.2 shows the five
possible shifts to align the two sequences. The periods are used as place holders and do
not represent any data.
There are actually three types of shifts shown. The first two examples shift the
second sequence towards its right, the third example has neither shifted, and the last two
examples have the first sequence shifted to the right. It is cumbersome to create a program
that considers these three different types of shift. An easier approach is to create two new
strings which have the original data and empty elements represented by dots. In this case
the new string t1 contains the old string Seq1 and N2 empty elements (where N2 is the
length of Seq2). The string t2 is created from N1 empty elements and the string Seq2.

Code 21.5 The BlosumScore function.
1 # simplealign.py
2 def BlosumScore(mat, abet, s1, s2, gap=-8):
3     sc = 0
4     n = min([len(s1), len(s2)])
5     for i in range(n):
6         if s1[i] == '-' or s2[i] == '-' and s1[i] != s2[i]:
7             sc += gap
8         elif s1[i] == '.' or s2[i] == '.':
9             pass
10        else:
11            n1 = abet.index(s1[i])
12            n2 = abet.index(s2[i])
13            sc += mat[n1, n2]
14    return sc
15

16 >>> sc = sal.BlosumScore(blosum.BLOSUM50, blosum.PBET,
17                          'RNDKPKFSTARN', 'RNQKPKWWTATN')
18 >>> sc
19 49

t1 = ’abc...’
t2 = ’...bcd’
Finding all possible shifts for t1 and t2 is quite easy. By sequentially removing the
first character of t2 and the last character of t1 all possible shifts are considered. Table
21.3 shows the iterations and the strings used for all possible shifts. In this case iteration
2 would provide the best alignment.
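The trimming scheme of Table 21.3 can be sketched directly (the function name all_shift_pairs is hypothetical):

```python
def all_shift_pairs(seq1, seq2):
    # Pad as in the text, then repeatedly drop the last character of t1
    # and the first character of t2 to enumerate every shift.
    t1 = seq1 + len(seq2) * '.'
    t2 = len(seq1) * '.' + seq2
    pairs = []
    while t1:
        pairs.append((t1, t2))
        t1, t2 = t1[:-1], t2[1:]
    return pairs
```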
The number of iterations is N1 + N2 (the loop in line 7 runs over the full length of t1) and the result of the computation is a vector with the alignment scores for all possible alignments. Code 21.6 shows the function BruteForceSlide which creates the new strings and considers all possible shifts. The scoring of each shift is performed by BlosumScore but certainly other scoring functions can be used instead. Since BlosumScore is capable of handling strings of different lengths it is not necessary to actually create t2 in this function. This function uses the multiplication sign with a string in line 4 to create a string of repeating characters (see Section 6.4.1.3).
This function returns several values which are the alignment scores for every possible alignment between these two sequences. The best alignment is the one with the largest score. Its location indicates the shift of one of the sequences necessary to align
the sequences. Consider the case shown in Code 21.7. Lines 1 and 2 create two strings
which are similar except that one string also has a set of preceding ‘A’s. Thus, to obtain
the best alignment the string s1 needs to be shifted to the right by 5 spaces. The function

Table 21.3: Shifts for each iteration.

Iteration   Strings
0           abc...
            ...bcd
1           abc..
            ..bcd
2           abc.
            .bcd
3           abc
            bcd
4           ab
            cd
5           a
            d

Code 21.6 The BruteForceSlide function.


1 # easyalign.py
2 def BruteForceSlide(mat, abet, seq1, seq2):
3     l1, l2 = len(seq1), len(seq2)
4     t1 = len(seq2) * '.' + seq1
5     lt = len(t1)
6     answ = np.zeros(lt, int)
7     for i in range(lt):
8         answ[i] = BlosumScore(mat, abet, t1[i:], seq2)
9     return answ
10

11 >>> v = sal.BruteForceSlide(BLOSUM50, PBET, 'RNDKPKFSTARN', 'RNQKPKWWTATN')
12 >>> v
13 array([  0,  -1,   6,   0,  -3,  -8, -10, -13, -14,  -9,   2,  -5,  49,
14        -12,  -9,  -9, -16,  -4,  -7,  -1,  -1,  -3,  14,  -1])

BruteForceSlide is called in line 3 and the set of values are returned as a vector v.

Code 21.7 Aligning the sequences.

1 >>> s1 = 'RNDKPKFSTARN'
2 >>> s2 = 'AAAAARNQKPKWWTATN'
3 >>> v = sal.BruteForceSlide(blosum.BLOSUM50, blosum.PBET, s1, s2)
4 >>> len(s2) - v.argmax()
5 5
6 >>> '.'*5 + s1
7 '.....RNDKPKFSTARN'
8 >>> s2
9 'AAAAARNQKPKWWTATN'
10

11 >>> s1 = 'AAAAAAARNDKPKFSTARN'
12 >>> s2 = 'RNQKPKWWTATN'
13 >>> v = sal.BruteForceSlide(blosum.BLOSUM50, blosum.PBET, s1, s2)
14 >>> len(s2) - v.argmax()
15 -7
16 >>> s1
17 'AAAAAAARNDKPKFSTARN'
18 >>> 7*'.' + s2
19 '.......RNQKPKWWTATN'

In this vector there is a single value that is much higher than all of the others. The
location of this maximum value is obtained by v.argmax(). This location depends on
the shift necessary to align the two strings and the lengths of the strings. This value is
computed in line 4. Since the value in line 5 is positive, line 6 is used to create aligned strings. This command adds 5 periods in front of s1 so that it will align with s2 as shown in lines 7 and 9.
A second example starts in line 11 which creates the same strings except that s1
now has the additional characters in front. The same process is called except in this case
the value in line 15 is negative. Thus, the periods are inserted in front of s2 in order to
get the two strings to properly align.
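The argmax bookkeeping from Code 21.7 can be collected into one helper (the name pad_to_align is hypothetical; it assumes the padding convention used by BruteForceSlide):

```python
import numpy as np

def pad_to_align(s1, s2, v):
    # v is the score vector returned by BruteForceSlide(mat, abet, s1, s2).
    shift = len(s2) - int(np.argmax(v))
    if shift >= 0:
        return '.' * shift + s1, s2       # pad the first string
    return s1, '.' * (-shift) + s2        # pad the second string
```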

21.4 Dynamic Programming

The previous system is slow, simple, and effective as long as all of the nucleotides are known and evolution has not changed the DNA string lengths. However, neither of these is guaranteed to be true and often they are not. Therefore a more powerful method is

required.
Consider the alignment of two strings ‘TGCGTAG’ and ‘TGGTAG’. These two
strings are very similar and would align perfectly if there was an additional nucleotide
inserted into the second string at the third position. A gap is then inserted to space the strings so that they align properly, as in:

A = TGCGTAG
B = TG-GTAG

The difficulty in comparing two sequences then is knowing where to put the gaps.
Certainly, it is possible to attempt a gap at every location. In this case sequence A would be
compared to ‘TGGTA-G’, ‘TGGT-AG’, ‘TGG-TAG’, ‘TG-GTAG’, and ‘T-GGTAG’. This
is not a difficult task even for long strings. However, it is quite probable that gaps will be needed in more than one place. So, to perform a thorough study the strings ‘T-GGTA-G’, ‘T-GGT-AG’, etc. would also have to be considered. Furthermore, it may be necessary to consider more than one gap at a single location, so strings such as ‘TGGTA–G’ and ‘T-GGT–AG’ would also have to be considered. There are also possibilities of more than two gaps, and gaps may be necessary in the first string of the comparison as well. Aside from all of the possible locations for the gaps, each comparison requires several shifts of the strings just to find a best alignment. Obviously, the number of possible alignments is exponential with respect to the string lengths. Since many sequencing machines can provide information for strings with over 300 nucleotides, an exhaustive search is computationally prohibitive.
There are multiple methods to adapt to this problem. Programs such as BLAST start with small aligned segments and work towards larger alignments employing estimations. This method is very fast but may not find the best alignments. It is used to compare a DNA or protein string to a large library. Since the amount of data is vast, many alignments are returned that can be studied. Often this information is sufficient to understand the purpose of the query string even if some of the best alignments were not returned by the program.
The method of dynamic programming does a much better job at inserting gaps for
the best alignment but it is computationally more expensive.
The dynamic programming approach attempts to find an optimal alignment by con-
sidering three options for each location in the sequence and selecting the best option before
considering the next location. Each iteration considers the alignment of two bases (one
from each string) or the insertion of a gap in either string. The best of the three is chosen and the system then moves on to the next elements in the sequence.
To handle this efficiently the computer program maintains a scoring matrix and an
arrow matrix. This program will also use a substitution matrix such as BLOSUM or PAM.

21.4.1 The Scoring Matrix

The scoring matrix, S, maintains the current alignment score for the particular alignment.
Since it is possible to insert a gap at the beginning of a sequence the size of the scoring
matrix is (N1 + 1) × (N2 + 1) where N1 is the length of the first sequence. Consider
two sequences, S1 and S2 (Code 21.8) that contain some similarities. The lengths of the
sequences are N1 = 15 and N2 = 14 and thus the scoring matrix is 16 × 15.

Code 21.8 Creating two similar sequences.

1 S1 = 'IQIFSFIFRQEWNDA'
2 S2 = 'QIFFFFRMSVEWND'

Alignment with a gap is usually penalized more than any mismatch of amino acids, so
for this example gap = -8 but certainly other values can be used to adjust the performance
of the system. The alignment of the first character with a gap is a penalty of -8 and the
alignment with the first two characters with two gaps is -16, and so on. The scoring matrix
is configured so that the first row and first column consider runs of gaps aligning with the beginning of one of the sequences. Thus, the first step in constructing the scoring matrix
is to establish the first row and first column as shown in Figure 21.1.

Figure 21.1: The first column and row are filled in.

The next step is to fill in each cell in a raster scan. The first undefined cell considers
the alignment of I with Q or either one with a gap. There are three choices and the
selection is made by choosing the option that provides the maximum value,

S_{m,n} = max( S_{m-1,n} + gap,  S_{m,n-1} + gap,  S_{m-1,n-1} + B(a, b) )        (21.2)

where B(a, b) indicates the entry from the substitution matrix for residues a and b.
Normally, the first entry in the matrix is denoted as S1,1 , but in order to be congruent
with Python the first cell in the matrix is S0,0 and the first cell that needs to be computed
is S1,1 . To be clear it should be noted that this cell aligns the first characters in the two
strings S1[0] and S2[0], thus the indexing of the strings is slightly different than the
matrix locations. In the example the cell S1,1 considers the alignment of the first two
letters in each sequence. With m=1, n=1, a=’I’, and b=’Q’ the first cell has the following
computation,


−8 − 8

S1,1 = max −8 − 8 . (21.3)

0−3

and the obvious choice is the third option. The results are shown in Figure 21.2.
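This arithmetic is easy to verify in a few lines; the value -3 stands in for the BLOSUM50 entry B(I, Q) quoted in Equation (21.3):

```python
import numpy as np

gap = -8
# the three options of Equation (21.2) for cell S[1,1]:
# S[0,1] + gap, S[1,0] + gap, and S[0,0] + B('I','Q')
f = np.array([-8 + gap, -8 + gap, 0 + (-3)])
print(f.max())     # -3, the value stored in the scoring matrix
print(f.argmax())  # 2, the value stored in the arrow matrix
```

The max/argmax pair is exactly what the ScoringMatrix function below does for every cell.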

Figure 21.2: The S1,1 cell is filled in.

21.4.2 The Arrow Matrix

It is necessary to keep track of the choices made for each cell. Once the entire scoring
matrix is filled out it will be necessary to use it to extract the optimal alignment. Thus,
the algorithm requires the use of a second matrix named the arrow matrix. The arrow
matrix is used to find which cell was influential in determining the value of the subsequent
cell. In the previous example, the third choice was selected which indicates that S0,0 was
the cell that influenced the value of S1,1 . The arrow matrix will place one of three integers
(0,1,2) in the respective cells and so R1,1 would contain a 2.

21.4.3 The Initial Program

The dynprog module contains several functions that are used in dynamic programming.
The first function is ScoringMatrix which creates the scoring matrix and the arrow
matrix in a straightforward manner. This function is shown in Code 21.9. It receives a
substitution matrix and its associated alphabet along with the two strings to be aligned,
and it returns the scoring matrix and the arrow matrix.

Code 21.9 The ScoringMatrix function.

1 # dynprog.py
2 def ScoringMatrix(mat, abet, seq1, seq2, gap=-8):
3     l1, l2 = len(seq1), len(seq2)
4     scormat = np.zeros((l1+1, l2+1), int)
5     arrow = np.zeros((l1+1, l2+1), int)
6     scormat[0] = np.arange(l2+1) * gap
7     scormat[:,0] = np.arange(l1+1) * gap
8     arrow[0] = np.ones(l2+1)
9     for i in range(1, l1+1):
10         for j in range(1, l2+1):
11             f = np.zeros(3)
12             f[0] = scormat[i-1, j] + gap
13             f[1] = scormat[i, j-1] + gap
14             n1 = abet.index(seq1[i-1])
15             n2 = abet.index(seq2[j-1])
16             f[2] = scormat[i-1, j-1] + mat[n1, n2]
17             scormat[i, j] = f.max()
18             arrow[i, j] = f.argmax()
19     return scormat, arrow

These two matrices are one column and one row bigger than the two input strings. The new lengths are determined in line 3 and the matrices are created in lines 4 and 5. The first row and column are populated in lines 6 and 7. The dynamic programming computation begins with the loops in lines 9 and 10. Line 11 creates a three-element vector f which holds the three possible computations for a single cell in the scoring matrix. The three possibilities are computed in lines 12 through 16. The highest score is then selected to populate a cell in both the scoring matrix and the arrow matrix in lines 17 and 18.
The program is called in Code 21.10, and with that the hard part of the dynamic programming algorithm has been accomplished. This function does perform the required steps, but it contains a double nested loop, which in an interpreted language is slow. A faster version is shown in Section 21.4.5, but before that is reviewed the process of extracting the best alignment is pursued.

Code 21.10 Using the ScoringMatrix function.

1 >>> import dynprog as dpg
2 >>> s1 = 'IQIFSFIFRQEWNDA'
3 >>> s2 = 'QIFFFFRMSVEWND'
4 >>> scormat, arrow = dpg.ScoringMatrix(blosum.BLOSUM50, blosum.PBET, s1, s2)

21.4.4 The Backtrace

The final step is to extract the aligned sequences from the arrow matrix. The process starts
at the lower right corner of the arrow matrix and works towards the upper left corner.
Basically, the aligned sequences are created from back to front. Code 21.11 displays the
arrow matrix for the current example. In the lower right corner the entry is a 0 which
indicates that this cell was created from the first choice of Equation (21.2). It aligned the
last character of the first string with a gap and thus the current alignment is,

Q1 = ’A’
Q2 = ’-’

The value of 0 also indicates that the next cell to be considered is the one above the current position, since a letter from S2 was not used. This next location in the arrow matrix contains a 2, which indicates that two letters are aligned.

Q1 = ’DA’
Q2 = ’D-’

A value of 2 indicates that the backtrace moves up and to the left. Code 21.11 shows the arrow matrix, and in bold are the locations used in the backtrace. Each time a 0 is encountered a letter from S1 is aligned with a gap and the backtrace moves up one location. Each time a 1 is encountered a letter from S2 is aligned with a gap and the backtrace moves to the left. Each time a 2 is encountered letters from both sequences are used and the backtrace moves up and to the left.
Code 21.12 shows the Backtrace function. The two aligned strings being built are st1 and st2. The backtrace starts at the lower right corner and works its way up to the upper left in the while loop starting in line 8. For each cell it appends a letter or gap to each sequence depending on the value in the arrow matrix. Lines 9, 13, and 17 represent the three choices in Equation (21.2), and the test in line 22 stops the trace when it reaches the upper left corner. Within each choice there is an adjustment to st1 and st2 and then a change to the locations v and h. The strings are constructed in reverse order, so the last two lines of code reverse the strings into the correct order. The example call shows the alignment of the two test strings.

Code 21.11 The arrow matrix.
1 >>> arrow
2 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
3        [0, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
4        [0, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1],
5        [0, 0, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1],
6        [0, 0, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1],
7        [0, 0, 0, 0, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1],
8        [0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1],
9        [0, 0, 0, 0, 0, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1],
10       [0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1],
11       [0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 2, 1],
12       [0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 2, 1, 1, 1],
13       [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 1, 1, 1],
14       [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 2, 1, 1],
15       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 1],
16       [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 0, 2],
17       [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 0]])

21.4.5 Speed Considerations

The ScoringMatrix function works but is slow. The reason is that Python is an interpreted language and ScoringMatrix has a double nested loop. For a single alignment of sequences with about 300 bases the previous programs can find a solution in a reasonable amount of time. If the project requires many dynamic programming applications then speed becomes a serious issue. The goal then is to perform the same operations without using nested Python loops.
Consider again the function ScoringMatrix. Actually, this program is a triple nested loop. There are the two for loops in lines 9 and 10, but the third loop is a bit covert. The vector f has three elements and each of these is considered when computing lines 17 and 18. This loop is contained within the NumPy functions max and argmax. The goal is to perform the same operations but use NumPy functions to perform some of the loops, leaving only one loop written in Python.
This process is performed with three functions. The first is FastSubValues shown in Code 21.13. The computation for each cell in the scoring matrix will require a value from the substitution matrix. These values can be extracted using a single Python for loop. To accomplish this, random slicing techniques are employed (see Code 11.10). The FastSubValues function returns a matrix which contains the BLOSUM or PAM values that will be used at each location in the scoring matrix.
A partial result is shown starting in line 17. The first row and column in the scoring matrix will not use substitution values and so those are 0. The first nonzero element

Code 21.12 The Backtrace function.
1 # dynprog.py
2 def Backtrace(arrow, seq1, seq2):
3     st1, st2 = '', ''
4     v, h = arrow.shape
5     ok = 1
6     v -= 1
7     h -= 1
8     while ok:
9         if arrow[v, h] == 0:
10             st1 += seq1[v-1]
11             st2 += '-'
12             v -= 1
13         elif arrow[v, h] == 1:
14             st1 += '-'
15             st2 += seq2[h-1]
16             h -= 1
17         elif arrow[v, h] == 2:
18             st1 += seq1[v-1]
19             st2 += seq2[h-1]
20             v -= 1
21             h -= 1
22         if v == 0 and h == 0:
23             ok = 0
24     # reverse the strings
25     st1 = st1[::-1]
26     st2 = st2[::-1]
27     return st1, st2
28
29 >>> st1, st2 = dpg.Backtrace(arrow, s1, s2)
30 >>> st1
31 'IQIFSFIFRQ--EWNDA'
32 >>> st2
33 '-QIF-FFFRMSVEWND-'

Code 21.13 The FastSubValues function.
1 # dynprog.py
2 def FastSubValues(mat, abet, seq1, seq2):
3     l1, l2 = len(seq1), len(seq2)
4     subvals = np.zeros((l1+1, l2+1), int)
5     si1 = np.zeros(l1, int)
6     si2 = np.zeros(l2, int)
7     for i in range(l1):
8         si1[i] = abet.index(seq1[i])
9     for i in range(l2):
10         si2[i] = abet.index(seq2[i])
11     for i in range(1, l1+1):
12         subvals[i,1:] = mat[[si1[i-1]]*l2, si2]
13     return subvals
14
15 >>> subvals = dpg.FastSubValues(blosum.BLOSUM50, blosum.PBET, s1, s2)
16 >>> subvals[:4,:4]
17 array([[ 0,  0,  0,  0],
18        [ 0, -3,  5,  0],
19        [ 0,  7, -3, -4],
20        [ 0, -3,  5,  0]])

is subvals[1,1] which has a value of -3. The first two letters in the two sequences to be aligned are 'I' and 'Q'. The alignment of these two letters is computed in scormat[1,1], and in order to make this computation the algorithm needs the substitution value from BLOSUM for 'I' and 'Q'. This value is -3 and is thus located in subvals[1,1]. The rest of the matrix is populated with the substitution values that are needed for each cell's computation. While there are three for loops in FastSubValues the speed is still acceptable because none of these loops are nested.
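The integer-array slicing used in line 12 of FastSubValues can be seen in isolation. The three-letter alphabet and toy substitution matrix below are hypothetical stand-ins for PBET and BLOSUM50:

```python
import numpy as np

abet = 'ABC'
mat = np.array([[ 4, -1, -2],
                [-1,  5, -3],
                [-2, -3,  6]])  # toy substitution matrix, not BLOSUM
seq2 = 'CAB'
si2 = np.array([abet.index(c) for c in seq2])  # column indexes: [2, 0, 1]
# substitution values for aligning 'B' against every letter of seq2,
# extracted in one integer-array indexing call
row = mat[[abet.index('B')] * len(seq2), si2]
print(row.tolist())  # [-3, -1, 5]
```

Pairing a repeated row index with an array of column indexes pulls out one substitution value per letter of the second sequence, which is how each row of subvals is filled without an inner loop.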
One of the issues in creating the scoring matrix is that it is not possible to compute
a single row or single column at the same time. A cell requires knowledge of the cell
above and the cell to the left. However, it is possible to compute all of the values along
a diagonal as shown in Figure 21.3. The elements in a contiguous line can be computed
concurrently. Four such lines are shown, but they would continue until the lower right of
the matrix is reached. The Python for loop is then moving from one line to the next.
The computations for all of the values on a line must then be performed without a Python
for loop.

Figure 21.3: The lines indicate which elements are computed in a single Python command.

The next step is to obtain the indexes of all of the elements along a line. The first
line has only a single entry and the index for that element is [1,1]. The second line has
two entries and those indexes are [1,2] and [2,1]. The third line has three entries and so
on. The pattern is quite easy except when the bottom or right of the scoring matrix is
reached. This also is dependent on the shape of the matrix. In this case there are more
columns than rows and so the last row will be reached before the last column.

The function CreateIlist receives the lengths of the two protein strings and then
returns a list of indexes. Again this function uses a single Python for loop.

Code 21.14 The CreateIlist function.


1 # dynprog.py
2 def CreateIlist(l1, l2):
3     ilist = []
4     for i in range(l1 + l2 - 1):
5         st1 = min(i+1, l1)
6         sp1 = max(1, i-l2+2)
7         st2 = max(1, i-l1+2)
8         sp2 = min(i+1, l2)
9         ilist.append((np.arange(st1, sp1-1, -1), np.arange(st2, sp2+1)))
10     return ilist

Code 21.15 shows the purpose of CreateIlist by example. This example considers
strings of length 5 and 4. Line 1 calls the function. The first item in the list is the index
of the element for the first line similar to Figure 21.3. Line 5 shows the indexes for the
next line. The last line shows the case for the 5th diagonal line. In this example the last
row is reached and so the pattern is modified.

Code 21.15 Using the CreateIlist function.

1 >>> ilist = dpg.CreateIlist(5, 4)
2 >>> ilist[0]
3 (array([1]), array([1]))
4 >>> ilist[1]
5 (array([2, 1]), array([1, 2]))
6 >>> ilist[2]
7 (array([3, 2, 1]), array([1, 2, 3]))
8 >>> ilist[3]
9 (array([4, 3, 2, 1]), array([1, 2, 3, 4]))
10 >>> ilist[4]
11 (array([5, 4, 3, 2]), array([1, 2, 3, 4]))

The final function is FastNW, which is displayed in Code 21.16. It is similar to ScoringMatrix in function; however, there is only a single Python for loop. The variable f is now a matrix which contains the three dynamic programming choices for all elements along a diagonal. Its size is 3 × LI, where LI is the number of cells along the diagonal. The variable maxpos in line 21 holds the dynamic programming choice (see Equation (21.2)) for each element along the diagonal. The index i is not an integer but one of the items from the list ilist. So, lines 22 and 23 populate all of the elements in the scoring matrix and arrow matrix along that diagonal.

Code 21.16 The FastNW function.


1 # dynprog.py
2 def FastNW(subvals, seq1, seq2, gap=-8):
3     l1, l2 = len(seq1), len(seq2)
4     scormat = np.zeros((l1+1, l2+1), int)
5     arrow = np.zeros((l1+1, l2+1), int)
6     scormat[0] = np.arange(l2+1) * gap
7     scormat[:,0] = np.arange(l1+1) * gap
8     arrow[0] = np.ones(l2+1)
9     ilist = CreateIlist(l1, l2)
10     for i in ilist:
11         LI = len(i[0])
12         f = np.zeros((3, LI), float)
13         x, y = i[0]-1, i[1]+0
14         f[0] = scormat[x, y] + gap
15         x, y = i[0]+0, i[1]-1
16         f[1] = scormat[x, y] + gap
17         x, y = i[0]-1, i[1]-1
18         f[2] = scormat[x, y] + subvals[i]
19         f += 0.1 * np.sign(f) * np.random.ranf(f.shape)
20         mx = (f.max(0)).astype(int)  # best values
21         maxpos = f.argmax(0)
22         scormat[i] = mx + 0
23         arrow[i] = maxpos + 0
24     return scormat, arrow

Code 21.17 shows the two commands needed to create the scoring and arrow ma-
trices. These results are the same as those from ScoringMatrix but the computational
speed is significantly improved.

Code 21.17 Using the FastNW function.

1 >>> subvals = dpg.FastSubValues(blosum.BLOSUM50, blosum.PBET, s1, s2)
2 >>> scormat, arrow = dpg.FastNW(subvals, s1, s2)

21.5 Global and Local Alignments

There are two common cases in the application of alignments. The first is that the beginning and the end of the genes are known, and so the two strings that are to be aligned have a definite beginning and ending. This is a global alignment since the process aligns the entirety of both strings. It is named Needleman-Wunsch alignment after its creators, which is the reason that the fast function is called FastNW.
The second case is that the user has two strings of DNA and inside of these are
regions of interest but the beginning and ending of these regions is not known. Thus, the
user is interested in finding a portion of one string that aligns with a portion of the other.
This is called local alignment since the alignment result generally uses only a part of each
string. This is also called Smith-Waterman alignment and so the function that creates the
scoring and arrow matrices is named FastSW.
The Smith-Waterman algorithm is a local alignment process that attempts to find the best aligning substrings within the two strings. It accomplishes this through only a couple of modifications. The first is to adjust the selection equation so that no negative numbers are accepted,



    S_{m,n} = \max \begin{cases} S_{m-1,n} + \text{gap} \\ S_{m,n-1} + \text{gap} \\ S_{m-1,n-1} + B(a, b) \\ 0 \end{cases}    (21.4)

A modification to the backtrace is also required. Instead of starting at the lower right corner, the trace starts at the location that has the largest value in the scoring matrix. The backtrace proceeds in the same manner except that it stops when a value in the scoring matrix is 0. Thus, the trace does not necessarily reach the upper left corner. This new function is named SWBacktrace.
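The FastSW and SWBacktrace functions are not listed here, but the essential change is the clamp at 0 in Equation (21.4). The loop-based sketch below is hypothetical (slow, unlike the vectorized FastSW) and uses a simple match/mismatch score in place of a BLOSUM matrix:

```python
import numpy as np

def sw_score(seq1, seq2, match=5, mismatch=-1, gap=-8):
    # Smith-Waterman scoring matrix: like Needleman-Wunsch except that
    # any negative cell is clamped to 0 (the fourth option of Eq. 21.4)
    # and the first row and column are left at 0
    l1, l2 = len(seq1), len(seq2)
    S = np.zeros((l1 + 1, l2 + 1), int)
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            sub = match if seq1[i-1] == seq2[j-1] else mismatch
            S[i, j] = max(S[i-1, j] + gap,
                          S[i, j-1] + gap,
                          S[i-1, j-1] + sub,
                          0)
    return S

S = sw_score('KMTIFFMILK', 'NQTIFF')
print(S.max())  # 20: the TIFF/TIFF run scores 4 matches of 5
i, j = np.unravel_index(S.argmax(), S.shape)
print(int(i), int(j))  # 6 6: the same peak location as in Code 21.18
```

With this toy score the peak is 20 rather than the 26 produced by BLOSUM50, but the peak sits at the same cell and the matrix contains no negative values.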
An example is shown in Code 21.18. Two sequences are defined and the scoring matrix and arrow matrix are computed in line 4. The scoring matrix is shown, and the new algorithm prevents this matrix from having any negative values. The backtrace starts at the location with the largest value; in this case that value is 26 and is located at scmat[6,6]. The arrow matrix indicates which direction the backtrace follows, and it continues until the trace reaches a 0 element in the scoring matrix.
An example is shown in Code 21.19. The subvals are computed and the FastSW function creates the two matrices. These are fed into the SWBacktrace function to compute the locally aligned sequences. As seen, both input sequences contain 'TIFF' but the rest of the sequences are poorly matched, so only this portion is returned. In this case, gaps are not needed.

21.6 Gap Penalties

The programs as presented use a standard gap penalty. The cost of each gap is the same, independent of its location and independent of any consecutive run of gaps. In some views
Code 21.18 Results from the FastSW function.
1 >>> sq1 = 'KMTIFFMILK'
2 >>> sq2 = 'NQTIFF'
3 >>> subvals = dpg.FastSubValues(B50, PBET, sq1, sq2)
4 >>> scmat, arrow = dpg.FastSW(subvals, sq1, sq2)
5 >>> scmat
6 array([[ 0,  0,  0,  0,  0,  0,  0],
7        [ 0,  0,  2,  0,  0,  0,  0],
8        [ 0,  0,  0,  1,  2,  0,  0],
9        [ 0,  0,  0,  5,  0,  0,  0],
10       [ 0,  0,  0,  0, 10,  2,  0],
11       [ 0,  0,  0,  0,  2, 18, 10],
12       [ 0,  0,  0,  0,  0, 10, 26],
13       [ 0,  0,  0,  0,  2,  2, 18],
14       [ 0,  0,  0,  0,  5,  2, 10],
15       [ 0,  0,  0,  0,  2,  6,  3],
16       [ 0,  0,  2,  0,  0,  0,  2]])
17 >>> scmat.max()
18 26
19 >>> divmod(scmat.argmax(), 7)
20 (6, 6)

Code 21.19 A local alignment.

1 >>> sq1 = 'KMTIFFMILK'
2 >>> sq2 = 'NQTIFF'
3 >>> subvals = dpg.FastSubValues(B50, PBET, sq1, sq2)
4 >>> scmat, arrow = dpg.FastSW(subvals, sq1, sq2)
5 >>> t1, t2 = dpg.SWBacktrace(scmat, arrow, sq1, sq2)
6 >>> t1
7 'TIFF'
8 >>> t2
9 'TIFF'

a consecutive run of gaps should be more costly than isolated gaps. These approaches use an affine gap, which adds extra penalties for consecutive gaps. This does complicate the program somewhat, as it is now necessary to keep track of the number of consecutive gaps when computing Equation (21.4). For the purposes of this text, this option will not be explored.
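For a flavor of the idea, the usual affine scheme charges a large penalty to open a gap and a smaller one to extend it. The penalty values below are illustrative only, not taken from the text:

```python
def affine_gap(k, gopen=-10, gextend=-1):
    # total penalty for a run of k consecutive gaps:
    # one opening cost plus (k - 1) extension costs
    return gopen + (k - 1) * gextend

print(affine_gap(1))  # -10: a single gap
print(affine_gap(4))  # -13: far cheaper than four isolated gaps (-40)
```

Under such a scheme one long gap is preferred over several scattered gaps of the same total length.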

21.7 Optimality in Dynamic Programming

Dynamic programming can provide a good alignment but is it the very best? Consider
Code 21.20 in which two random sequences are generated that are each 100 elements in
length. A substring of 10 elements is copied from the first string and replaces 10 elements
in the second string. Thus, there are two random strings except for 10 elements that are
exactly the same, and the Smith-Waterman algorithm should align these 10 elements only.
Code 21.20 shows an example that returns sequences that are much longer than the
expected length of 10. The last elements match but the random letters in front of the
matching sequence do not.

Code 21.20 An example alignment.

1 >>> np.random.seed(5290)
2 >>> r = (np.random.rand(100)*20).astype(int)
3 >>> s1 = np.take(list(blosum.PBET), r)
4 >>> s1 = ''.join(s1)
5 >>> r = (np.random.rand(100)*20).astype(int)
6 >>> s2 = np.take(list(blosum.PBET), r)
7 >>> s2 = ''.join(s2)
8 >>> s2 = s2[:30] + s1[10:20] + s2[40:]
9 >>> subvals = dpg.FastSubValues(blosum.BLOSUM50, blosum.PBET, s1, s2)
10 >>> scormat, arrow = dpg.FastSW(subvals, s1, s2)
11 >>> t1, t2 = dpg.SWBacktrace(scormat, arrow, s1, s2)
12 >>> t1
13 'LGYTWFVTIQRMVQVDPLGPI'
14 >>> t2
15 'MAQLWNCSDMRMVQVDPLGPL'

Code 21.21 shows a worse example. The two sequences were generated randomly as in Code 21.20. Code 21.21 shows what was generated and the result of aligning the sequences with Smith-Waterman. The returned sequences are again much longer than 10 elements. The first ten elements match but the rest do not. This implies two things. First, the largest value in the scoring matrix was not at the end of the 10 element alignment but at some other place, 47 elements away from the end of the aligning strings. Second, there were no 0 values in the scoring matrix from this peak back to the beginning of the aligning elements.

Code 21.21 Returned alignments are considerably longer than 10 elements.

1 >>> s1
2 'KKPGHWMVRCKQGQKRVGLNRYMDNYSSPKNHMVRDHFHLWKWMPSENC
3 PAECWADKLWYIMKSCPADQPFTALKQVIAQTEEQVNYNNVGAHMAADSCT'
4 >>> s2
5 'GGFMEGCCTPMYARTCVCDHCIGRVSERINKQGQKRVGLNLVRHGILIW
6 HNFLVGNQVWPWLMECFQAAGSTNKVYIREVPQIRKAIDYSLQYTINIVYL'
7 >>> subvals = dpg.FastSubValues(B50, PBET, s1, s2)
8 >>> scmat, arrow = dpg.FastSW(subvals, s1, s2)
9 >>> t1, t2 = dpg.SWBacktrace(scmat, arrow, s1, s2)
10 >>> t1
11 'KQGQKRVGLNRYMDNYSSPKNHMVRDHFHLWKWMPSENC-PAECWADKLWYIMKSCP'
12 >>> t2
13 'KQGQKRVGLNLVRHGILIWHNFLVGNQ-V-WPWL-ME-CFQAAGSTNKV-YI-REVP'

Figure 21.4 shows an image of the scoring matrix in which the darker pixels represent
larger values in the matrix. The slanted streaks are jets that appear in the scoring matrix
when alignment occurs. The main jet is quite obvious and starts at scmat[10,31] because
the first two aligning elements are s1[10] and s2[31]. The alignment should be only 10
elements long with the jet ending at scmat[20,41], but the jet does not end there. Recall that the desire is to have the largest value in the scoring matrix at the location where the two alignments end. This is a large number and it influences the subsequent elements in the scoring matrix via Equation (21.4). The third option in this equation will have two non-similar characters, and the value returned from the BLOSUM matrix may be negative but not negative enough to drive this option to 0. Alignments after that may return positive values from the BLOSUM matrix and thus increase the values in the cells of the scoring matrix after the alignment has ceased to exist. This is not a trivial problem, as can be seen in Figure 21.4. The Smith-Waterman process returned a large number of characters after the alignment, and this alignment was not terminated by the fading of a jet. It was terminated because one sequence had reached its end.
While this method did return the true alignment, it also returned alignments of random characters. Thus, this is not the best alignment. It should also be noticed that viewing the scoring matrix in terms of its jets reveals other possible alignments. There are secondary jets that indicate other partial alignments between these two sequences.
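An image such as Figure 21.4 can be produced by mapping the scoring matrix to 8-bit grayscale so that larger scores become darker pixels. A minimal sketch (the function name and the small test matrix are invented for illustration; the resulting array could be displayed with matplotlib or saved with PIL):

```python
import numpy as np

def Score2Image(scmat):
    # scale to the range 0..1 and invert so that the largest
    # score maps to the darkest pixel (0) and the smallest to white (255)
    lo, hi = scmat.min(), scmat.max()
    scaled = (scmat - lo) / float(hi - lo)
    return (255 * (1.0 - scaled)).astype(np.uint8)

scmat = np.array([[0, 2, 5], [1, 18, 3], [0, 4, 26]])
img = Score2Image(scmat)
print(img[2, 2])  # 0: the maximum score renders black
print(img[0, 0])  # 255: the minimum score renders white
```

Rendered this way, the diagonal jets stand out as dark streaks against a light background.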

21.8 Summary

Sequences can have bases added or deleted either through biology or errors in sequencing.
The locations of these insertions or deletions are unknown and their numbers are also

Figure 21.4: A pictorial view of the scoring matrix. Darker pixels relate to higher values in the
matrix.

unknown. A brute force computation that considers all possible combinations of alignments with insertions and deletions is computationally too expensive. Thus, the field has adopted dynamic programming as a method of finding a good alignment with gaps.
Creating a dynamic programming alignment can be accomplished by following the algorithm's equations directly. However, this creates a double nested loop, which runs slowly in Python. Thus, a modified approach is used to push one of the loops down into the compiled code, leaving only one loop in the Python script. This makes the algorithm faster by at least an order of magnitude.

Problems

1. Create a random sequence and copy it. In the copy remove a couple of letters at
different places. Use NW to align these two sequences.

2. Repeat the above problem but change the gap penalty. Does the alignment change
if the gap penalty is -16? Does it change if it is -2?

3. Create a substitution matrix which is

    M[i, j] = \begin{cases} 5 & i = j \\ -1 & i \neq j \end{cases}

Align two amino acid sequences (of at least 100 characters) using the BLOSUM50
matrix and the above M matrix. Are the alignments significantly different?

4. Modify the BlosumScore algorithm to align DNA strings such that the 3-rd element
in each codon is weighted half as much as the other two.

5. Create a string with the form XAXBX where X is a set of random letters and A and
B are specific strings designed by the user. Each X can be a different length. Create
a second string with the form YAYBY where Y is a different set of random letters
and each Y can have a different length. Align the sequences using Smith-Waterman.
The scoring matrix will have two major maximum for the alignments of the A and
B regions. Modify the program to extract both alignments.

6. Is it possible to repeat Problem 5 where the second string is of the form YABY?

7. Create a program which aligns two halves of strings. For example, the first string,
str1, can be cut into two parts str1a and str1b where str1a is the first K characters
and str1b are the rest of the string. The second string is similarly cut into two
parts str2a and str2b at the same K. Align str1a with str2a (using Needleman-
Wunsch) and str1b with str2b. For each alignment compute the alignment score
using BlosumScore. Is it the same value as the alignment of str1 with str2?

8. Repeat Problem 7 for different values of K, where K ranges from 10 to N − 10 (N is the length of the strings). Did you find a case in which the alignment of the a and b parts performs better than the alignment of the original strings?

9. Align two proteins using a BLOSUM matrix. Replace the substitution matrix with M where

    M_{i,j} = \begin{cases} 5 & i = j \\ -1 & i \neq j \end{cases}

Repeat the alignment using this substitution matrix. Does it make much of a difference?

Chapter 22

Simulated Annealing

This chapter is a precursor to machine learning techniques and explores the process of
learning through simulated annealing.

22.1 Input to Output

In some experiments the input variables are known and the output results are known. The part that is missing is a mathematical model that can compute the outputs from the inputs. In some cases the model can come from a learning algorithm, which may provide an engine that computes outputs from inputs but does not provide a concise understanding of how that is accomplished.
The user can provide a mathematical model and allow the machine learning algorithm to determine the coefficients in that model. If the model is incorrect then the machine learning algorithm will fail to provide meaningful results.
Consider the case of an experiment with one input x and one output y. Three experiments are run and the results are shown in Table 22.1. This data is clearly not linear, and if the user used the model y = ax + b then a machine learning algorithm would not be able to find correct values of a and b.

Table 22.1: Simple Experiment

x y
1 0.4
2 1.5
3 3.3
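The nonlinearity of Table 22.1 can be confirmed with a quick least-squares sketch: a straight line leaves a visible residual, while a quadratic (three coefficients for three data points) fits exactly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.4, 1.5, 3.3])  # the data from Table 22.1

sse = {}
for deg in (1, 2):
    coef = np.polyfit(x, y, deg)                    # least-squares fit
    sse[deg] = ((np.polyval(coef, x) - y) ** 2).sum()
print(sse[1])  # about 0.082: a line does not fit the data
print(sse[2])  # essentially 0: a quadratic through three points is exact
```

No choice of a and b in y = ax + b can remove the degree-1 residual, which is exactly the failure mode described above.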

22.2 Simulated Annealing

Consider the equation

    \vec{x} \cdot \vec{w} = 2

where the elements of \vec{x} are known. The task is to find the elements of \vec{w} that make this equation true. For a three-dimensional case,

    x[0]w[0] + x[1]w[1] + x[2]w[2] = 2.

What are w[0], w[1], and w[2]?


In the simulated annealing approach, random values are assigned to w. Of course, the dot product of x and w will not initially provide the correct answer. The values of w are then changed in a controlled random fashion. If the new version of w provides a better answer (the dot product is closer to 2) then the new values are kept. If the new version of w is worse then the older values are kept.
At first the changes in w are allowed to be very large. However, as the iterations progress, the range of the allowed changes shrinks. This is the annealing portion of the algorithm.
This process is controlled by a cost function. Each version of w is tested by the cost function, and if the cost decreases then the new w is kept. The cost measures how poorly w performs; in this example it is how far the dot product is from the target value of 2.
Code 22.1 shows the CostFunction program for this example. The inputs are the vectors x and w and the target value N, which in this case is 2. Line 3 computes the dot product and line 4 computes the absolute difference from the target value. This difference is the cost that is returned.

Code 22.1 The CostFunction function.


1 # simann1.py
2 def CostFunction(x, w, N):
3     dotprod = np.dot(x, w)
4     err = abs(dotprod - N)
5     return err

The cost function is unique to each problem, so it has to be written anew for each application.
The second function is RunAnn, which is shown in Code 22.2. The inputs are x and N. Two optional inputs are the temperature temp and the annealing factor scaltmp. These control the magnitude of the allowed changes and how fast this range shrinks.
Line 4 creates the initial random vector w and sets up the initial variables. Line 9 creates a new version of w called guess. It is based on random variations of the current
Code 22.2 The RunAnn function.
1 # simann1.py
2 def RunAnn(x, N, temp=1.0, scaltmp=0.99):
3     L = len(x)  # number of elements in x
4     w = 2*np.random.rand(L) - 1
5     ok = True  # flag to stop iterations
6     costs, i = [], 0  # store costs from some iterations
7     cost = 999999  # start with some bogus large number
8     while ok:
9         guess = w + temp*(2*np.random.rand(L) - 1)
10         gcost = CostFunction(x, guess, N)
11         if gcost < cost:
12             w = guess + 0
13             cost = gcost + 0
14         if i % 10 == 0:
15             costs.append(cost)
16         i += 1
17         temp *= scaltmp
18         if cost < 0.01 or temp < 0.001:
19             ok = False
20     return w, np.array(costs)

w. The cost is computed in line 10, and if that cost is better then w becomes guess, as in line 12, and the new cost is remembered in line 13.
The cost of every 10th iteration is kept in a list named costs in line 15. The temperature controls the magnitude of the random variations in line 9, and it shrinks a little in line 17. If the cost is low enough then the ok flag is set to False and the iterations cease. This function returns the final version of w and the costs from every 10th iteration. These are plotted in Figure 22.1. Typical behavior is that the first few iterations make great improvements and then many iterations are needed to make small improvements.

Figure 22.1: The costs.

The process is run in Code 22.3. Line 2 runs the whole algorithm, and lines 3 and 4 show that the result did provide a vector that makes the dot product of x and w very nearly equal to 2.

Code 22.3 Using the RunAnn function.

1 # simann1.py
2 >>> w, c = RunAnn(x, N, 3)
3 >>> np.dot(w, x)
4 1.9998043019989318

22.3 A Perpendicular Problem

Consider a different case. How is it possible to tell if two vectors are perpendicular? One of the properties of perpendicular vectors is that their dot product is 0,

    \vec{x} \cdot \vec{y} = 0.

In N-dimensional space it is possible to generate N − 1 random vectors and then find a vector that is perpendicular to all of them. For example, if the problem were in 10-dimensional space then nine random vectors \vec{x}_i are generated. There should be another vector \vec{w} that is perpendicular to all of them, and therefore the following should be true,

    \vec{x}_i \cdot \vec{w} = 0 \quad \forall i.

These \vec{x}_i vectors are easily generated as in Code 22.4.
These ~xi vectors are easily generated in Code 22.4.

Code 22.4 The GenVectors function.


1 # simann2.py
2 def GenVectors(D=10, N=9):
3     vecs = np.random.ranf((N, D))
4     return vecs

The cost function measures how far away from 0 each of the dot products is. The simulation is shown in Code 22.5, which is quite similar to Code 22.2. The cost function is so simple that it is computed in a single line rather than in a separate function: the initial cost is computed in line 5 and the cost of each guess in line 10. The rest of the algorithm is similar to the previous case.
Code 22.6 shows the call to run the simulation and the results. Line 3 computes
the dot product of w with all of the x vectors. If the simulation were perfect then all of
the values shown would be 0. However, Line 18 of Code 22.5 allows the iteration to stop
before perfection is reached.

22.4 Speed

The speed at which the annealing occurs is important. This is controlled by scaltmp. If
it is too fast (lower values of scaltmp) then a solution will not be found. If it is too slow
(values very, very close to 1.) then the computations will take a long time. The command
in Code 22.7 will not produce a good answer because the decay is too fast.
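The effect of scaltmp on run length can be seen with a toy one-dimensional version of the annealing loop. The helper below is illustrative and not part of simann2.py; its stopping rule mirrors the temp < 0.01 test in RunAnn.

```python
import numpy as np

def count_iterations(scaltmp, temp=1.0, seed=0):
    """Count annealing iterations on a toy 1-D problem: drive x toward 0.

    An illustrative stand-in for RunAnn, used only to show how the
    decay rate scaltmp controls how long the annealing runs.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1)
    cost = abs(x)
    its = 0
    while cost >= 0.001 and temp >= 0.01:
        guess = x + temp * rng.uniform(-1, 1)  # random variation scaled by temp
        if abs(guess) < cost:
            x, cost = guess, abs(guess)        # keep the better guess
        temp *= scaltmp                        # cool the temperature
        its += 1
    return its
```

With scaltmp = 0.9 the temperature falls below 0.01 after at most 44 iterations, so the search is cut off almost immediately; with scaltmp = 0.9999 the loop may run for tens of thousands of iterations, which explains the trade-off described above.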

Code 22.5 The modified RunAnn function.
1 # simann2.py
2 def RunAnn(vecs, temp=1.0, scaltmp=0.99):
3     D = len(vecs[0])
4     target = 2*np.random.rand(D) - 1
5     cost = abs(np.dot(vecs, target)).sum()  # sum of inner prods
6     ok = 1
7     costs100, i = [], 0
8     while ok:
9         guess = target + temp*(2*np.random.rand(D) - 1)
10         gcost = (abs(np.dot(vecs, guess))).max()
11         if gcost < cost:
12             target = guess + 0
13             cost = gcost + 0
14         if i % 100 == 0:
15             costs100.append(cost)
16         i += 1
17         temp *= scaltmp
18         if cost < 0.001 or temp < 0.01:
19             ok = 0
20     return target, np.array(costs100)

Code 22.6 Using the RunAnn function.

1 >>> vecs = GenVectors()
2 >>> w, c = RunAnn(vecs, 1, 0.9999)
3 >>> np.dot(vecs, w)
4 array([-0.07009272, -0.00873013, -0.09353832, -0.00614509,
5        -0.0571718 ,  0.05866459, -0.04219014, -0.07652524,
6        -0.05635662])

Code 22.7 An example with a decay that is too fast.

1 >>> vecs = GenVectors()
2 >>> w, c = RunAnn(vecs, 1, 0.9)

22.5 Meaningful Answers

The computer algorithm will always produce an answer, but it may not be a good answer.
The problem may be the decay speed or an incorrect model. It is always smart to test the
answer.
Previously it was stated that if the number of dimensions is N, then N − 1 random
vectors are used. Consider a case in which the number of vectors exceeds N. According
to the theory, this should not work: there should be no vector ~w that is perpendicular
to all of the ~x vectors.
Code 22.8 shows the test in which 12 vectors of length 10 are created. These are
used as inputs to find the vector that is perpendicular to all 12. Of course this should fail.
However, the worst dot product of ~w with an ~x vector is close to 0. This indicates that
the test which should have failed was actually successful. How is this possible?

Code 22.8 Checking the answer.

1 >>> vecs = GenVectors(10, 12)
2 >>> w, c = RunAnn(vecs, 1, 0.9999)
3 >>> abs(np.dot(vecs, w)).max()
4 0.0053827760823740127

Nothing went wrong. There is a vector whose dot product with all 12 vectors is 0:
the vector of all zeros. If all elements in ~w were 0, then ~xi · ~w = 0 would be true for all
vectors. However, this is a trivial solution, and even though the math held, the answer is
not a valid one.
The point is that the algorithm did provide an answer, but it is up to the user to
determine whether this answer achieves their goals.
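A simple way to guard against trivial answers such as the all-zeros vector is to test both conditions explicitly. CheckAnswer below is an illustrative helper, not a function from the book:

```python
import numpy as np

def CheckAnswer(vecs, w, tol=0.001, minnorm=0.1):
    # A useful answer must make every dot product small AND have a
    # non-trivial magnitude; the all-zeros vector passes the first
    # test but fails the second.
    perp = abs(np.dot(vecs, w)).max() < tol
    nontrivial = np.linalg.norm(w) > minnorm
    return bool(perp and nontrivial)
```

For the x and y axis vectors in three dimensions, the z axis vector passes this test while the zero vector does not.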

22.6 Energy Surface

The simulated annealing algorithm continually attempts to find a better solution. The
solution space is often considered to be an energy surface. More cost is the same idea as
more energy, so the idea is to lower the energy by lowering the cost. A two-dimensional
energy surface is shown in Figure 22.2, and the goal is to find a solution at the
lowest point of the surface.
Simulated annealing starts with a single, random solution, which is equivalent to placing
a ball at a random place on the surface. The process then tries a new solution by
slightly altering the old solution, which is the same as moving the ball a small distance
in one direction. If the height of the ball is lowered, then the proposed solution is better
and it replaces the current solution. The process continues until the solution can not get
much better or some other stopping criterion is met.

Figure 22.2: An energy surface.

The energy surface that is shown is not too difficult, and probably any starting
location would lead to the same solution. The energy surfaces in real problems, though,
are not mapped out and may have many different wells. So, it is very possible for a
solution to move toward the nearest well even though it is not the deepest well. The term
for this is getting stuck in a local minimum. Without abandoning simulated annealing,
one way to handle local minima is to run the program several times, each with a different
starting point, and keep the best answer. Even then, the best answer found is not
guaranteed to be the best possible answer.
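The restart strategy can be wrapped in a few lines. The helper below is illustrative, not from the book; it expects a function that performs one annealing run and returns a (solution, cost) pair:

```python
import numpy as np

def restart_best(run_once, ntrials=10):
    """Run an annealing function several times and keep the lowest-cost
    result. run_once() must return a (solution, cost) pair.

    A generic wrapper (not from the book) for the multi-start
    strategy described above.
    """
    best_sol, best_cost = None, np.inf
    for _ in range(ntrials):
        sol, cost = run_once()
        if cost < best_cost:          # keep the best run seen so far
            best_sol, best_cost = sol, cost
    return best_sol, best_cost
```

Any of the RunAnn variants in this chapter could be passed in through a small lambda that repackages their return values.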

22.7 Text Data

In the previous sections, simulated annealing relies on the ability to slightly alter values
of the vector elements. Thus, some numbers would be changed by a small percent. A
value of 1.4 could become 1.5. Some data is not stored as numerical values but instead is
stored as textual data. DNA, for instance, is stored as a string of data. It is not possible
to slightly change the letter ’A’ to something else and so the simulated annealing process
must be altered.

22.7.1 Swapping Letters

Instead of changing single elements, the textual version will swap letters. For example the
proposed solution string could be

ABCDEFGH.

The swap would then propose

EBCDAFGH

as a possible solution.
The RandomSwap function shown in Code 22.9 performs this swap of two letters in a string
of any length. The string is the input a, and the length N is computed in Line 3. Line 4
creates a list of the integers from 0 to N − 1. This list is shuffled in Line 5, and the
first two integers indicate which two letters get swapped, which occurs in the
last three lines.

22.7.2 A Simple Example

Consider a very simple example of rearranging the letters of the input string to match a
given pattern. The purpose of this example is to simply show how the algorithm works.
A real application would be more complicated but the ideas and steps would be about the
same.

Code 22.9 The RandomSwap function.

1 # simann3.py
2 def RandomSwap(a):
3     N = len(a)
4     r = list(range(N))
5     np.random.shuffle(r)
6     q = a[r[0]]
7     a[r[0]] = a[r[1]]
8     a[r[1]] = q
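Since RandomSwap rearranges its argument in place, the input must be a mutable list rather than a string. The following self-contained demo repeats the function from Code 22.9 and applies it once:

```python
import numpy as np

def RandomSwap(a):
    # As in Code 22.9: swap two randomly chosen positions in place.
    N = len(a)
    r = list(range(N))
    np.random.shuffle(r)
    q = a[r[0]]
    a[r[0]] = a[r[1]]
    a[r[1]] = q

s = list('ABCDEFGH')   # a string must first be converted to a list
RandomSwap(s)
# The same letters remain, but exactly two positions have traded places.
```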

The first necessity is a cost function. A very simple one is shown in Code 22.10.
The input is the query string, and the cost is the number of letters that are different
from the target string. Obviously, this is a trivial task, but it should be evident that
this cost function can be replaced by a more complicated one designed in
accordance with the user's application. The function in Code 22.10 returns the number of
mismatched letters between the query and the target; a perfect match produces
a cost of 0.

Code 22.10 The CostFunction function.


1 # simann3.py
2 def CostFunction(query):
3     target = 'abcdefghijklmnopqrstuvwxyz'
4     cost = 0
5     for i in range(26):
6         if query[i] != target[i]:
7             cost += 1
8     return cost

Now the simulated annealing process is ready to be employed. The driver function,
AlphaAnn, is shown in Code 22.11. Lines 3 through 5 create a list with a random
arrangement of the 26 letters. The annealing process begins in Line 9, where newguess
is the proposed query. Two letters are swapped in Line 10, and the cost of this proposed
string is computed in Line 11. If the cost is better, then the newguess becomes the guess.
The iterations continue until the cost falls below 0.1, which in this case only occurs when
there is a perfect match.
Code 22.12 shows the call to AlphaAnn and the results. The output string has
become the target string.
The program has two default values as inputs, and that default configuration will
not produce the correct answer. So, the call in line 2 increases the initial temperature. It is also
possible to slow the temperature decay by increasing scaltmp to a value such as 0.9999.

Code 22.11 The AlphaAnn function.

1 # simann3.py
2 def AlphaAnn(temp=1.0, scaltmp=0.99):
3     abet = 'abcdefghijklmnopqrstuvwxyz'
4     guess = list(abet)
5     np.random.shuffle(guess)
6     ok = True
7     cost = 99999
8     while ok:
9         newguess = copy.copy(guess)
10         RandomSwap(newguess)
11         gcost = CostFunction(newguess)
12         if gcost < cost:
13             cost = gcost
14             guess = copy.copy(newguess)
15         temp *= scaltmp
16         if cost < 0.1 or temp < 0.01:
17             ok = False
18     return guess

Code 22.12 Using the AlphaAnn function.

1 >>> import simann3 as si3
2 >>> si3.AlphaAnn(100)
3 ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
4  'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
5  'u', 'v', 'w', 'x', 'y', 'z']

Simulated annealing may need to be run several times with different parameters to find
the best solution. This is a common practice.

22.7.3 Consensus String

This section presents a more realistic problem using the same ideas used in AlphaAnn. In
this new case there are several similar protein strings and the task is to find the consensus
string. The consensus string is the one string that best aligns with the set of input strings.
In this task there will be several protein strings, {xi , i = 0, ..., N − 1} and a single
query string q. The idea is to find the q that best aligns with all of the x’s.
This example will use the BLOSUM50 matrix and the BlosumScore function from
the blosum.py module to score the comparison between pairs of amino acids. Thus, the
score for a single acid in q is the sum of the scores of that amino acid compared to the
amino acids in the same position in all of the x strings.
There is no guarantee that the two strings will have the same length, and so Line
4 of BlosumScore finds the length of the shortest string. Line 5 begins the consideration of each pair of
letters. Lines 6 and 7 find the row and column numbers associated with the two letters, and
Line 8 retrieves the value from the BLOSUM50 matrix. The scores are summed in sc
and divided by the length of the shortest string to compute the final score. An example
is shown in Code 22.13.
Code 22.13 An alignment score.

1 >>> import simann4
2 >>> s1 = 'DRNAQMRN'
3 >>> s2 = 'DSNACMRN'
4 >>> score = simann4.BlosumScore(simann4.BLOSUM50,
5                                 simann4.PBET, s1, s2)
6 >>> score
7 4.625
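The BlosumScore listing itself lives in the blosum.py module and is not reproduced in this chapter. The sketch below is a guess that matches the line-by-line description above; the tiny matrix MAT and alphabet ABET are made up for illustration, since the real BLOSUM50 and PBET are defined elsewhere:

```python
def BlosumScore(mat, abet, s1, s2):
    # Score two strings position by position with a substitution matrix.
    n = min(len(s1), len(s2))      # length of the shortest string
    sc = 0
    for i in range(n):
        r = abet.index(s1[i])      # row for the letter from s1
        c = abet.index(s2[i])      # column for the letter from s2
        sc += mat[r][c]
    return sc / n                  # normalize by the shortest length

# Toy two-letter "substitution matrix" for illustration only.
ABET = 'AB'
MAT = [[2, -1],
       [-1, 3]]
```

Matching letters score positively and mismatches score negatively, so a higher average indicates a better alignment, just as with the real BLOSUM50.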

It will be necessary to compare a q string to several x vectors and the cost function
will be computed from all of them. Unfortunately, the cost function is not straightforward.
The BlosumScore computes a score which is better if the number is larger. The cost
function prefers the opposite where a lower number is better. In this case the score
is subtracted from a large number, based on the length of the string, so that a better
alignment produces a lower value and this is used for the cost.
Code 22.14 shows the function CostFunction, which receives four arguments. The
seqs is a list of strings which are the x strings. The query is the q string. The mat and
abet are the substitution matrix and associated alphabet. The score of each alignment is
computed, and the negative of this is accumulated in a variable named cost. This negative value
is added to a large number on the last line. The largest value in BLOSUM50 is 15, and so

the max score that can be achieved is 15 × L where L is the length of the query string.

Code 22.14 The CostFunction function.


1 # simann4.py
2 def CostFunction(seqs, query, mat, abet):
3     cost = 0
4     for i in range(len(seqs)):
5         sc = BlosumScore(mat, abet, seqs[i], query)
6         cost -= sc
7     return cost + 15*len(query)

This max alignment only occurs if both strings are filled with ‘W’s. Code 22.15
shows two examples. In the first there are two x strings and the query is not well matched
to either. The cost is computed to be 301.9. The second example starts in Line 5 in which
the q is changed to be the second string. As seen the cost this time is only 285.4. However,
the cost is not close to 0. In this problem the minimum cost is sought but it will not be
close to 0.

Code 22.15 Examples of the cost function.

1 >>> x = ['ARNDCQEGHILKMFPSTWYV', 'ARNDCQEHHILKMFPSTWYV']
2 >>> q = 'AAAAAAAAAAAAAAAAAAAA'
3 >>> simann4.CostFunction(x, q, simann4.BLOSUM50, simann4.PBET)
4 301.9
5 >>> q = 'ARNDCQEHHILKMFPSTWYV'
6 >>> simann4.CostFunction(x, q, simann4.BLOSUM50, simann4.PBET)
7 285.4

This test uses four x strings, and the goal is to find the q string that best aligns
with all of them. Initially, q is a random string from the amino acid alphabet,
and simulated annealing is used to find the best q. Code 22.16
creates the four x strings. These are similar to each other but not perfectly matched.
This case is different from the one in Section 22.7.2. In that case there were 26
letters and each one could be used just once. In this case the letters can be used multiple
times. So, it is not necessary to swap letters; the annealing process can simply
change a letter to another one in the alphabet. This is performed by the RandomLetter
function shown in Code 22.17. Line 3 gets a random number between 0 and
20 (the length of the alphabet). This variable r is then used in Line 5 to get a single
random letter from the alphabet. Lines 6 and 7 find a random location in the query string,
and Line 9 replaces the letter at that location with the random letter from Line 5. The
returned nquery is the modified string.
The function RunAnn in Code 22.18 performs the simulated annealing. The inputs

Code 22.16 The TestData function.
1 # simann4.py
2 def TestData():
3     seqs = []
4     seqs.append('ARNDCQEGHILKMFPSTWYV')
5     seqs.append('ARNDCQEHHILKMFPSTWYV')
6     seqs.append('ARNDCQEAHILKMFPSTWYV')
7     seqs.append('ARNDCQEAHILKMFPSTWYV')
8     return seqs

Code 22.17 The RandomLetter function.


1 # simann4.py
2 def RandomLetter(query, abet):
3     r = np.random.rand() * len(abet)
4     r = int(r)
5     rlett = abet[r]
6     r = np.random.rand() * len(query)
7     r = int(r)
8     nquery = copy.copy(query)
9     nquery[r] = rlett
10     return nquery

are the set of sequences, the substitution matrix, the associated alphabet, and the optional
arguments of temperature and decay constant. Lines 3 through 7 create the random string
q, which is the string to be modified. The iterations begin in Line 10. The new guess is
created in Line 11 and its cost is computed in Line 12. If this cost is lower, then the newguess
becomes the guess and the cost takes on the lower value. The process continues until one of
the conditions in Line 17 is met. The output is a string that best aligns with all of the
strings in the original set.

Code 22.18 The RunAnn function.


1 # simann4.py
2 def RunAnn(seqs, mat, abet, temp=1.0, scaltmp=0.99):
3     D = len(seqs[0])  # length of strings
4     r = (np.random.rand(D) * 20).astype(int)
5     guess = []
6     for i in r:
7         guess.append(abet[i])
8     ok = True
9     cost = 99999
10     while ok:
11         newguess = RandomLetter(guess, abet)
12         gcost = CostFunction(seqs, newguess, mat, abet)
13         if gcost < cost:
14             cost = gcost
15             guess = copy.copy(newguess)
16         temp *= scaltmp
17         if cost < 0.1 or temp < 0.01:
18             ok = False
19     return guess

The example is executed in Code 22.19. Line 1 generates the best guess, which is
returned as a list of single characters. Line 2 converts this to a string, and it is printed
on Line 3. Since the input data in seqs consisted of strings that were highly similar, it is
expected that the guess string would also be similar to each one of the input strings. The
first input string is printed in Line 5, and as seen there are strong similarities.

Code 22.19 Comparing the computed result to the original.

1 >>> guess = RunAnn(seqs, BLOSUM50, PBET, 1, 0.999)
2 >>> ''.join(guess)
3 'ARNDCQEAHILKMFPSTWYV'
4 >>> seqs[0]
5 'ARNDCQEGHILKMFPSTWYV'

The intent of the algorithm is to provide the sequence that best aligns with a set of
strings. Clearly the algorithm has put forth a viable candidate, but it can not be stated
that this is the very best possible string. Often such a statement is not possible to make,
and the user must understand that they have computed a very good answer that may
not be the best.
One method of securing confidence in the answer is to run the simulation several
times. Since there is a random start to the process, it is possible that different answers
can be produced. For this simulation, Line 1 in Code 22.19 was repeated 10 times. In
all 10 trials the answer was exactly the same as Line 3 in Code 22.19. While this does
not prove that this is the best string, it does add confidence that this is one of the best
possible strings.

Problems

1. Given two data points at (0,0) and (1,0) respectively. Use simulated annealing to
find a point that is a distance of 1.0 from both of the data points. Repeat several
times with different seeds. How many solutions are there?

2. Given two data points at (0,0) and (1,0) respectively. Use simulated annealing to
find a point that is equidistant between the two data points although the length of
that distance has no restriction. Run several times with different seeds. Does the
algorithm repeatedly produce the same result?

3. Create 3 random vectors of length 3. Create a simulated annealing process that
attempts to find a vector that has the same distance to all three vectors.

4. The previous problem should have a very good answer. If the number of random
vectors increases to 4 then it is highly likely that a perfect answer is not possible.
Use simulated annealing to find the best answer.

5. Repeat the previous problem for several cases in which the number of input vectors
is 3, 4, 5, 6, ..., 10. Plot the cost of each trial’s final answer versus the number of
vectors.

6. Given two sequences AGTCGTAGCA and ACTCTAGGCA, create a simulated an-
nealing program that will provide the best gapped alignment of these two sequences.

Chapter 23

Genetic Algorithms

Cases arise in which there is plenty of data generated but the optimal function that
could simulate the data is not known. For example, protein sequences are known and the
protein folding structures are also known, but missing is the knowledge of the function
that converts a protein sequence into a folded structure. This is a very difficult problem
with no easy solution. However, it illustrates the idea that plenty of data can be available
without knowing the exact function that associates them.
An approach to problems of this nature is to optimize a function through the use
of an artificial genetic algorithm (GA) . The idea of this system is that the GA contains
several genes each one encoding a solution to the problem. Some solutions are better than
others. New genes are generated from the old ones with the better solutions being more
likely to be used to create new solutions. The new generation of solutions should be better
than the previous and the process continues until a solution is reached or optimization has
ceased.
Genetic algorithms are quite easy to employ and provide good solutions to tough
problems. The downside to GAs is that they can be quite expensive to run on a computer.
Before delving into GAs, it is first worthwhile to review the simpler optimization scheme
of the previous chapter, which naturally leads into GAs.

23.1 Energy Surfaces

Both simulated annealing and GA procedures attempt to find a minimum in an energy surface,
but in different ways. Figure 23.1 shows a simple view of an energy surface, which can also
be considered an error surface. The ball indicates the position of a solution, and the
error that accompanies this solution is the y-value. The purpose of optimization is to find
the solution at the bottom of the deepest well.
In the case of simulated annealing there is a single solution that is randomly located
(since the initial target vector is random). Variations of this vector move this solution to a

Figure 23.1: A simple view of an energy surface.

different location. Of course, large variations equate to large displacements of the solution.
As the temperature is cooled the solution can not move around as much and eventually
gets trapped in a well, and further optimization moves the solution down towards the
bottom of the well. Of course, there is no guarantee that the solution will fall into the
correct well. The term “caught in a local minimum” is used to describe a solution that is
stuck in a well that is not the deepest.
The GA approach is different in that there are several solutions, which is similar to
placing several balls on the energy surface. The GA has a two-step optimization process.
First, new solutions are created from old solutions, which is equivalent to replacing the balls
on the surface with a new set that is likely to be closer to the bottoms of the wells.
The second step moves the balls slightly through a mutation step.
The optimization occurs mostly by creating a new set of solutions that is better than the
old set.

23.2 The Genetic Algorithm Approach

A simple GA iterates over a set of steps as listed.

1. Create a cost function


2. Initialize the GA
3. Score the current genes
4. Iterate until the solution is good enough, stable, or an iteration limit is reached
(a) Create the next generation
(b) Score the new generation
(c) Replace the old generation
(d) Mutate
(e) Score the new generation
(f) Check for a stop condition
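The steps above can be sketched end to end on a toy problem: minimize how far a vector's element sum is from a target value. Everything here (the cost function, population size, mutation rate, elitism guard) is illustrative and not taken from the book's ga.py module:

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(genes, target=5.0):
    # Step 1: the cost of each gene is how far its element sum is from target.
    return abs(genes.sum(axis=1) - target)

genes = rng.uniform(0, 1, (8, 4))                  # step 2: initialize the GA
for _ in range(300):                               # step 4: iterate
    c = cost(genes)                                # steps 3/4b: score
    order = np.argsort(c)
    elite = genes[order[0]].copy()                 # remember the best gene
    parents = genes[order[:4]]                     # the better half breeds
    sp = int(rng.integers(1, 4))                   # random splice point
    kids = [np.concatenate((parents[i][:sp], parents[(i + 1) % 4][sp:]))
            for i in range(4)]                     # step 4a: crossover
    genes = np.vstack([parents] + kids)            # step 4c: replace
    mask = rng.random(genes.shape) < 0.05          # step 4d: mutate ~5%
    genes[mask] = rng.uniform(0, 2, mask.sum())
    genes[0] = elite                               # elitism: best survives intact
    if cost(genes).min() < 0.01:                   # step 4f: stop condition
        break
```

The elitism line is one common safeguard: because mutation can damage any gene, carrying the best gene forward unchanged keeps the best cost from ever getting worse between generations.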

23.3 A Numerical GA

This section considers the case of applying a genetic algorithm to numerical data.

23.3.1 Initializing the Genes

The GA genes can be generated in several ways. Usually, the initial genes provide very
poor solutions, but that will change as the algorithm progresses. Two common methods
of generating the genes are:

1. Random vectors, and


2. Randomly selected data vectors.

Random vectors are just vectors with random values. However, the range of the
random values should match the range of the data values. If the values in the data vectors
range between -0.01 and +0.01 then the random values in the initial GA vectors should
also be in that range.
The second choice is to select random vectors from the data set. The advantage is
that these initial vectors are in the same range as the data vectors. The disadvantage
is that the selected starting vectors may be similar by coincidence, and data vectors that
are very dissimilar to the chosen ones are not represented. This is not a devastating
disadvantage, though.
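Both initialization methods can be sketched in one small function. InitGenes is an illustrative name, not a function from ga.py:

```python
import numpy as np

def InitGenes(data, NG, method='random'):
    # Two common initializations described above: random vectors drawn
    # from the data's value range, or rows sampled from the data itself.
    if method == 'random':
        lo, hi = data.min(), data.max()
        return lo + (hi - lo) * np.random.ranf((NG, data.shape[1]))
    # 'data' method: pick NG distinct rows of the data.
    idx = np.random.choice(len(data), NG, replace=False)
    return data[idx].copy()
```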

23.3.2 The Cost Function

The GA maintains several genes, and the performance of these needs to be evaluated by
either a scoring function or a cost function. A scoring function produces a larger
value for better performance, whereas a cost function produces a lower value for better
performance. The advantage of a cost function is that perfection is a cost of 0, while there
is no single value that is the perfect score for all applications.
The cost function depends on the application. For example, if the purpose of the
GA is to find a sequence that best aligns with several sequences then the cost function
would measure the mismatches between the GA gene sequence and the other sequences in
the input. So, this function is written anew for each deployment of the GA.
The example of finding a vector that is perpendicular to others is repeated in this
chapter except that a GA is used to find the solution instead of simulated annealing. The
first step is to create the cost function. This function should return a cost of 0 if the input
vector is perpendicular to all vectors in a set. The dot product can be used to measure if
two vectors are perpendicular. If ~a ⊥ ~b then ~a · ~b = 0.
This is actually a very easy cost function to create. Consider two matrices both of
which are created from vectors which are stored as rows. The matrix-matrix multiplication
of the first matrix and the transpose of the second matrix computes the dot product of
all possible pairs of vectors. Thus, the cost function for this application requires only two
lines of Python script as shown in Code 23.1. Line 3 computes all of the dot products and
line 4 computes the sum of their absolute values. The output is a vector where each value
is associated with one of the GA genes. Thus, if there are 8 GA genes then the output

will have 8 values. If any of the values is 0 then the associated GA gene has provided the
perfect solution.

Code 23.1 The CostFunction function.


1 # ga.py
2 def CostFunction(data, genes):
3     dprods = genes.dot(data.T)
4     cost = (abs(dprods)).sum(1)
5     return cost

23.3.3 The Crossover

Creating the next generation of solutions is a bit involved. The number of offspring is
usually equal to the number of parents and the offspring are generated in pairs. Thus
for each iteration two parents are chosen along with a splice point. The splice point is
a random location in the vectors and the first child is created from the first part of the
one parent and the second part of the other parent. The second child is created from
the second parts as shown in Figure 23.2. The parents are selected based upon their cost
functions such that the parents with a lower cost have a better chance of being selected
as one that will help generate the pair of children.

Figure 23.2: Two parents are spliced to create two children.
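The CrossOver function used in this section is not listed in the book because of its length. The sketch below is therefore a guess at its behavior, consistent with Figure 23.2 and the surrounding description: parents are chosen with probability inversely related to their costs, spliced at a random point, and the children are produced in pairs and returned as a list:

```python
import numpy as np

def CrossOver(genes, costs):
    # A guess at the omitted function, not the book's listing.
    NG, D = len(genes), len(genes[0])
    weights = costs.max() - costs + 1e-9      # low cost -> high weight
    probs = weights / weights.sum()
    kids = []
    while len(kids) < NG:
        # Pick two distinct parents, favoring low-cost genes.
        i, j = np.random.choice(NG, size=2, replace=False, p=probs)
        sp = int(np.random.randint(1, D))     # random splice point
        kids.append(np.concatenate((genes[i][:sp], genes[j][sp:])))
        kids.append(np.concatenate((genes[j][:sp], genes[i][sp:])))
    return kids[:NG]                          # returned as a list
```

Because splicing only rearranges whole elements, every value in a child comes from one of the parents at the same position, which is exactly the limitation that motivates the mutation step in the next subsection.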

The creation of the children vectors is performed by the CrossOver function which
is not shown due to its length. The inputs are the GA genes and the costs of each of them.
Code 23.2 shows the use of this function. Line 1 generates the data, which in this case
are five vectors in R6 . Thus, it should be possible to find one vector that is perpendicular
to these five. Line 2 creates the GA genes. These are candidates and if one of them is
perpendicular to all of the vectors in data then a solution is found. Of course, the genes
are generated with random values and so none of these should provide a good solution.
This example only creates four such genes, but usually in an application there are many
more.
The costs are computed in line 3 and shown in line 5. Of course, none of these are
near 0. Line 6 uses the CrossOver function to create the next generation of GA genes.
The variable kids is a list of vectors which are converted to a matrix in line 7. The reason

Code 23.2 Employing the CrossOver function.

1 >>> data = np.random.ranf((5, 6))
2 >>> genes = np.random.ranf((4, 6))
3 >>> costs = ga.CostFunction(data, genes)
4 >>> costs
5 array([ 7.49271855,  9.7418091 ,  6.19295613,  6.8607025 ])
6 >>> kids = ga.CrossOver(genes, costs)
7 >>> kids = np.array(kids)
8 >>> kcosts = ga.CostFunction(data, kids)
9 >>> kcosts
10 array([ 8.67938982,  5.67403124,  5.88282083,  7.80285385])

that kids is returned as a list is that this function also needs to be useful for cases in which
the GA is manipulating non-numeric data, such as finding the best aligning
string.
The costs of the kids are computed in line 8 and shown in line 10. It is expected
that some of the kids are better than any of the parents. This is seen to be true as two
of the kids have a lower cost than any of the parents. This process can be repeated as
shown in Code 23.3 and as seen the cost is even lower. However, the cost may not go to
zero and thus another step is needed.

Code 23.3 Employing the CrossOver function.

1 >>> genes = kids + 0
2 >>> costs = kcosts + 0
3 >>> kids = ga.CrossOver(genes, costs)
4 >>> kids = np.array(kids)
5 >>> kcosts = ga.CostFunction(data, kids)
6 >>> kcosts
7 array([ 5.67403124,  5.67403124,  6.92581981,  4.73237795])

23.3.4 Mutation

In the previous case there were 4 GA genes, and the children were created by mixing and
matching parts of the parents. The children, however, can not obtain any value that does
not come from a parent. Line 2 in Code 23.4 shows the first element of each of the four GA
genes, and Line 4 shows the first element of each of the four children genes. As seen, the values
in the children came directly from the parents. It is not possible for a child gene to
have a value other than those from the parents.
The Mutation function will change some of the values in the GA genes so that

Code 23.4 The first elements.
1 >>> genes[:, 0]
2 array([ 0.64494945,  0.13895447,  0.86637429,  0.05408412])
3 >>> kids[:, 0]
4 array([ 0.86637429,  0.86637429,  0.86637429,  0.64494945])

values other than those from the parents can be obtained. Usually, only 5% of the values
are changed; thus, for the case of 4 vectors with a dimension of 6, only one of the elements
will be changed. This function finds the maximum and minimum of the element values
(in the previous case the max is 0.866 and the min is 0.054) and then expands that range by
a small amount. The newly generated value is a random number from within this range.
The reason that the range is expanded a small bit is that the perfect answer may be a
value that is lower or higher than all of the values in the parent genes. For example, if the
perfect value for the first element in the answer vector is 0.9 then the mutation process
needs to be able to generate a random number that is larger than any of the current values
in the GA genes.
The percentage of elements that can be changed in this process can be altered.
Usually, for numerical data a change of 5% of the total number of elements is acceptable.
If this value is too high then the GA algorithm does not benefit enough from the crossover
and if the value is too low then finding the correct solution can be a very long process.
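The Mutate function is likewise not listed in the book. A sketch consistent with the description above — change roughly 5% of the elements in place, drawing replacements from the current value range expanded by a small amount — could be:

```python
import numpy as np

def Mutate(genes, rate=0.05):
    # A guess at the omitted function, not the book's listing.
    # Overwrite about `rate` of the elements with random values from
    # the current range, padded so new extremes remain reachable.
    lo, hi = genes.min(), genes.max()
    pad = 0.1 * (hi - lo)                 # the small expansion (assumed 10%)
    lo, hi = lo - pad, hi + pad
    mask = np.random.ranf(genes.shape) < rate
    genes[mask] = lo + (hi - lo) * np.random.ranf(int(mask.sum()))
```

Like the call `Mutate(folks, 0.05)` in Code 23.5, this version modifies the gene matrix in place and returns nothing.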

23.3.5 Running the GA Algorithm

All of the components are in place, and so it is possible to run the GA algorithm. Code
23.5 shows the function DriveGA, which performs a typical run. The input is a set of
vectors, and the goal is to find a single vector that is perpendicular to all of them. Thus,
the number of vectors needs to be one less than the number of dimensions. The other inputs are the
number of GA genes, the dimension of those genes, and a tolerance.
The random GA genes are created in line 3 and their cost is computed in line 4.
The children and their costs are computed in lines 7 through 9. A mutation is enforced
and the new costs are determined. The best cost and location of that cost are collected
in lines 13 and 14 and if one of the GA genes has a cost that is below the tolerance then
the program terminates and returns that best GA gene.
A single run is shown in Code 23.6. The input contains two vectors of which the
answer is known. These are vectors pointing in the x and y directions in three dimensions
and so the answer should point in the z direction and as seen this is true within the
specified tolerance.
This answer can be confirmed; however, the cost function expects a matrix for
the genes input. So, line 5 converts the vector best into a single-row matrix. This is a
suitable input for the CostFunction function, and as seen in line 7, the cost of this gene is

Code 23.5 The DriveGA function.
1 # ga.py
2 def DriveGA(vecs, NG=10, DM=10, tol=0.1):
3     folks = np.random.ranf((NG, DM))
4     fcost = CostFunction(vecs, folks)
5     ok = 1
6     while ok:
7         kids = CrossOver(folks, fcost)
8         kids = np.array(kids)
9         kcost = CostFunction(vecs, kids)
10         folks = kids + 0
11         Mutate(folks, 0.05)
12         fcost = CostFunction(vecs, folks)
13         best = fcost.min()
14         besti = fcost.argmin()
15         if best < tol:
16             ok = 0
17     return folks[besti]

Code 23.6 A typical run.

1 >>> data = np . array (((1 ,0 ,0) ,(0. ,1 ,0) ) )


2 >>> best = ga . DriveGA ( data , NG =4 , DM =3 , tol =0.01)
3 >>> best
4 array ([ 0.00470562 0.0029165 0.53858752])
5 >>> best = np . array ( [ best ] )
6 >>> ga . CostFunction ( data , best )
7 array ([ 0.00762212])

below the tolerance.
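
The perpendicularity can also be checked directly, without the cost function. This short sketch (an illustration, not from the text) reuses the numbers from the run above and takes the dot product of the answer with each input vector; both magnitudes fall below the 0.01 tolerance, and their sum reproduces the CostFunction value.

```python
import numpy as np

# Reusing the values from Code 23.6: the dot product of the answer with
# each input vector should be near zero if the answer is perpendicular.
data = np.array(((1., 0., 0.), (0., 1., 0.)))
best = np.array([0.00470562, 0.0029165, 0.53858752])
dots = np.abs(data @ best)
print(dots)          # both components are below the 0.01 tolerance
print(dots.sum())    # 0.00762212, matching the CostFunction output
```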

23.4 Non-Numerical GA

In the previous example the genes in the GA were numerical vectors. There are cases,
especially in bioinformatics, in which the information being manipulated is based on letters
instead of numbers. The GA is a flexible approach that allows adaptations to suit
particular applications. To demonstrate this by example, the small
problem of sorting data will be used.
In this problem the goal of the GA is to sort letters of the alphabet. This will use
a trivial cost function of matching a sequence from a gene to a target sequence. More
complicated applications will require a more complicated cost function, but the rest of
this section should be usable without significant alteration.

23.4.1 Manipulating the Strings

Before the GA can be applied to text data it is important to review some of the methods by
which strings are manipulated in Python. First, the lowercase alphabet can be retrieved
by typing it directly or by using the ascii_lowercase string from the string module. Line
2 in Code 23.7 retrieves this string and converts it to a list of individual letters. This
conversion is necessary since it is not possible to change the contents of a string directly.

Code 23.7 Copying textual data.

1 >>> import string, copy
2 >>> abet = list( string.ascii_lowercase )
3 >>> folks = []
4 >>> ape = list( abet )
5 >>> np.random.shuffle( ape )
6 >>> folks.append( copy.copy( ape ) )
7 >>> np.random.shuffle( ape )
8 >>> folks.append( copy.copy( ape ) )

The GA will need to start with several random genes. In the numerical case this was
a vector of random values. In the text case it will need to be a string of randomly arranged
letters. Each GA gene will need to have all 26 letters but arranged in a different order.
Line 3 is an empty list that will eventually contain these randomly arranged alphabets.
Line 4 creates a duplicate alphabet named ape. This is a list of single characters and
not a string. This list can be rearranged using the shuffle function from the np.random
module. The contents of ape are rearranged and this list is appended in line 6. Note that
the copy function is used to create a duplicate of this list. If folks.append(ape) is used

then each entry in the list will refer to the same list in memory rather than being an
independent copy. Each time that ape is changed, all of the entries inside folks will also
change. The use of copy.copy creates an independent copy of the current arrangement of
the letters and appends it to folks.
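
This aliasing pitfall can be demonstrated directly. The following small sketch (using a four-letter list for brevity; the variable names are illustrative) shows that appending without a copy stores two references to the same list, so a single shuffle changes "both" entries, while copy.copy produces independent snapshots.

```python
import copy
import numpy as np

# Appending without copy stores references to the same list object.
ape = list('abcd')
shared = []
shared.append(ape)
shared.append(ape)
np.random.shuffle(ape)
print(shared[0] is shared[1])            # True: both entries are one object

# Appending a copy captures an independent snapshot each time.
independent = []
independent.append(copy.copy(ape))
np.random.shuffle(ape)
independent.append(copy.copy(ape))
print(independent[0] is independent[1])  # False: separate lists
```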
The Jumble function shown in Code 23.8 creates the random strings. The inputs
are the alphabet, which in this case is all lowercase letters. However, this function is
adaptable to other applications. For example, if random DNA strings are desired then
abet is a list of the four DNA letters. The variable ngenes is the number of genes desired.
For GA applications this should be an even number since the child genes are created in
pairs.

Code 23.8 The Jumble function.

1 # gasort.py
2 def Jumble( abet, ngenes ):
3     folks = []
4     ape = copy.copy( abet )
5     for i in range( ngenes ):
6         np.random.shuffle( ape )
7         folks.append( copy.copy( ape ) )
8     return folks

Code 23.9 calls the Jumble function and demonstrates that the GA genes are
different from each other. Line 5 converts the list of characters back to a string for
easy viewing. The join function is reviewed in Code 6.38.

Code 23.9 Using the Jumble function.

1 >>> import gasort
2 >>> np.random.seed( 1256 )
3 >>> abet = list( string.ascii_lowercase )
4 >>> folks = gasort.Jumble( abet, 10 )
5 >>> ''.join( folks[0] )
6 'tcdesokupmyzvahrqgnjwxilfb'
7 >>> ''.join( folks[1] )
8 'nwicamvfxqdjterzplouhgkysb'

23.4.2 The Cost Function

Every application of the GA algorithm requires a unique cost function. In this application
the goal is to create a string that is sorted in alphabetical order. This is a very simple
application, but the goal is to demonstrate how the functions are used instead of generating
new, previously unknown answers. The CostFunction function in Code 23.10 implements

this simplistic cost function. Basically, it compares every list of characters in genes to the
target and counts the number of mismatches. Thus, a perfect cost is 0 and the absolute
worst cost is 26. As seen in the last lines, random strings have a high cost, but this is
expected.

Code 23.10 The CostFunction function.

 1 # gasort.py
 2 def CostFunction( target, genes ):
 3     NG = len( genes )    # number of genes
 4     cost = np.zeros( len( genes ) )
 5     k = 0
 6     for gene in genes:
 7         c = 0
 8         for i in range( len( target ) ):
 9             if target[i] != gene[i]:
10                 c += 1
11         cost[k] = c
12         k += 1
13     return cost
14
15 >>> fcost = gasort.CostFunction( string.ascii_lowercase, folks )
16 >>> fcost
17 array([ 25., 26., 26., 25., 24., 24., 25., 24., 25., 26.])
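
These high costs for random genes are predictable: a uniformly random permutation of 26 letters has, on average, exactly one fixed point, so the expected cost is 25. A quick simulation (an illustration, not from the text) confirms this.

```python
import numpy as np

# Average mismatch count of random permutations against a sorted target.
np.random.seed(0)
target = np.arange(26)
costs = [np.sum(np.random.permutation(26) != target) for _ in range(2000)]
print(np.mean(costs))   # close to 25, since a random permutation averages one fixed point
```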

23.4.3 The Crossover

The CrossOver function is capable of creating children genes for either numerical or
textual data. Therefore, a new CrossOver function is not required. However, the
CrossOver function will produce some children that are undesirable for this particu-
lar application. In this project it is required that all strings have each character only once.
The CrossOver function will produce children genes that violate this rule. Therefore,
it is necessary to create a new function that will ensure that all of the children have the
requisite alphabet.
Code 23.11 shows the Legalize function that ensures that all genes have each letter.
The inputs are the valid letters and a single GA gene. So, if there are 10 GA genes this
function will need to be called 10 times. Lines 6 and 7 count the number of times each
letter occurs in the gene. If the gene were legal then this count would be 1 for all letters.
However, if a letter is duplicated then another letter is missing. So, lines 8 and 9 get
the indexes of the missing letters and of the duplicated letters. For example, if the letter ‘a’
occurs twice and the letter ‘c’ is missing then mssg would be a list with the single entry 2

because valid[2] is the letter ‘c’. Likewise, the list of duplicates, dups, would have a
single entry 0 indicating that it is the first letter in valid that is duplicated. If the gene
has more letters that are duplicated and missing then the lists mssg and dups would be
longer.

Code 23.11 The Legalize function.

 1 # gasort.py
 2 def Legalize( valid, gene ):
 3     LV, LG = len( valid ), len( gene )
 4     cnts = np.zeros( LV, int )
 5     lgene = list( gene )
 6     for i in range( LV ):
 7         cnts[i] = lgene.count( valid[i] )
 8     mssg = np.nonzero( cnts==0 )[0]
 9     dups = np.nonzero( cnts==2 )[0]
10     np.random.shuffle( dups )
11     for i in range( len( mssg ) ):
12         k1 = lgene.index( valid[dups[i]] )
13         k2 = lgene.index( valid[dups[i]], k1+1 )
14         if np.random.rand() > 0.5:
15             me = k1
16         else:
17             me = k2
18         gene[me] = valid[mssg[i]]
19
20 >>> test = list( 'abadefghijklmnopqrstuvwxyz' )
21 >>> gasort.Legalize( string.ascii_lowercase, test )
22 >>> ''.join( test )
23 'abcdefghijklmnopqrstuvwxyz'
24 >>> test = list( 'abadefghijklmnopqrstuvwxyz' )
25 >>> gasort.Legalize( string.ascii_lowercase, test )
26 >>> ''.join( test )
27 'cbadefghijklmnopqrstuvwxyz'

The for loop starting in line 11 contains the process of replacing the duplicates with
the missing letters. The duplicate list is rearranged in line 10. The variables k1 and k2 are
the indexes of the two occurrences of a duplicate. Lines 14 through 17 determine which of
the two occurrences is to be replaced, and line 18 performs the replacement.
Two tests are shown in this code. Line 20 creates a test string in which the letter
‘c’ is missing and the letter ‘a’ is duplicated. The first test, with the result in line 23,
replaces the second ‘a’ with a ‘c’. However, the selection is a random process, and the
second test shows that either occurrence of ‘a’ can be replaced.

Code 23.12 shows the use of the Legalize function. The children genes are created
in line 1 and each is sent to the Legalize function to ensure that all letters exist in each
gene. The cost of the children can be computed and as expected the costs are slightly
lower.

Code 23.12 Using the Legalize function.

1 >>> kids = ga.CrossOver( folks, fcost )
2 >>> for i in range( len( kids ) ):
3         gasort.Legalize( string.ascii_lowercase, kids[i] )
4 >>> kcost = gasort.CostFunction( string.ascii_lowercase, kids )
5 >>> kcost
6 array([ 23., 26., 25., 25., 24., 24., 22., 23., 25., 25.])

23.4.4 Mutation

The Mutate function also has to be changed. In the numerical case the mutation
altered the numerical values. In this case, the mutation swaps the positions of two
letters.
A simple mutation function is shown in Code 23.13. Random locations are
selected and the letters at those positions are swapped.

Code 23.13 The modified Mutate function.

 1 # gasort.py
 2 def Mutate( genes, rate ):
 3     NG = len( genes )
 4     for i in range( NG ):
 5         DM = len( genes[i] )
 6         r = ( np.random.rand( DM ) < rate ).nonzero()[0]
 7         for j in r:
 8             k = int( np.random.rand() * DM )
 9             a = genes[i][k]
10             genes[i][k] = genes[i][j]
11             genes[i][j] = a
12
13 >>> Mutate( folks, 0.05 )

23.4.5 Running the GA for Text Data

All of the components are in place and so it is possible to run this example test. The
DriveSortGA function shown in Code 23.14 performs the complete task. This follows
the same protocol as the numerical case with the inclusion of the Legalize function.

Code 23.14 The DriveSortGA function.

 1 # gasort.py
 2 def DriveSortGA( ):
 3     target = list( string.ascii_lowercase )
 4     alpha = list( string.ascii_lowercase )
 5     folks = Jumble( alpha, 10 )
 6     ok = 1
 7     fcost = CostFunction( target, folks )
 8     while ok:
 9         kids = ga.CrossOver( folks, fcost )
10         for k in range( len( kids ) ):
11             kids[k] = list( kids[k] )
12         for g in kids:
13             Legalize( alpha, g )
14         folks = copy.deepcopy( kids )
15         Mutate( folks, 0.01 )
16         fcost = CostFunction( target, folks )
17         if fcost.min() == 0:
18             ok = 0
19             me = fcost.argmin()
20     return folks[me]
21
22 >>> answ = gasort.DriveSortGA()
23 >>> ''.join( answ )
24 'abcdefghijklmnopqrstuvwxyz'

The final lines show the call to the function and the ensuing results. As seen this
GA has performed the simple task of sorting letters alphabetically. In a real application
the cost function would be replaced to accommodate the user’s task, but the steps shown
in DriveSortGA would be the same.

23.5 Summary

Machine learning encompasses a field in which the program attempts to learn from the
data at hand. There are several algorithms in this field, and one that is widely used in
bioinformatics is the Genetic Algorithm (GA). The GA contains a set of data genes (which
can be vectors, strings, etc.) and through several iterations attempts to modify the genes
to provide an optimal solution. This requires the user to define the metric for measuring
the quality of a solution. The unique quality of a GA is that new genes are constructed by
mating old genes. New genes are generated by copying parts of the older genes. GAs
tend to use many iterations and can be quite costly to run. However, they can provide
solutions that are difficult to obtain using simpler methods.

Problems

1. Create a GA that starts with random DNA strings of length N. Create a cost
function such that the GA will compute the complement of a DNA target string.

2. Given 8 random vectors of length 9, create a GA program that will find the vector
that is perpendicular to the original 8 and also has a length of 1.

3. Consider the parity problem in which the training data is (000:0), (001:1), (010:1),
(011:0), (100:1), (101:0), (110:1), and (111:0), where (x1 x2 x3 : y) is a three-dimensional
input and its associated one-dimensional output. Create a GA that determines
the coefficients a, b, c, d, e, f, g, h in the function y' = a x1 x2 x3 + b x1 x2 +
c x1 x3 + d x2 x3 + e x1 + f x2 + g x3 + h.

4. Given the same data as in the previous problem, create a GA that determines the
coefficients for the functions z = Γ(a x1 + b x2 + c x3) and y' = d x1 + e x2 + f x3 + g z,
where Γ(w) = 1 if w > 0.5 and 0 otherwise.

5. Create a GA that creates a consensus sequence in which the cost is halved if
the GA gene is one of the original training sequences. Use the data from Code 22.16.

6. Given the two sequences AGTCGTAGCA and ACTCTAGGCA, create a GA program
that will provide the best gapped alignment of these two sequences.

Chapter 24

Multiple Sequence Alignment

Aligning two sequences is relatively straightforward. Aligning multiple sequences adds
new complications, and there are two types of approaches. The greedy approach attempts
to find the best pairs of sequences that align and to build on those alignments. The
non-greedy approach attempts to find the best group of alignments. The advantages of the
greedy approach are that the programming is not too complicated and the system runs
fast. The advantage of the non-greedy approach is that the performance is usually better.

24.1 Multiple Alignments

Figure 24.1 is a standard depiction of multiple sequence alignment. There are four se-
quences labeled A, B, C and D. Each one has an associated arrow. Any arrow pointing
to the left means that the complement of the sequence is being used. The position of the
arrows shows the shift necessary to make them align.

Figure 24.1: Aligning sequences.

There are two issues that need to be raised. The first is that some alignments have
disagreements and the second is the issue of using complements.
Consider a case in which sequences A and B are aligned as shown and that this
alignment has good agreement. In the overlapping regions the letters in A match the
letters in B. Now, consider the case of aligning B and C. Again, in the overlapping regions
the two sequences are in agreement. However, there is no guarantee that the segments of A
and C that overlap each other, without also overlapping B, are in agreement. Since there
are repeating and similar sequences throughout a genome, this type of problem is possible.

The second issue is that of complements. In the rest of this chapter complements
will not be used because that would unnecessarily complicate the discussion of aligning
multiple sequences. However, in many applications it is necessary to consider the
complement. In these cases the sequencing machine can provide a sequence but does not
indicate on which DNA strand it resides. Therefore, it is necessary to consider aligning a
sequence or its complement. Once one of these is used the other needs to be removed from
further consideration.
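
When complements must be considered, a helper along these lines can generate the second candidate for each read (the function name and its use here are illustrative, not part of the text's modules):

```python
# Return the reverse complement of a DNA string so that both strands
# can be tried during alignment.
def RevComplement(seq):
    comp = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(comp[b] for b in reversed(seq))

print(RevComplement('AGTC'))   # GACT
```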

24.2 The Greedy Approach

Two types of algorithms will be considered here. The first is a greedy approach and the
second, in Section 24.3 is the non-greedy approach.
In the greedy approach the algorithm will consider all alignment pairs and begin
building the assembly from the best pairs. This approach is faster and less accurate than
the non-greedy algorithm. The best alignments will be joined together to create a contig
which is a contiguous string of DNA.
It is possible that multiple contigs will need to be created during the construction of
the assembly. Consider the alignment shown in Figure 24.2. Sequences A and B strongly
align and create a contig. The next best alignment is C with D; these create a different contig.
The third best alignment is B with C. This can be used to join the two contigs to create
a single assembly as shown.

Figure 24.2: Aligning sequences with strong and weak overlaps.

The greedy approach starts with a comparison of all pairs of sequences. If we had
four sequences then we would compute the following alignments: (s1,s2), (s1,s3), (s1,s4),
(s2,s3), (s2,s4), and (s3,s4). This information can be collected in a triangular matrix,
M,

        | 0   s1,s2   s1,s3   s1,s4 |
    M = | 0     0     s2,s3   s2,s4 |  .        (24.1)
        | 0     0       0     s3,s4 |
        | 0     0       0       0   |

Each element of M keeps the score of the alignment of two sequences. Assuming that a
large score indicates a better match we can find the best of all possible pairings by finding
the largest value in the matrix.
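
This bookkeeping can be sketched as follows, using a toy scoring function in place of a real alignment score (the names Score and seqs are illustrative, not from the text):

```python
import numpy as np

# Fill only the upper triangle of M with pairwise scores, then locate
# the best-scoring pair with argmax and divmod.
def Score(a, b):
    return sum(x == y for x, y in zip(a, b))   # toy: count matching positions

seqs = ['abcd', 'abce', 'xbcd', 'wxyz']
N = len(seqs)
M = np.zeros((N, N), int)
for i in range(N):
    for j in range(i + 1, N):
        M[i, j] = Score(seqs[i], seqs[j])
v, h = divmod(M.argmax(), N)   # row and column of the best pair
print(v, h)
```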

24.2.1 Data

The data used in the examples must have the property of overlapping subsequences. For
now these overlaps will be perfect and the sequences will not have gaps. The function
ChopSeq shown in Code 24.1 receives an input sequence, the desired number of
subsequences, and the length of those subsequences (all of which have the same length).
Most of the segments will be selected at random, so there is no guarantee that the
first or last part of the input sequence will be included. So lines 4 and 5 put the first
and last parts of the input sequence into the list segs. The variable laststart is the last
location in the sequence where a segment can begin. Any location after that would produce
a shorter segment because it reaches the end of the input sequence. The for loop then
extracts segments at random locations. There is no guarantee that every element in the
input sequence will be included in the segments.

Code 24.1 The ChopSeq function.

 1 # aligngreedy.py
 2 def ChopSeq( inseq, nsegs, length ):
 3     segs = []
 4     segs.append( inseq[:length] )
 5     segs.append( inseq[-length:] )
 6     laststart = len( inseq ) - length
 7     for i in range( nsegs-2 ):
 8         r = int( np.random.rand() * laststart )
 9         segs.append( inseq[r:r+length] )
10     return segs

Code 24.2 shows the use of this function. The sequence is created in line 2 and the
segments are extracted in line 3. This will create 10 segments, each of length 8. The
remaining lines show that the first two segments are the beginning and ending of the initial
sequence and the rest are from random locations.
The number of sequences and their lengths depend on the sampling that one desires.
Usually, the minimum is 3-fold sequencing, which means that each element in the input
sequence should appear on average in three segments. Of course, with random selection
some elements will appear in more. In this case the input sequence is 26 elements long. If
3-fold sequencing is desired then the output should have a total of 26 × 3 = 78 elements.
The desire is that each segment have a length of 8 so 10 sequences will be required since
78/8 = 9.75. Better performance is achieved if the value of n in n-fold sequencing is
increased. If the desire is to have 6-fold sequencing then 20 segments of length 8 will be
needed.
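
This arithmetic can be captured in a small helper (the function name is an illustration, not part of aligngreedy.py):

```python
import math

# Number of equal-length segments needed to reach n-fold coverage.
def SegmentsNeeded(seqlen, nfold, seglen):
    return math.ceil(seqlen * nfold / seglen)

print(SegmentsNeeded(26, 3, 8))   # 10, since 78/8 = 9.75 rounds up
print(SegmentsNeeded(26, 6, 8))   # 20
```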
The final comment on data is that each segment will need an identifier. In real
applications this could be the name of the gene in the sequence or some name that identifies
which experiment produced the data. In this case, the data will be faked and therefore

Code 24.2 Using the ChopSeq function.

 1 >>> import aligngreedy as ang
 2 >>> seq = 'abcdefghijklmnopqrstuvwxyz'
 3 >>> segs = ang.ChopSeq( seq, 10, 8 )
 4 >>> segs[0]
 5 'abcdefgh'
 6 >>> segs[1]
 7 'stuvwxyz'
 8 >>> segs[2]
 9 'cdefghij'
10 >>> segs[3]
11 'pqrstuvw'

the names of the sequences will simply be ’s0’, ’s1’, ’s2’, etc.

24.2.2 Theory of the Assembly

In the greedy approach pairs of alignments will be considered. Consider a single pair which
has two sequences designated as sa and sb. The matrix M is used to determine which
sequences are to be aligned. The maximum value in M corresponds to two sequences and
these are then considered to be sa and sb.
There are four choices which are listed below.

1. If neither sa or sb exist in any contig then a new contig is created.


2. If sa belongs in a contig but sb does not then sb is added to the contig that contains
sa. If sb belongs to a contig and sa does not then add sa to the contig with sb.
3. If sa and sb belong to different contigs then the two contigs are joined.
4. If sa and sb belong to the same contig then nothing is changed.

Initially, there are no contigs and so only the first choice is possible. Then as other
pairs of alignments are considered the other choices come into play.
The process repeats until all elements in M that are above a user specified threshold
are considered. There is no guarantee that all of the contigs will be joined together. It
is possible that at least one element in the input string does not appear in any segment.
In that case the two contigs will not overlap and so the final assembly includes multiple
contigs.
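
The four choices above can be sketched with simplified bookkeeping in which each contig is just a set of sequence names (the function and variable names here are illustrative; the real functions in aligngreedy.py also track the shifted strings):

```python
# Simplified four-way decision for a pair of sequence ids.
def PlacePair(contigs, ida, idb):
    loca = locb = None
    for k, c in enumerate(contigs):
        if ida in c:
            loca = k
        if idb in c:
            locb = k
    if loca is None and locb is None:
        contigs.append({ida, idb})        # choice 1: start a new contig
    elif locb is None:
        contigs[loca].add(idb)            # choice 2: add idb to ida's contig
    elif loca is None:
        contigs[locb].add(ida)            # choice 2: add ida to idb's contig
    elif loca != locb:
        contigs[loca] |= contigs[locb]    # choice 3: join the two contigs
        del contigs[locb]
    # choice 4: both in the same contig -- nothing to do

contigs = []
PlacePair(contigs, 's2', 's4')   # new contig
PlacePair(contigs, 's1', 's3')   # second contig
PlacePair(contigs, 's0', 's2')   # add to the first contig
PlacePair(contigs, 's3', 's4')   # joins the two contigs
print(sorted(contigs[0]))        # ['s0', 's1', 's2', 's3', 's4']
```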

24.2.3 An Intricate Example

This example follows all of the steps necessary to make an assembly using an amino acid
string. There are several functions here which are not explored in detail but rather are

just discussed and then used. Readers interested in how the functions work are invited to
explore the functions in aligngreedy.py.
This example is divided into sections. First there is the collection of the data, second
is the computation of pairwise alignments, third is the creation of initial contigs, fourth
is the process of adding sequences to existing contigs, fifth is the joining of contigs and
finally there is a driver function that can be called to create an assembly.

24.2.3.1 Data

For this example a protein from a bacterium is used. Code 24.3 shows the necessary steps.
The file is read in line 3 and one of the proteins is extracted in line 5. This particular
protein has 185 amino acids.

Code 24.3 Extracting a protein.

1 >>> import genbank as gb
2 >>> filename = 'data/AB001339.gb.txt'
3 >>> gbdata = gb.ReadFile( filename )
4 >>> klocs = gb.FindKeywordLocs( gbdata )
5 >>> prot = gb.Translation( gbdata, klocs[1] )
6 >>> len( prot )
7 185

The next step is to chop this sequence into subsequences such that overlaps are
common. Line 2 in Code 24.4 creates 8 substrings that are 50 characters long. Thus, each
string is about one-fourth of the original string. This is an uncommonly long segment for
such a string, but it facilitates the discussion of the example. Line 1 sets an initial random
seed which is used only if the reader wishes to replicate the results in the following codes.
If this line is changed or eliminated then the cut locations in creating the substrings will
be different and the results will not mirror the following examples.

Code 24.4 Creating the segments.

1 >>> np.random.seed( 72465 )
2 >>> segs = ang.ChopSeq( prot, 8, 50 )
3 >>> ids = [ 's0', 's1', 's2', 's3', 's4', 's5', 's6', 's7' ]

24.2.3.2 Pairwise Alignments

The greedy approach relies on the pairwise alignments of the sequences. Thus, all possible
pairs are aligned and scored. For each alignment there are two values that are kept.

The first is the score of the alignment and the second is the shift required to make this
alignment. These values are returned as two matrices.
The FastMat function is used in line 1 of Code 24.5 to compute the alignment
of all possible pairs. Since there are 8 sequences the returned matrix is 8 × 8. The
matrix M contains the scores of the alignments using the BruteForceSlide function. It
is not necessary to align a sequence with itself and so the diagonal elements are 0. The
alignment score for sequence A with sequence B is the same as sequence B with sequence
A and therefore only half of the matrix is populated. As seen some of the scores are quite
high (above 90) and many are very low. The sequence pairs that had overlap create a
high score and those that had no overlap create low scores. The user must decide what
constitutes a valid alignment, which is the same as setting a threshold of acceptance. If the threshold
is too high then sequences with some overlap will be discarded. If the threshold is too low
then the program will align sequences with bad matches. The threshold value is dependent
on the sequence length, the scoring algorithm and the substitution matrix that is used.
Commonly, a threshold of less than half of the maximum is sufficient. In this case the
threshold is set at γ = 50. It should be noted that the selection of the threshold is not
critical. The same results can be obtained with a higher threshold.

Code 24.5 Pairwise alignments.

 1 >>> M, L = ang.FastMat( segs, blosum.BLOSUM50, blosum.PBET )
 2 >>> M.max()
 3 331
 4 >>> M
 5 array([[  0,  20, 260,   5, 255,  15,   7,  15],
 6        [  0,   0,  28, 312,  23,   4,  10,  91],
 7        [  0,   0,   0,  13, 331,  25,  57,  13],
 8        [  0,   0,   0,   0,  15,  13,  12, 154],
 9        [  0,   0,   0,   0,   0,  24,  62,  18],
10        [  0,   0,   0,   0,   0,   0, 254, 119],
11        [  0,   0,   0,   0,   0,   0,   0,  35],
12        [  0,   0,   0,   0,   0,   0,   0,   0]])
13 >>> L[:4,:4]
14 array([[ 0, 28, 61, 19],
15        [ 0,  0, 83, 41],
16        [ 0,  0,  0,  8],
17        [ 0,  0,  0,  0]])

It will be necessary to align pairs of sequences as the contigs are constructed. Thus,
it is prudent to store the shifts required to achieve the alignment scores. These are stored
in matrix L of which a few of the values are shown here. These will be used later.
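
A minimal sketch of how a stored shift can be applied is shown below; this simplified ShiftPair is a stand-in for the actual ShiftedSeqs function, and simply prefixes the sequence that starts later with periods so that matching positions line up.

```python
# Prefix one of the two sequences with periods according to the shift.
# A positive shift pushes sb to the right; a negative shift pushes sa.
def ShiftPair(sa, sb, shift):
    if shift >= 0:
        return sa, '.' * shift + sb
    return '.' * (-shift) + sa, sb

a, b = ShiftPair('LLLTSNGK', 'LLTSNGKL', 1)
print(a)   # LLLTSNGK
print(b)   # .LLTSNGKL
```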

24.2.3.3 Initial Contigs

The assembly will consist of one or more contigs. In Python the assembly will be a list
of contigs. Each contig is itself a list which contains information about each string in the
contig. Each of these representations is a list of two items: the string name and the shifted
string.
Line 1 in Code 24.6 creates an empty list that will soon be populated. In the greedy
approach the best alignments are considered first. These alignments have the largest values
in the matrix M. Line 2 uses the divmod function to find the location of the largest
value in the matrix (see Code 11.23). In this example, the largest value is at M[2,4] which
indicates that the sequences segs[2] and segs[4] are the two that align the best in this
data set. The value of M[2,4] is 331 which indeed is the largest value in the matrix.

Code 24.6 Starting the assembly.

1 >>> smb = []
2 >>> v, h = divmod( M.argmax(), 8 )
3 >>> v, h
4 (2, 4)

The function ShiftedSeqs returns two sequences after alignment. Basically, it puts
the correct number of periods in front of one of the sequences to align them. This correct
number is based on the lengths of the sequences and the shift value stored in the L matrix.
Code 24.7 shows this first alignment. As this is the highest scoring alignment it is expected
that the overlap is significant. As seen in line 6 only one period was required to create the
alignment.

Code 24.7 Using the ShiftedSeqs function.

1 >>> import aligngreedy as ang
2 >>> sa, sb = ang.ShiftedSeqs( segs[v], segs[h], L[v,h] )
3 >>> sa
4 'LLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYP'
5 >>> sb
6 '.LLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYPT'

This is the first pair of sequences considered and therefore the only possible action
is to create a new contig. This is accomplished by the NewContig function. Line 1 in
Code 24.8 shows the use of this function. It receives the assembly smb, the two aligned
sequences, and their names. This will create a list with two items. Each of these items is
a list which contains the name and aligned sequence.
The function ShowContigs is used to display the assembly. Currently, the
assembly consists of a single contig. If there is more than one contig then a newline will

Code 24.8 Using the NewContig function.

1 >>> ang.NewContig( smb, sa, sb, ids[v], ids[h] )
2 >>> ang.ShowContigs( smb )
3 s2  LLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYP
4 s4  .LLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYP

separate them in the display. This function shows the first 50 bases in the alignment.
The function can receive a second argument that will start the display at a new location.
Thus, ang.ShowContigs(smb, 90) will show 50 bases starting at location 90.
This completes the processing of this best pairwise alignment. The next step is to
consider the second best pairwise alignment. To find this alignment the largest value in
M is replaced with a 0. This is shown in line 1 of Code 24.9. Now the largest element in
M is indicative of the second best pairwise alignment. This value is 312, which is still well
above the threshold of γ = 50. The location of this second best alignment is 1,3. This
indicates that this alignment uses two strings that are not yet in the assembly.

Code 24.9 Finding the next largest element.

1 >>> M[v,h] = 0
2 >>> M.max()
3 312
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (1, 3)

For each pairwise alignment there are four choices as listed
in Section 24.2.2. Since neither of the sequences is in any other contig the choice is to
create a new contig. This is shown in Code 24.10. The aligned sequences are created in
line 1 and placed in a new contig in line 2. Line 3 calls the ShowContigs function which
now displays the two contigs separated by a new line.

Code 24.10 Creating a second contig.

1 >>> sa, sb = ang.ShiftedSeqs( segs[v], segs[h], L[v,h] )
2 >>> ang.NewContig( smb, sa, sb, ids[v], ids[h] )
3 >>> ang.ShowContigs( smb )
4 s2  LLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYP
5 s4  .LLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYP
6
7 s1  .........GKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP
8 s3  TGRSPQQGKGKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP

24.2.3.4 Adding to a Contig

The process repeats as shown in Code 24.11. Line 1 removes the largest value and the
next largest value is found to be above the threshold. The location is at 0,2. In this case
one of the sequences is already in a contig and so a new decision is required. Instead of
creating a new contig the task is to add segs[0] to the contig that contains segs[2].
There are a couple of steps required to do this. First, the location of segs[2] is required:
both which contig the sequence is in and its position within that contig. After that,
the alignment of the two sequences will have to be adjusted to also align with the
sequences already in the contig.

Code 24.11 Determining that the action is to add to a contig.

1 >>> M[v,h] = 0
2 >>> M.max()
3 260
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (0, 2)
7 >>> sa, sb = ang.ShiftedSeqs( segs[v], segs[h], L[v,h] )
8 >>> ang.Finder( smb, ids[2] )
9 (0, 0)

The Finder function determines the location of a sequence within a contig. This
is shown in lines 8 and 9. The function returns two values. The first is the contig index and
the second is the location within that contig. In this case, segs[2] is located in the first
contig and is the first sequence in that contig.
The next step is to add segs[0] to the first contig. In order to do this the sequence
sb (which is segs[2] aligned with segs[0]) must be synchronized with the copy of
segs[2] which is already in the contig. In this case, several periods are required at the
beginning of sb in order to align it with sa. Here sb is segs[2] with prefix periods, while the
first sequence in the contig is segs[2] without any prefix periods. In order to align segs[0]
with all sequences in the first contig it is necessary to insert prefix periods into all items
in the first contig such that the first sequence aligns with sb.
Code 24.12 shows this process. The Add2Contig function is called in line 1. This
receives the assembly, the sequence that is already in a contig, the sequence that is to be
added to the contig, the name of that sequence, and the two values returned by Finder.
This will align the new sequence with the contig. As seen in the display, it was necessary
to add several periods in front of all of the items previously in the first contig in order to
align them with the new sequence.
In this case all of the items in the contig needed prefix periods. There is a second case that
also has to be considered by the Add2Contig function. In the future it may be necessary
to add a new sequence to this contig because it aligned with segs[4]. The sequence in the

339
Code 24.12 Using the Add2Contig function.

1 >>> ang . Add2Contig ( smb , sb , sa , ids [ v ] , 0 , 0 )


2 >>> ang . ShowContigs ( smb )
3 s2 ...........LLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
4 s4 ............LLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
5 s0 MGRLDQDSEGLLLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
6

7 s1 .........GKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP
8 s3 TGRSPQQGKGKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP

contig already has several prefix periods and these will need to be added to the incoming
sequence to align it with the rest of the contig. The function Add2Contig considers all
of the necessary prefix additions to make the alignment valid.
The next largest value in M indicates that segs[0] aligns well with segs[4]. These
two are already in the same contig and so nothing needs to be done.

Code 24.13 Do nothing.

1 >>> M [v , h ] = 0
2 >>> M . max ()
3 255
4 >>> v , h = divmod ( M . argmax () , 8 )
5 >>> v,h
6 (0 , 4)

The process continues. New contigs are created or sequences are added to contigs
as necessary. Code 24.14 shows that the next best pairwise alignment is for sequences
segs[5] and segs[6]. Neither of these are in a contig and so a new contig is created.

Code 24.14 The third contig.

1 >>> M [v , h ] = 0
2 >>> M . max ()
3 254
4 >>> v , h = divmod ( M . argmax () , 8 )
5 >>> v,h
6 (5 , 6)
7 >>> sa , sb = ang . ShiftedSeqs ( segs [ v ] , segs [ h ] , L [v , h ] )
8 >>> ang . NewContig ( smb , sa , sb , ids [ v ] , ids [ h ] )

Code 24.15 shows that the next best alignment is for segs[3] and segs[7]. The
segs[3] is already in a contig and the Finder program indicates that it is the second

item in the second contig. So, Add2Contig adds segs[7] to this contig.

Code 24.15 Adding to a contig.

1 >>> M [v , h ] = 0
2 >>> M . max ()
3 154
4 >>> v , h = divmod ( M . argmax () , 8 )
5 >>> v,h
6 (3 , 7)
7 >>> sa , sb = ang . ShiftedSeqs ( segs [ v ] , segs [ h ] , L [v , h ] )
8 >>> ang . Finder ( smb , ids [3] )
9 (1 , 1)
10 >>> ang . Add2Contig ( smb , sa , sb , ids [ h ] , 1 ,1 )

The call to Add2Contig needs a little attention. The second argument to this
function is the sequence that is already in a contig. This is either of sa or sb. In Code
24.12 this was sb, but in Code 24.15 this is sa. The third argument is the sequence that
is to be added to the contig and the fourth argument is the name of that sequence.

24.2.3.5 Joining Contigs

The next step in the process is shown in Code 24.16. This pairwise alignment mates segs[5]
and segs[7]. In this case both of these are already in separate contigs. As seen in lines 7
and 8, segs[5] is in the third contig in the first position. As seen in lines 9 and 10,
segs[7] is in the second contig in the third position.

Code 24.16 Locating contigs.

1 >>> M [v , h ] = 0
2 >>> M . max ()
3 119
4 >>> v , h = divmod ( M . argmax () , 8 )
5 >>> v,h
6 (5 , 7)
7 >>> ang . Finder ( smb , ids [ v ] )
8 (2 , 0)
9 >>> ang . Finder ( smb , ids [ h ] )
10 (1 , 2)
11 >>> sa , sb = ang . ShiftedSeqs ( segs [ v ] , segs [ h ] , L [v , h ] )

The decision here is to join the two contigs. Probably one of the contigs will need to
be shifted to align with the other. This will require that a set of prefix periods be added

to all of the sequences in one of the contigs. Once aligned a new contig is created from
both of these contigs and the old contigs are destroyed. Thus, this new contig will be the
last one in the assembly.
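The mechanics just described can be sketched as follows. This is an illustration only; the module's JoinContigs takes a different argument list, and the shift value here is assumed to be known:

```python
def join_contigs(smb, c1, c2, shift):
    # Left-pad the sequences of contig c2 by `shift` periods,
    # merge it with contig c1, delete the old contigs, and
    # append the merged contig at the end of the assembly.
    padded = [(nm, '.' * shift + sq) for nm, sq in smb[c2]]
    merged = smb[c1] + padded
    for c in sorted((c1, c2), reverse=True):   # delete high index first
        del smb[c]
    smb.append(merged)

smb = [[('s5', 'AKII')], [('s3', 'TGRS')]]
join_contigs(smb, 0, 1, 4)
print(smb)   # [[('s5', 'AKII'), ('s3', '....TGRS')]]
```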
This process is shown in Code 24.17. This uses the JoinContigs function. It
receives several arguments. The first is the assembly which will be modified. The next
two are the contig numbers from the returns of the Finder function. The next two are
the locations in those contigs. The final two arguments are the aligned sequences.

Code 24.17 Joining contigs.

1 >>> ang . JoinContigs ( smb , 2 , 1 , 0 , 2 , sa , sb )


2 >>> ang . ShowContigs ( smb )
3 s2 ...........LLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
4 s4 ............LLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
5 s0 MGRLDQDSEGLLLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
6

7 s5 .............AKIITEPDFPPRNPPIRYRASIPTSWLSITLTEGRNR
8 s6 GITFADYPTRPAIAKIITEPDFPPRNPPIRYRASIPTSWLSITLTEGRNR
9 s1 ..................................................
10 s3 ..................................................
11 s7 .............................................EGRNR

Line 2 uses the ShowContigs function to show the assembly. This shows only the
first 50 letters in each string. In this case some of the strings have been shifted by more
than 50 spaces and so they appear only as periods. The ShowContigs function also has
an optional second argument which is the location at which the display should begin. This
is shown in Code 24.18. That display starts at location 40, so the first 10 elements of each
string should match the last 10 in the previous display. In this window the content of
some of the other strings can be seen.

Code 24.18 Showing a latter portion of the assembly.

1 >>> ang . ShowContigs ( smb ,40)


2 s2 GSPTDEDLEPLRRGITFADYP
3 s4 GSPTDEDLEPLRRGITFADYPT
4 s0 GSPTDEDLEP
5

6 s5 SITLTEGRNRQVRRMTAAVGFPT
7 s6 SITLTEGRNR
8 s1 ..........................................GKSAATWC
9 s3 .................................TGRSPQQGKGKSAATWC
10 s7 .....EGRNRQVRRMTAAVGFPTLRLVRVQIQVTGRSPQQGKGKSAATWC

342
24.2.3.6 The Assembly

Now all four possible decisions have been considered. Each time a new pairwise alignment
is considered the choice is to create a new contig, add to a contig, join contigs or do
nothing.
There are a few more housekeeping details necessary to make the full assembly.
The first is that the process should stop once the remaining values of M are below a
user-defined threshold. As seen in this example, the value of each considered alignment
is less than the previous one. Eventually, a pairwise alignment will be considered whose
score is below the threshold. This indicates that the pairwise alignment is poor and
should not be included in the assembly. At that juncture the process is complete.
The second item is that there is no guarantee that all of the segments were used in
the alignment. Thus, the use of each item needs to be tracked. Those segments which are
not in any contig still need to be included in the assembly. Each of these sequences is
placed in its own contig and then appended to the end of the assembly.
The function Assemble is shown in Code 24.19. This function receives a list of the
sequences names, the list of the sequences, the substitution matrix and its alphabet, and
the user defined threshold. Line 3 creates a vector of 0’s and the length of this vector is the
number of sequences. When a sequence is placed in a contig the corresponding location in
this vector is set to 1 (line 24). At the end, any 0’s in this vector indicate that a sequence
was not used in any contig.
Line 4 creates the M and L matrices. The best location in M is found in line 9. The
Finder function is called twice to determine if either sequence is in a contig. If Finder
returns (-1,-1) then the sequence was not in a contig. The sequences are aligned in line
14. Then there are four if statements that consider the possible choices. Lines 18 and 20
both call Add2Contig but the order of the inputs are different. The first one is used if
sa is found in a contig and the second is used if sb is found in a contig. Line 22 is used
if contigs need to be joined. If the current value of M[v,h] is below threshold then the
process exits the while loop. The final part in lines 27 through 29 is to create contigs
with single sequences for all of those sequences that were not used in any contig.
The call to Assemble is shown in Code 24.20. The names and the sequences have
been previously created. In this case, the alignment uses the BLOSUM50 matrix and its
alphabet. The user threshold is set to 50. The first 50 elements are
shown. In this case all sequences are used in the assembly.
The module aligngreedy.py has a second function named AssembleML which per-
forms the same task except that the matrices M and L are computed outside of the
function. The reason is that creating these two matrices is by far the most time consum-
ing part of the computation. If the user wishes to try several assemblies (perhaps with
different threshold values) then it is prudent that the time consuming computation not be
repeated.

343
Code 24.19 The Assemble function.
1 # aligngreedy . py
2 def Assemble ( fnms , seqs , submat , abet , gamma = 500 ) :
3 used = np . zeros ( len ( fnms ) )
4 M , L = FastMat ( seqs , submat , abet )
5 ok = 1
6 smb = []
7 nseqs = len ( seqs )
8 while ok :
9 v , h = divmod ( M . argmax () , nseqs )
10 if M [v , h ] >= gamma :
11 vnum , vseqno = Finder ( smb , fnms [ v ] )
12 hnum , hseqno = Finder ( smb , fnms [ h ] )
13 print ( M [v , h ] , v , h )
14 s1 , s2 = ShiftedSeqs ( seqs [ v ] , seqs [ h ] , L [v , h ] )
15 if vnum == -1 and hnum == -1:
16 NewContig ( smb , s1 , s2 , fnms [ v ] , fnms [ h ] )
17 if vnum != -1 and hnum == -1:
18 Add2Contig ( smb , s1 , s2 , fnms [ h ] , vnum , vseqno )
19 if vnum == -1 and hnum != -1:
20 Add2Contig ( smb , s2 , s1 , fnms [ v ] , hnum , hseqno )
21 if vnum != -1 and hnum != -1 and vnum != hnum :
22 JoinContigs ( smb , vnum , hnum , vseqno , hseqno , s1 , s2 )
23 M [v , h ] = 0
24 used [ v ] = used [ h ] = 1
25 else :
26 ok = 0
27 notused = (1-used ) . nonzero () [0]
28 for i in notused :
29 smb . append ( [( fnms [ i ] , seqs [ i ]) ] )
30 return smb

344
Code 24.20 Running the assembly.

1 >>> smb = ang . Assemble ( ids , segs , blosum . BLOSUM50 , blosum . PBET , 50 )
2 >>> ang . ShowContigs ( smb )
3 s2 ...........LLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
4 s4 ............LLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
5 s0 MGRLDQDSEGLLLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEP
6 s5 ..................................................
7 s6 ..................................................
8 s1 ..................................................
9 s3 ..................................................
10 s7 ..................................................

24.3 The Non-Greedy Approach

The greedy approach is based on finding the best pairs of alignments. While there is
some logic to this approach it does not necessarily find the best alignment. The non-
greedy approach only scores the total alignment and does not attempt to find the best
pairs of alignments. There are many different non-greedy approaches of which only one is
presented here.
The example approach uses a genetic algorithm (GA) to create several sample assem-
blies and then optimizes by creating new assemblies from the best of the older assemblies.
Each gene creates an assembly and each assembly contains multiple contigs. Each contig
is used to generate a consensus sequence. The assembly is converted to a catsequence
which is the concatenation of the consensus sequences. The goal in this case is to find the
assembly that creates the shortest catsequence, and thus the cost of a gene is the length
of the catsequence that it eventually generates.
The data for this system is generated as in the greedy case. Code 24.21 reviews the
commands needed to generate the data for this section.

24.3.1 Creating Genes

The gene for the GA needs to encode a method by which an assembly is created. In
the greedy case the assembly was created by considering pairs of sequence alignments in
order of their alignment score. In the non-greedy case the use of alignment scores for
pairs of sequences is not used. Rather an assembly is created by a random sequence of
alignment pairs. The matrix M contains the scores for the alignments, and in this case its
sole purpose is to provide a list of possible alignment pairs: these are the elements in M
which are above a small threshold. Code 24.22 uses the function BestPairs which creates
a list of all elements in M that are above a threshold γ. Each entry in the list is the v, h

345
Code 24.21 The commands for an assembly.

1 >>> import genbank as gbk


2 >>> import aligngreedy as greedy
3 >>> import blosum
4 >>> data = gbk . ReadFile ( ' data / XM_001326205 . gb . txt ' )
5 >>> klocs = gbk . FindKeywordLocs ( data )
6 >>> p1 = gbk . Translation ( data , klocs [0])
7 >>> chops = greedy . ChopSeq ( p1 , 15 , 50 )
8 >>> M , L = greedy . FastMat ( chops , blosum . BLOSUM50 , blosum . PBET )

from the M[v,h] locations that qualify. In this case the data generated 90 elements in M
that were above the threshold of 5. The first ten of these are shown.

Code 24.22 Using the BestPairs function.

1 >>> import nongreedy as ngd


2 >>> hits = ngd . BestPairs ( M , 5 )
3 >>> len ( hits )
4 90
5 >>> hits [:10]
6 [(5 , 9) , (6 , 14) , (12 , 13) , (0 , 7) , (2 , 8) , (8 , 13) , (2 , 4) , (1 , 14) , (1 , 6) , (8 , 12) ]

This particular list is in order according to the magnitude of the values in M.


Basically, these are the sequence pairs that would be extracted in each pass of the loop
inside Code 24.19. However, a rearrangement of these pairs can also be used to create an
assembly. Code 24.23 shows the greedy assembly using this method. This function creates
an assembly by considering each alignment pair in a prescribed order. A different order
of the same alignment pairs creates a different assembly.
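The source of BestPairs is not shown; given the description above, a sketch that returns the qualifying (v, h) locations in descending score order might look like:

```python
import numpy as np

def BestPairs(M, gamma):
    # Locations with M[v, h] > gamma, highest score first.
    v, h = np.nonzero(M > gamma)
    order = np.argsort(-M[v, h])    # descending by score
    return [(int(v[i]), int(h[i])) for i in order]

M = np.array([[0, 9, 2],
              [0, 0, 7],
              [0, 0, 0]])
print(BestPairs(M, 5))   # [(0, 1), (1, 2)]
```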
Thus, the gene in this case is merely the order in which the pairs of sequences are
considered in building an assembly. The assembly, however, still needs to be converted to
a catsequence. This is accomplished by converting each contig to a consensus sequence
as shown in Figure 24.3. The letters in column k of a contig are used to create an element
of the consensus sequence cs[k].
In real applications there is not a complete agreement in each column, as there can
be more than one letter in a column. Often there is one letter that is seen considerably
more often than the others. This is the consensus letter. The ConsensusCol function
receives a list of characters from a single column of a contig, stg. It will extract the
consensus character from this list, excluding the periods, as shown in Code 24.24.
A contig has several columns, and the consensus sequence is created by the function

346
Code 24.23 Showing two parts of the assembly.

1 >>> ids = []
2 >>> for i in range ( 15 ) :
3 ids . append ( ' s ' + str ( i ) )
4 >>> smb = ngd . Gene2Assembly ( range (90) , hits , chops , ids , L )
5 >>> greedy . ShowContigs ( smb )
6 s6 ..................................................
7 s14 ..................................................
8 s1 ..................................................
9 s3 .............................................ARYIV
10 s5 .REELVRKEIQLANITEFDFCFPTPLFFLNYFLRISGQTQESMLFARYIV
11 s9 NREELVRKEIQLANITEFDFCFPTPLFFLNYFLRISGQTQESMLFARYIV
12 s2 ..................................................
13 s8 ..................................................
14 s12 ..................................................
15 s13 ..................................................
16 s4 ..................................................
17 s0 ..................................................
18 s7 ..................................................
19 s11 ..................................................
20 s10 ..................................................
21

22 >>> greedy . ShowContigs ( smb ,50)


23 s6 ..........................VVYSETPWTEDLMMFSRYSLKDLS
24 s14 .............................SETPWTEDLMMFSRYSLKDLS
25 s1 ...............................................DLS
26 s3 EMCLTSEKFNDVKASAIAATAVVIMRVVYSETPWTEDLMMFSRYS
27 s5 E
28 s9
29 s2 ..................................................
30 s8 ..................................................
31 s12 ..................................................
32 s13 ..................................................
33 s4 ..................................................
34 s0 ..................................................
35 s7 ..................................................
36 s11 ..................................................
37 s10 ..................................................

347
Figure 24.3: Aligning sequences for a consensus.

Code 24.24 The ConsensusCol function.


1 >>> a = [ ' a ' , ' b ' , ' c ' , ' a ' , ' d ' , ' b ' , ' b ' ]
2 >>> ConsensusCol ( a )
3 'b '

CatSeq called in Code 24.25. The next step in this process is to realize that an assembly
contains several contigs, and these do not overlap. For the purposes of scoring the assembly
a single long string is created from all of the contigs. The non-overlapping contigs are
concatenated into sq. The example creates a single string from the assembly generated in
Code 24.23.

Code 24.25 The CatSeq function.

1 >>> sq = ngd . CatSeq ( smb )


2 >>> sq
3 ' NREELVRKEIQLANITEFDFCFPTPLFFLNYFLRISGQTQESMLFARYIVEMCLTSE
4 KFNDVKASAIAATAVVIMRVVYSETPWTEDLMMFSRYSLKDLSSNIRDAYEILTDLER
5 EESTFIRLKYGSDTYQNVAEFEIPPVIFKQAITNSQKGQMIDWIDRLHYKSQCCTTSL
6 YRAIGIFNRAINLTNITPDSMRQFAAASLLIASKMEDLQPVEIDLQILTSPKKIQDND
7 VQIKSEDVFVHTEMQIGDPTNIQDVIEYENI '
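The body of ConsensusCol is not listed; a minimal sketch of what it computes (the most common letter in a column, ignoring periods) can be written with collections.Counter:

```python
from collections import Counter

def consensus_col(chars):
    # Most common letter in a contig column, excluding periods.
    counts = Counter(c for c in chars if c != '.')
    return counts.most_common(1)[0][0]

print(consensus_col(['a', 'b', 'c', 'a', 'd', 'b', 'b']))   # b
```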

A gene is merely an ordering of the sequence pairs used to create an assembly. Code
24.26 uses the function InitGA to create random arrangements of the sequence pairs. In
this example each gene is a list of the numbers from 0 to 89 in a random arrangement.

24.3.2 Steps in the Genetic Algorithm

The cost of a gene is the length of the consensus sequence that it creates. Code 24.27
shows the function CostAllGenes which considers each gene in the for loop. Each gene
creates an assembly smb which in turn creates a catsequence cseq. The cost of this
sequence is its length. In this example there are 10 genes and the costs they generated are
shown.
The crossover function is not changed from the original GA program. Code 24.28

348
Code 24.26 The InitGA function.
1 >>> import ga
2 # nongreedy . py
3 def InitGA ( pairs , Ngenes ) :
4 # pairs from BestPairs
5 # Ngenes = desired number of GA genes
6 work = np . arange ( len ( pairs ) )
7 genes = []
8 for i in range ( Ngenes ) :
9 np . random . shuffle ( work )
10 genes . append ( copy . deepcopy ( work ) )
11 return genes
12

13 >>> folks = ngd . InitGA ( hits , 10 )

Code 24.27 The CostAllGenes function.


1 # nongreedy . py
2 def CostAllGenes ( genes , pairs , seqs , seqnames , L ) :
3 NG = len ( genes )
4 cost = np . zeros ( NG )
5 for i in range ( NG ) :
6 smb = Gene2Assembly ( genes [ i ] , pairs , seqs , seqnames , L )
7 cseq = CatSeq ( smb )
8 cost [ i ] = len ( cseq )
9 return cost
10

11 >>> fcost = ngd . CostAllGenes ( folks , hits , chops , ids , L )


12 >>> fcost
13 array ([ 194. , 173. , 162. , 229. , 234. , 256. , 193. ,
238. , 218. , 177.])

349
shows the calls to create new genes and to compute their cost. The problem with the new
genes is that they may not contain all of the pairings and they may contain two copies of
other pairings.

Code 24.28 Using the CostAllGenes function.

1 >>> import ga
2 >>> kids = ga . CrossOver ( folks , fcost )
3 >>> kcost = ngd . CostAllGenes ( kids , hits , chops , ids , L )
4 >>> kcost
5 array ([ 238. , 173. , 173. , 162. , 162. , 177. , 188. ,
235. , 160. , 185.])

A gene should contain each pair of sequences from the original list, and these new
genes are not yet correct. The function FixGene considers a gene, finds the elements that
are duplicates, and replaces one of each pair of duplicates with an element that is missing.
Each child gene must be processed, and thus a loop is needed to process all of the children,
as shown in Code 24.29. After this step all GA genes have only one instance of each
pairing. The cost of the children can now be computed.
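The repair described above can be sketched as follows; FixGene's actual implementation in the module may differ:

```python
def FixGene(gene, allvals):
    # Replace duplicated entries in a crossover child with
    # whichever values from allvals went missing.
    missing = [v for v in allvals if v not in gene]
    seen, fixed = set(), []
    for v in gene:
        if v in seen:
            fixed.append(missing.pop(0))   # swap in a missing value
        else:
            seen.add(v)
            fixed.append(v)
    return fixed

print(FixGene([3, 1, 3, 0], [0, 1, 2, 3]))   # [3, 1, 2, 0]
```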

Code 24.29 Using the CostAllGenes function for the offspring.

1 >>> for i in range ( len ( kids ) ) :


2 kids [ i ] = ngd . FixGene ( kids [ i ] , np . arange ( len ( hits ) ) )
3 >>> kcost = ngd . CostAllGenes ( kids , hits , chops , ids , L )
4 >>> kcost
5 array ([ 238. , 173. , 154. , 219. , 149. , 158. , 204. ,
173. , 131. , 196.])

The mutation stage uses a function named SwapMutate that swaps the elements
in the GA genes much like in the alphabet program above.
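For permutation genes, a swap mutation exchanges two randomly chosen positions, which keeps the gene a valid permutation. A sketch, assuming the mutation rate is applied per gene:

```python
import numpy as np

def SwapMutate(genes, rate):
    # With probability `rate`, swap two random positions in a gene.
    for g in genes:
        if np.random.rand() < rate:
            i, j = np.random.randint(0, len(g), 2)
            g[i], g[j] = g[j], g[i]

gene = list(range(10))
SwapMutate([gene], 1.0)                  # rate 1.0 forces a swap
print(sorted(gene) == list(range(10)))   # True: still a permutation
```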

24.3.3 The Test Run

All of the parts are now in place to perform the GA. Code 24.30 shows the function
RunGA which drives the GA process. It settles rather quickly on an assembly that
creates a consensus sequence that has a length of 120.
The results of the non-greedy test can be compared to the greedy approach. Code 24.31
shows the steps used to create a greedy consensus. The length of the greedy consensus is
281 while the length of the non-greedy result is only 120. Obviously, the non-greedy
approach significantly outperformed the greedy approach. The cost of this improvement,
though, is that non-greedy approaches are usually computationally expensive.

350
Code 24.30 The RunGA function.
1 # nongreedy . py
2 def RunGA ( hits , seqs , seqnames , L ) :
3 NH = len ( hits )
4 folks = InitGA ( hits , 10 )
5 fcost = CostAllGenes ( folks , hits , seqs , seqnames , L )
6 print ( fcost . min () , fcost . argmin () )
7 for i in range ( 10 ) :
8 kids = ga . CrossOver ( folks , fcost )
9 for i in range ( len ( kids ) ) :
10 kids [ i ] = FixGene ( kids [ i ] , np . arange ( NH ) )
11 kcost = CostAllGenes ( kids , hits , seqs , seqnames , L )
12 ga . Feud ( folks , kids , fcost , kcost )
13 SwapMutate ( folks , 0.03 )
14 fcost = CostAllGenes ( folks , hits , seqs , seqnames , L )
15 print ( fcost . min () , fcost . argmin () )
16 return folks [ fcost . argmin () ]
17

18 >>> g = ngd . RunGA ( hits , chops , ids , L )

Code 24.31 Using the Assemble function.

1 >>> smb = greedy . Assemble ( ids , chops , blosum . BLOSUM50 , blosum . PBET , 20 )
2 >>> cseq = ngd . CatSeq ( smb )
3 >>> len ( cseq )
4 281

351
24.3.4 Improvements

The non-greedy approach presented is still not the best system and does have a flaw.
Consider the sequences:
1 S1 = abcdef
2 S2 = defghi
3 S3 = jkldef

It is quite possible to align S1 with S2 and then S2 with S3. In doing so the following
assembly is created:
1 abcdef
2 ... defghi
3 jkldef

In this assembly S1 and S3 do not align well at all. Such problems are likely
to occur when building an assembly from pairs of sequences. An improvement to the
GA program would be to prevent such poor secondary alignments from occurring, or to
increase the cost of the assembly when there is a poor consensus.
It is important to note that there is no set method of creating a non-greedy algorithm.
The GA is only one method and as seen it could be modified to behave differently. The
main purpose of the non-greedy approach is to create a system that scores the entire
assembly rather than finding the best matches within it.

24.4 Summary

The previous chapter aligned two sequences. However, many applications require the
alignment of more than two sequences. Multiple sequence alignment can be performed
through two differing philosophies. The first is a greedy approach in which the assembly
is constructed by adding pairs of sequences according to their pair alignment scores. The
non-greedy approach attempts to find the best overall assembly by using machine learning
techniques. This approach does not consider the alignments according to their pairing
scores but rather attempts to optimize the entire alignment. The latter approach is much
more expensive but can provide better results.

Problems

1. Run the greedy assembly with a threshold that is 90% of the maximum value in M.
Interpret the results.
2. Apply the greedy algorithm to English text. Chop written text up into many sub-
sequences and then assemble using the greedy approach. Is this assembly similar to

the original?

3. Use different matrices (BLOSUM, PAM, etc.) in computing BruteForceSlide.


Does the use of a different matrix change the assembly?

4. Measure the scale-up effect on computation time. For strings of different sizes com-
pute the assembly and measure the time of computation. Plot the computational
time versus the size of the original data string.

5. Modify the greedy algorithm to handle sequences and their complements. The pro-
gram should note that if a string is used in making a contig then it and its comple-
ment should be removed from further consideration.

6. Is it possible to have a consensus sequence that is shorter than the original sequence?
In this case the original data is completely represented in the sequence segments used
as inputs. Consider a case in which the original sequence has a repeating segment
and that this repeating region is longer than the cut length used when chopping up
the original sequence.

353
354
Chapter 25

Trees

Trees are a very effective method of organizing data and coursing through it to
find relationships. This chapter reviews a few tree-based structures but again is not an
exhaustive study.

25.1 Dictionary

The dictionary in a word processor does not search the entire English dictionary every
time a new word is typed. That would be a horrendously inefficient process. One approach
is to build a search tree to speed up the spell-checking process.
The tree is a simple design where there are two basic types of nodes. One type is
an intermediate node which is a letter that is not at the end of a word and the second is
a terminal node which represents the end of a word although not necessarily the end of a
tree branch.
A simple example is to build a tree from the following words:

• CAT
• CART
• COB
• COBBLER

These four words are organized in a tree search as shown in Figure 25.1. The shaded
nodes are those which hold the last letter of a word.

355
Figure 25.1: The dictionary tree for the four words.
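One way to realize this tree in Python is with nested dictionaries, using a marker key (here '$', an arbitrary choice) for the terminal nodes:

```python
def build_tree(words):
    # Each node is a dict of letters; '$' marks a terminal node.
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node['$'] = True
    return root

def in_tree(tree, word):
    node = tree
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return '$' in node          # word ends on a terminal node

tree = build_tree(['CAT', 'CART', 'COB', 'COBBLER'])
print(in_tree(tree, 'COB'))    # True
print(in_tree(tree, 'COBB'))   # False: not a terminal node
```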

25.2 Sorting

Given a vector of numbers the search for the maximum value can easily be performed as
shown in Code 25.1. Line 2 creates the data and line 3 sets the variable mx to the first
value in the vector. In the for loop each value is compared to that of mx. If the considered
value is greater than mx then mx takes on this new value as shown in Line 6. Of course, in
the numpy package there already exists a max function, as shown in Line 9.

Code 25.1 A slow method to find a maximum value.


1 >>> import numpy as np
2 >>> a = np . random . rand (10)
3 >>> mx = a [0]
4 >>> for i in a :
5 if i > mx :
6 mx = i
7 >>> mx
8 0.9070089112760934
9 >>> a . max ()
10 0.9070089112760934

Sorting data can be performed by repeatedly performing the maximum function on


the remaining data. For example, in the first iteration the max is found and then removed
from consideration. Then the next max is located and removed from consideration. The
process continues until all of the data has been placed in the list collecting the maximums.
This is a very inefficient manner to perform the sorting algorithm.
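The repeated-maximum procedure just described can be sketched in a few lines; it is O(N^2) and is shown only for comparison with the built-in sort:

```python
def max_sort(a):
    # Sort by repeatedly extracting the maximum value.
    work = list(a)
    out = []
    while work:
        m = max(work)       # find the current maximum
        work.remove(m)      # remove it from consideration
        out.append(m)       # collect the maximums in order
    return out

print(max_sort([0.38, 0.72, 0.41, 0.02]))   # [0.72, 0.41, 0.38, 0.02]
```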
The numpy package does offer the argsort function which returns the indexes that
would sort the data from low to high. Line 1 of Code 25.2 creates a random vector

356
which is shown starting in line 3. The argsort command is applied in line 5. This returns
the indexes in a sorted order. In this case the first index that is returned is 3 and thus
a[3] is the lowest value in the vector.

Code 25.2 Using commands to sort the data.

1 >>> a = np . random . rand (10)


2 >>> a
3 array ([ 0.379 , 0.718 , 0.41 , 0.018 , 0.318 , 0.64 ,
4 0.909 , 0.716 , 0.898 , 0.963])
5 >>> ndx = a . argsort ()
6 >>> ndx
7 array ([3 , 4 , 0 , 2 , 5 , 7 , 1 , 8 , 6 , 9] , dtype = int64 )
8 >>> a [ ndx ]
9 array ([ 0.018 , 0.318 , 0.379 , 0.41 , 0.64 , 0.716 ,
10 0.718 , 0.898 , 0.909 , 0.963])
11 >>> a . sort ()
12 >>> a
13 array ([ 0.018 , 0.318 , 0.379 , 0.41 , 0.64 , 0.716 ,
14 0.718 , 0.898 , 0.909 , 0.963])

Line 8 shows the command to display all of the data in the sorted order. Line 11
shows the sort command which actually rearranges the data in the vector. The original
ordering of the data is lost after this command.

25.3 Linked Lists

Moving data about in a computer memory is expensive for large amounts of data. Thus,
the concept of a linked list is used to sort the data without moving the data.
In the linked list concept each piece of data also contains an identification and a
link. This is shown in Figure 25.2. In this case there are four pieces of data and for the
example the IDs are 1, 2, 3 and 4 respectively. However, the data is not in a sorted order.
In this example, the last piece of data has the lowest value and the first piece of data has
the next lowest value. Instead of moving the last piece of data the link is changed to point
to the first piece of data.

Figure 25.2: A linked list.

357
A different example is shown in Figure 25.3. Initially, there are three pieces of data
and they are sorted. Now, a fourth piece of data is added. It is placed at the end of the
data where there is empty memory in the computer. The links are then rearranged as
shown in the lower portion of the image thus indicating the sort order without actually
moving the data.

Figure 25.3: A linked list.

There are multiple manners in which a linked list can be created in Python. One
approach is to use a dictionary as shown in Code 25.3. An empty dictionary is created in
line 1 and the first data item is placed in line 2. In this scheme the ID is the key in the
dictionary and the two-element list contains the data value and the link. In this case the link is -1
indicating that it is not linked to any other data.

Code 25.3 Populating the dictionary.

1 >>> dct = {}
2 >>> dct [0] = [0.18 ,-1]
3 >>> dct [1] = [0.35 ,-1]
4 >>> dct [0][1] = 1
5 >>> dct [2] = [0.2 ,-1]
6 >>> dct [0][1] = 2
7 >>> dct [2][1] = 1

A second piece of data is created in line 3 and it is also not linked to any other piece
of data. For the data to be in sort order then the first data needs to link to the second
and so in line 4 the link of the first item is changed to the ID of the second item.
A third item is created in line 5 and it is to be inserted between the previous two
items. So, its link and the item that links to it are modified in the final two lines. This is
shown in Figure 25.4.
Once the data is in a linked list then the recall of the data is simple. Code 25.4 starts
with creating an empty list named answ which will collect the data in a sorted order. Line
2 creates the integer k which will keep track of the location in the list. It is initially set to
the first item in the linked list. This is not necessarily the first item in the dictionary. In
the case of sorting data this is the item in the dictionary with the lowest data value. In
the case of Figure 25.4 k=0.

358
Figure 25.4: A linked list.

Code 25.4 Printing the results.

1 >>> answ = []
2 >>> k = 0
3 >>> while k != -1:
4         d, k = dct[k]
5         print( d, k )
6         answ.append( d )
7

8 0.18 2
9 0.2 1
10 0.35 -1
11 >>> answ
12 [0.18, 0.2, 0.35]

The while loop extracts each piece of data. Line 4 retrieves the data and the link
to the next item. These are printed to the console. Line 6 places the retrieved data into
the answer list. The process continues until the last item is found which will have a link
of -1.
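The insertion steps of Codes 25.3 and 25.4 can be collected into a single helper function. The sketch below is not part of the book's scripts; the function name insert_sorted and the explicit head variable are illustrative assumptions, but the dictionary scheme of {ID: [value, link]} is the same as above.

```python
def insert_sorted(dct, head, newid, value):
    # Add the new item with no outgoing link yet.
    dct[newid] = [value, -1]
    # An empty list, or a value smaller than the head:
    # the new item becomes the head of the list.
    if head == -1 or value < dct[head][0]:
        dct[newid][1] = head
        return newid
    # Otherwise walk the links until the next item is larger.
    k = head
    while dct[k][1] != -1 and dct[dct[k][1]][0] < value:
        k = dct[k][1]
    # Splice the new item between k and its old successor.
    dct[newid][1] = dct[k][1]
    dct[k][1] = newid
    return head

dct, head = {}, -1
for i, v in enumerate( [0.18, 0.35, 0.2] ):
    head = insert_sorted( dct, head, i, v )
# the links now match Code 25.3: 0 -> 2 -> 1
```

Inserting 0.18, 0.35 and then 0.2 reproduces the links built in Code 25.3, and traversing from the returned head recovers the values in sorted order.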

25.4 Binary Tree

A binary tree is similar to a linked list except that every node has two links. An example
is shown in Figure 25.5. The flow starts at the top and each parent node has up to two
child nodes. A node without any children is called a terminal node.
Binary trees are used for several different applications. In the example used here, the
tree sorts the data. As seen in Figure 25.5, every child node has a data value larger
than its parent. When a new node is added it is attached at any open child location.
Then the process moves the node upwards according to the rule that all parents must
have a lower data value than their children.
Consider the case in Figure 25.6 in which nodes V1 and V4 violate this rule. The
procedure is for V1 and V4 to swap positions in the tree. The result is shown in Figure
25.7. This process continues moving V4 upwards until the parent/child rule is no longer

359
Figure 25.5: A binary tree.

violated.

Figure 25.6: A tree for sorting with incorrect positions of V1 and V4.

The swapping process looks easy, but it does involve several other nodes. Figure
25.8 shows the same tree but highlights all of the links that need to be adjusted when
swapping V1 and V4.
After all of the data is in the tree, the next step is to remove nodes such that
the data is retrieved in order from the lowest value to the highest. If the parent/child
rule is obeyed then the node with the lowest data value must be at the top of the tree.
The data from this node is placed into an answer list and this node is then removed
from the tree. One of the two children must be raised up to replace this node. The child
with the lower data value is chosen and moved up to replace the parent. This is shown
in Figure 25.9.
This leaves an empty slot and one of the children, V1 or V3, must move up into the
empty slot. The child with the lower data value is chosen and moved upwards. In this

360
Figure 25.7: A tree for sorting.

Figure 25.8: The affected nodes.

361
Figure 25.9: Removal of the first node.

case that is V1. The result is shown in Figure 25.10, but as seen this leaves a new hole in
the tree.

Figure 25.10: Replacing a hole.

The steps of replacing a hole are repeated until a terminal node is reached. The
result is shown in Figure 25.11.
After this is completed then the new top node is removed and the data is placed
into the answer list. Again this leaves a hole at the top and the process of moving nodes
upwards to replace holes is repeated. The removal of the top node and hole-filling is
repeated until the tree is empty. The answer list will contain all of the data in a sorted
order.
While this process is a little more complicated than brute force searches, it is sig-
nificantly faster for large data sets. Consider the case where there are 1,000,000 pieces of
data. To find the minimum value a program would need to search the entire list of data.
Thus, the loop would have 1,000,000 comparisons. That only finds the first minimum. To
sort the data this process is repeated 1,000,000 times, except that each time it is
repeated the size of the data set is slightly smaller. So, the total number of comparisons
is 1,000,000 × 1,000,000/2 = 5 × 10¹¹.

362
Figure 25.11: Replacing a hole completion.

Figure 25.12: The process of the second node.

363
Now consider the tree sort with 20 layers (2²⁰ is just over 1,000,000). Each time a node
is added it could take up to 20 swaps to properly place it in the tree, although on average
the number of swaps would be less than 10. The same is true for the process of removing
a node. So, each node is responsible for roughly 20 swaps (or comparisons). Since there are
1,000,000 nodes, the adding and removal process needs to be repeated that many times. So
the sorting process using a tree requires roughly 1,000,000 × 20 = 2 × 10⁷ comparisons. That
is significantly less than the brute force method.
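The tree sort described above is essentially the binary heap that Python's standard library already provides. The sketch below uses heapq rather than the book's dictionary scheme, but it demonstrates the same two phases: arrange the data into a tree that obeys the parent/child rule, then repeatedly remove the top node, with each removal costing at most about log₂(N) swaps.

```python
import heapq
import random

data = [random.random() for i in range(1000)]
heap = list(data)
heapq.heapify(heap)    # arrange into a tree obeying the parent/child rule

answ = []
while heap:
    # Remove the top (lowest) node; the hole is refilled by moving
    # nodes upward, taking at most log2(len(heap)) swaps.
    answ.append(heapq.heappop(heap))

print(answ == sorted(data))   # True
```

Repeating the pop 1,000,000 times with at most 20 swaps each reproduces the 2 × 10⁷ comparison estimate above.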
Creating Python code for a binary tree is almost the same as a linked list. Code
25.5 shows the same concept of using a dictionary except that each node has two possible
links. Since this is the only node in the tree both links are -1.

Code 25.5 Initiating a tree.

1 >>> tree = {}
2 >>> tree [0] = [0.4 , -1 , -1]

25.5 UPGMA

The UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm builds
a simple tree by continually finding best pair matches and replacing them with a parent
node. Consider a case shown in Code 25.6 which generates six vectors, each of length 10,
containing random values. The matrix M shows the score of every possible pair. The
difference value is subtracted from the maximum possible value, thus creating a similarity
score such that a higher score is a better match. Only the lower left portion of the matrix
needs to be computed.
In this example the best score is 8.5 and belongs to column 1 and row 5. Therefore,
the best matching data vectors are data[1] and data[5]. The UPGMA creates a small
tree from these two data vectors as shown in Figure 25.13 where each data vector is
represented by V. At the top of this tree is V7 which is not part of the original data.
This is an artificial data vector created from the average of V1 and V5. This new data
vector is added to the other data vectors and V1 and V5 are both removed from further
consideration.

Figure 25.13: The first pairing in the UPGMA.

This maneuver will require that M be of a bigger size. In fact, in the UPGMA
algorithm the size of M is (2N − 1) × (2N − 1) where N is the number of original data

364
Code 25.6 Creating data.
1 >>> from numpy import random, zeros
2 >>> data = random.ranf( (6,10) )
3 >>> M = zeros( (6,6), float )
4 >>> for i in range( 6 ):
5 for j in range( i ):
6 M[i,j] = 10 - (abs( data[i]-data[j])).sum()
7

8 >>> M
9 array([[ 0. , 0. , 0. , 0. , 0. , 0. ],
10 [ 7.2, 0. , 0. , 0. , 0. , 0. ],
11 [ 6.1, 5.4, 0. , 0. , 0. , 0. ],
12 [ 5.6, 5.5, 6.9, 0. , 0. , 0. ],
13 [ 6.0, 6.1, 7.0, 7.4, 0. , 0. ],
14 [ 6.8, 8.5, 6.3, 6.3, 7.0, 0. ]])

vectors. Code 25.7 initializes M to this new size and fills it with the scores as in Code
25.6. The maximum value is located and returned as location v, h. Furthermore, there is
going to be a need for (2N − 1) data vectors and these are established as vecs.

Code 25.7 Making M and partially filling it with data.


1 >>> M = zeros( (11,11), float )
2 >>> for i in range( 6 ):
3 for j in range( i ):
4 M[i,j] = 10 - (abs( data[i]-data[j])).sum()
5 >>> v,h = divmod( M.argmax(), 11 )
6 >>> v,h
7 (5, 1)
8 >>> vecs = zeros( (11,10), float )
9 >>> for i in range( 6 ):
10 vecs[i] = data[i] + 0

Figure 25.13 requires that a new vector, vecs[7], be created which is shown in Code
25.8. The loop in Lines 3 and 4 computes the score of this new vector to the others. Lines
5-8 eliminate all rows and columns that are associated with vecs[1] and vecs[5]. The
variable last keeps track of the last known vector in the list and it increments with each
new vector.
The next iteration finds the new largest value in M and repeats the process. In this
example vecs[3] and vecs[4] generate the best match and so a new tree is created with
these two. This is shown in Figure 25.14. On the third iteration the best match is between

365
Code 25.8 Altering M after the creation of a new vector.
1 >>> last = 7
2 >>> vecs[last] = (vecs[v] + vecs[h])/2.0
3 >>> for i in range( last ):
4 M[last,i] = 10 - (abs( vecs[i]-vecs[last])).sum()
5 >>> M[v] = zeros(11)
6 >>> M[h] = zeros(11)
7 >>> M[:,v] = zeros(11)
8 >>> M[:,h] = zeros(11)
9 >>> last += 1

vecs[2] and vecs[8]; however, vecs[8] is already in a tree. Thus, vecs[2] is attached
to the existing tree creating vecs[9]. This is shown in Figure 25.15. The final type of
iteration is one in which both of the vectors exist in different trees. In this case the two
trees are joined together as shown in Figure 25.16.

Figure 25.14: The second iteration.

Figure 25.15: The third iteration.

The UPGMA function is shown in Code 25.9. The input data indata is a list of
the data vectors (not a matrix). The scmat is the matrix that contains the pairwise scores
(similar to the M matrix in previous examples, except that here the raw differences are
stored, so the smallest entry marks the best match). The dictionary tree collects the nodes
as they are computed. The list used collects the names of data vectors after they have been
used to prevent the re-use of these vectors. In the loop starting on Line 12, the best match
is found in Line 13, which returns the location in scmat where the best match occurs. A new
node is added to tree and the average of the two constituent vectors is computed in Line 16. The
loop starting on Line 18 computes the similarity of the new vector with the previous only

366
Figure 25.16: Joining two trees.

if they have not been previously used. The final command removes the comparison scores
for the two vectors that are being removed from further consideration.
In the example at the end six random data vectors of length 10 are used as inputs.
The tree is computed and printed. Recall that the tree is a dictionary and the data of the
dictionary contains the two children and the score. The tree produced by this system is
shown in Figure 25.17.

Figure 25.17: The tree for the results in Code 25.9.

Now that the tree is constructed it needs to be converted to a viewable format.


There are several websites that can create trees from given data, the only issue is to
convert the data in tree to a string format that the website requires. One such website
is http://iubio.bio.indiana.edu/treeapp/treeprint-form.html.
The function Convert in the indiana.py module performs this conversion. This
module is contained within the suite of Python scripts that accompany this book and the
call to this function is shown in Code 25.10. Line 3 prints the string to the console and this
string can simply be pasted into the form on the website given in the previous paragraph.
The user can select the style of the tree and it can be returned in a couple of different

367
Code 25.9 The UPGMA function.
1 # upgma.py
2 def UPGMA( indata ):
3 data = copy.deepcopy( indata )
4 N = len( data ) # number of data vectors
5 N2, BIG = 2*N-1, 999999
6 scmat = np.zeros( (N2,N2), float ) + BIG
7 # initial pairwise comparisons
8 for i in range( N ):
9 for j in range( i ):
10 scmat[i,j] = (abs( data[i]-data[j] )).sum()
11 tree, used = {}, []
12 for i in range( N-1 ):
13 v,h = divmod( scmat.argmin(), N2 )
14 tree[N+i] = (v, h, scmat.min() )
15 used.append( v ); used.append( h )
16 avg = ( data[v] + data[h])/2.
17 data.append( avg )
18 for j in range( N+i ):
19 if j not in used:
20 scmat[N+i,j] = (abs( avg-data[j] )).sum()
21 scmat[v] = np.zeros( N2 ) +BIG
22 scmat[h] = np.zeros( N2 ) +BIG
23 scmat[:,v] = np.zeros( N2 ) +BIG
24 scmat[:,h] = np.zeros( N2 ) +BIG
25 return tree
26

27 >>> from numpy import set_printoptions


28 >>> set_printoptions( precision = 3 )
29 >>> data = []
30 >>> for i in range( 6 ):
31 data.append( random.rand( 10 ))
32

33 >>> net = UPGMA( data )
34 >>> for i in net.keys():
35         print( i, net[i] )
36 8 (7, 0, 2.380)
37 9 (4, 1, 3.257)
38 10 (9, 8, 2.728)
39 6 (5, 3, 2.260)
40 7 (6, 2, 2.270)

368
digital formats.

Code 25.10 Using the Convert function.


1 >>> import indiana
2 >>> sg = indiana.Convert( tree )
3 >>> print( sg )

25.6 Non-Binary Tree

Of course it is possible to have a non-binary tree. In some cases, such as the dictionary
shown in Figure 25.18, this is desired. The only real difference in the Python script is the
number of links that are allowed. In this case the link integer can be replaced with a list
of links that can grow or shrink depending on the number of links that a node has.

Figure 25.18: A nonbinary tree.

However, it should be noted that in some applications, such as evolutionary trees,
there is a strong mathematical argument that only a binary tree is needed.
Trees with nodes that have more than two children can also be represented as binary trees.
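As a brief illustration of the list-of-links idea (the node values here are made up, not from the book), a non-binary tree fits in the same dictionary scheme used earlier:

```python
# Each node is [value, list_of_child_IDs]; terminal nodes have empty lists.
tree = {}
tree[0] = ['animal', [1, 2, 3]]   # a root with three children
tree[1] = ['dog', []]
tree[2] = ['cat', [4]]
tree[3] = ['bird', []]
tree[4] = ['tabby', []]

# A simple depth-first traversal that visits every node.
def visit(tree, k):
    names = [tree[k][0]]
    for child in tree[k][1]:
        names += visit(tree, child)
    return names

print(visit(tree, 0))   # ['animal', 'dog', 'cat', 'tabby', 'bird']
```

The only change from the binary case is that the link field is a list, so a node can gain or lose children without restructuring the dictionary.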

25.7 Decision Tree

A decision tree is used to sort through a decision that involves multiple components.
Consider the case of sorting through health information. In this case data from several

369
people is collected. Some of these people have a specified illness while the others do not.
The data collected can include things such as (smoking, drinking, living location, age,
diet, exercise, genetics, etc.). Which of these factors contribute to the illness?

25.7.1 Theory

Consider just one of the factors such as sugar intake. Each person has a certain number of
grams of sugar they consume each day. The chart in Figure 25.19 shows the distribution
of healthy and sick patients versus their sugar intake. The x-axis is the amount of sugar.
The green line (on the right) shows the histogram of sick people versus their sugar intake.
The red line (on the left) shows the histogram of the healthy people.

Figure 25.19: Data distribution.

As seen, the distributions are quite distinct and therefore the sugar intake is a good
indicator of whether a person is going to contract this particular disease. In this case
a vertical line can be drawn where the two curves intersect. This is the decision line and
is shown in Figure 25.20. If a new patient is seen and their sugar intake is measured then
the decision to be made is basically if they are left or right of this decision line. In this
case the decision line is about x = 1.8. Now, this decision is not perfect as some people
with x < 1.8 have become sick and some people with x > 1.8 remain healthy.
This example is an ideal case and usually reality is more like the distribution shown
in Figure 25.21. A decision line can still be created but there will be a lot of people that
will be erroneously classified.
The bell curves, or Gaussian distributions, can be computed from the average and
standard deviation of the data. The height of the curve is,

    y = A e^(−(x−µ)²/2σ²),    (25.1)

where µ is the average and σ is the standard deviation. For this case the amplitude, A,
is set to 1. The x is the input variable (location on the horizontal axis) and the y is the
output (height of the function). The µ is the horizontal location of the center of the curve,
and the σ controls the width (roughly the half width at half of the peak height).
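Equation (25.1) translates directly into a one-line function. This helper is a sketch, not part of the book's modules:

```python
import numpy as np

def gauss(x, mu, sigma, A=1.0):
    # Height of the Gaussian curve at location x (Equation 25.1).
    return A * np.exp(-(x - mu)**2 / (2 * sigma**2))

# The peak height is A, and the curve decays symmetrically about mu.
print(gauss(1.8, 1.8, 0.5))              # 1.0
print(round(gauss(2.3, 1.8, 0.5), 4))    # 0.6065
```

One standard deviation away from the center the height has fallen to exp(−1/2) ≈ 0.6065 of the peak.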

370
Figure 25.20: A decision.

The crossover point occurs when both curves have the same y value for a given input
x. Thus,

    e^(−(x−µ₁)²/2σ₁²) = e^(−(x−µ₂)²/2σ₂²),    (25.2)

where the subscripts 1 and 2 represent the two curves. The next step is to solve for x, and
so taking the log of both sides gives,

    −(x − µ₁)²/2σ₁² = −(x − µ₂)²/2σ₂²,    (25.3)

and each side is multiplied by −2 and then the square root of both sides produces,

    (x − µ₁)/σ₁ = (x − µ₂)/σ₂.    (25.4)

Now it is possible to solve for x. However, there is an issue in that these two curves may
actually have two crossover points. Such a case is shown in Figure 25.22(a). So, the proper
equation is,

    (x − µ₁)/σ₁ = ±(x − µ₂)/σ₂,    (25.5)

after noting that in the process of computing a square root it is possible to have two
solutions. Usually, the point that is to be used is the crossover point that is in between
the two peaks.
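Equation (25.5) can be solved for x in closed form. The helper below is a sketch (not the book's code); the '−' branch gives the crossover that lies between the two peaks, which is usually the one wanted.

```python
def crossover(mu1, s1, mu2, s2):
    # Solve (x - mu1)/s1 = ±(x - mu2)/s2 for x.
    # The '-' branch is a sigma-weighted average of the two means,
    # and so always lies between the two peaks.
    between = (mu1*s2 + mu2*s1) / (s1 + s2)
    solutions = [between]
    # The '+' branch exists only when the two widths differ.
    if s1 != s2:
        solutions.append((mu1*s2 - mu2*s1) / (s2 - s1))
    return solutions

# Equal widths: the crossover is the midpoint of the two means.
print(crossover(1.0, 0.5, 2.6, 0.5))   # [1.8]
```

With equal σ values the single crossover at the midpoint matches the x = 1.8 decision line discussed for Figure 25.20.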
A decision tree considers each of the variables, of which three examples are shown
in Figure 25.22. None of the variables divides the data nicely. However, the second
variable performs better than the others. So, this variable is selected as the first node in
a decision tree.
The decision line is created and all of the data is sorted according to the decision
line. Of course, some of the data will be mis-classified. An example (from a different

371
Figure 25.21: Closer to reality.

(a) (b) (c)

Figure 25.22: Distribution of people for three variables.

Figure 25.23: A decision node.

372
problem) is shown in Figure 25.23. This node uses parameter (or factor) 4. The decision
line is at x = 0.52. The training data is sorted as shown.
Had this node been able to perfectly sort the data then all of the data on one side
would be classified as False (healthy) and all of the data on the other side would have
been classified as True (sick). As seen this node was not perfect.
The next step is to create child nodes based on the sorted data. The child node on
the left considers only the data that was sorted to the left by the initial node. The
process continues until every node either has a child node or the data is perfectly
sorted, as shown in Figure 25.24.

Figure 25.24: A decision tree.

After the tree is constructed then it is possible to make a decision. Consider a
patient that has the parameters (0.1, 0.3, 0.3, 0.2, 0.9). The first node uses parameter 4
and this patient has a value of 0.9. This is greater than the threshold γ = 0.52 and so
this patient is sent to the right on the tree. The next node uses parameter 3 and this
patient has a value of 0.2, which is less than the threshold of γ = 0.43, so this patient is
sent to the left. The next node uses parameter 0 and the patient has a value of 0.1, which
again sends the patient to the left. All of the patients in this group are classified as False
(healthy) and so the decision is reached that this patient will not contract this disease.

25.7.2 Application

This section will walk through a demonstration of building and using a decision tree. In a
single tree there are multiple nodes which have attributes and functions. Therefore, there

373
is an advantage for creating the nodes as an object-oriented class. Furthermore, a real
problem could employ more than one tree and thus the tree is also constructed as a class.
First, though, it is important to generate a data set for this example problem.

25.7.2.1 Data

In order to generate usable data for a decision tree it is necessary that the data have some
structure. It is not possible to make a decision on purely random data.
Fake data is created in the FakeDtreeData function shown in Code 25.11. The
philosophy is that this is generating data for N patients and for each patient a set of
M parameters are measured. Each patient is classified as either sick or healthy (True or
False). The function receives the N and M parameters as arguments.

Code 25.11 The FakeDtreeData function.


1 # decidetree.py
2 def FakeDtreeData( N, M ):
3     prms = np.random.rand( M )**2
4     data = []
5     for i in range( N ):
6         mylife = np.random.rand( M )
7         temp = ( mylife * prms ).sum() / np.sqrt( M )
8         sick = temp > 0.5
9         data.append( (i, sick, mylife) )
10     return data

In line 3 a vector of parameters, prms, is created. The parameters emulate measure-


ments by a physician. These are random numbers that are squared. The average value
in this vector is close to 0.33 and the standard deviation is about 0.3. Thus, most of the
numbers are below 0.5. This set of parameters will be applied to all patients. The idea
is that the few large values are important towards determining the health of the patient,
but the physician would not know which of these values are the large ones. It is the task
of the decision tree to determine which parameters are important.
Line 6 generates random numbers for a single patient. These are then multiplied
by the parameters in line 7. Patients that have high values in the same place as the prms
vector will have a higher value of temp. If this value is over 0.5 then line 8 classifies the
patient as sick. The information for a single patient is their ID, their state of health, and
their random vector. The prms data is not returned. This creates N patients, each with
M parameters that are somehow related to their state of health.
Code 25.12 shows the call to the data. In this case, the random function generator
is given a seed so that the data can be replicated. The output is a list data which contains
N tuples. Each tuple has the ID, health state, and the patient’s M parameters.

374
Code 25.12 Using the FakeDtreeData function.

1 >>> import numpy as np
2 >>> import decidetree as dte
3 >>> np.random.seed( 20236 )
4 >>> data = dte.FakeDtreeData( 20, 10 )

Storing the information in a list is not the most efficient method for computation
processing. So, the next step is to create two matrices. One matrix will contain the data
for sick patients and the other will store data for healthy patients. Each row in a matrix
is the patient's M parameters. Since the number of patients is not set, the data is collected
in lists as shown in lines 1 through 6 in Code 25.13. The last two lines then convert these
lists into matrices.

Code 25.13 Separating the data.

1 >>> Ts, Fs = [], []
2 >>> for d in data:
3         if d[1]:
4             Ts.append( d[2] )
5         else:
6             Fs.append( d[2] )
7

8 >>> Ts = np.array( Ts )
9 >>> Fs = np.array( Fs )
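The same separation can also be done compactly with numpy boolean masks. A sketch, assuming the (ID, sick, vector) tuples produced by FakeDtreeData (the tiny data set below is made up for illustration):

```python
import numpy as np

# A tiny stand-in for the output of FakeDtreeData (values are made up).
data = [(0, True,  np.array([0.9, 0.8])),
        (1, False, np.array([0.1, 0.2])),
        (2, True,  np.array([0.7, 0.6]))]

vecs = np.array([d[2] for d in data])   # one row per patient
sick = np.array([d[1] for d in data])   # boolean mask of sick patients
Ts, Fs = vecs[sick], vecs[~sick]        # rows split by health state
print(Ts.shape, Fs.shape)               # (2, 2) (1, 2)
```

Indexing with the boolean mask replaces the explicit loop; the result is the same pair of matrices.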

25.7.2.2 Scoring a Parameter

The first step in creating the first node in the decision tree is to compute the ability of each
parameter to separate the sick patients from the healthy patients. This follows the process
shown in Section 25.7.1. For each of the M parameters the distributions are computed
and the intersections of the distribution curves is determined. The score is the ability of
a parameter to separate the two classes of patients.
The ScoreParam function computes the score for a single parameter. It is a rather
lengthy function and so it is not shown here. However, Code 25.14 shows the concepts of
the function. The call to the function receives the data and the parameter being tested.
Thus, this function will be called M times, once for each parameter.
The first step is to gather the statistics for that single parameter. These are the
average and standard deviation for that parameter for the sick and again for the healthy
patients. If the standard deviation is less than 0.1 then it is set to 0.1. Values that are
too small generally appear from small data sets and are not representative of the actual

375
data.

Code 25.14 Concepts of the ScoreParam function.

1 # decidetree.py
2 def ScoreParam( data, prm ):
3     # convert to vectors and get stats
4     ## avg and stdev of sick and healthy, stdev min = 0.01
5     # find crossover
6     # count the sicks on the left side
7     # count the healthy on the left side
8     return score, x
9

10 >>> dte.ScoreParam( data, 0 )
11 (0.5505050505050505, 0.40407785631484849)
12 >>> dte.ScoreParam( data[:10], 0 )
13 (0.8, 0.39245201684633846)

From the average and standard deviations the Gaussian distributions can be plotted
according to Equation (25.1). Line 5 indicates that the next step is to find the crossover
point which follows the discussion ending with Equation (25.5). Now, the node has the
crossover point and it is possible to separate the data vectors by sending each vector to
either the left or right branches of the node. The next step in this algorithm is to determine
the percentage of sick and healthy people that went to each side of the nodes. If this node
perfectly separated all of the patients into sick and healthy then it will produce a high
score of 1.0. This function returns that score and the crossover point x.
Two examples are shown. The first example computes the score for all of the data
for the first parameter. The score is 0.55 and the crossover value is x = 0.4. The second
example performs the same test for the first ten vectors only. Of course the score is closer
to 1 since it is easier to separate fewer data vectors. The crossover point, though, is almost
the same, which lends confidence that the process is behaving properly.
This process is applied to all of the parameters, and the one with the highest score is
believed to be the best at separating the healthy patients from the sick patients. It will
become the top node in the tree.
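Although the full ScoreParam is not listed, its logic can be sketched as follows. Everything here is an assumption of the book's implementation: the function name, the floor on the standard deviation, and scoring a parameter by the fraction of each group that lands on its expected side of the crossover line.

```python
import numpy as np

def score_param(Ts, Fs, prm):
    a, b = Ts[:, prm], Fs[:, prm]          # one column of each matrix
    m1, s1 = a.mean(), max(a.std(), 0.1)   # stats for the sick patients
    m2, s2 = b.mean(), max(b.std(), 0.1)   # stats for the healthy patients
    # Crossover between the two peaks (the '-' branch of Equation 25.5).
    x = (m1*s2 + m2*s1) / (s1 + s2)
    # Fraction of each group on its expected side of the decision line.
    if m1 > m2:
        score = 0.5 * ((a > x).mean() + (b <= x).mean())
    else:
        score = 0.5 * ((a <= x).mean() + (b > x).mean())
    return score, x

# Well-separated fake data: sick values high, healthy values low.
Ts = np.array([[0.9], [0.8], [0.7]])
Fs = np.array([[0.1], [0.2], [0.3]])
score, x = score_param( Ts, Fs, 0 )
print( round(score, 3), round(x, 3) )   # 1.0 0.5
```

A perfectly separating parameter scores 1.0; heavily overlapping distributions push the score toward 0.5.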

25.7.2.3 A Node

The tree will consist of several nodes and therefore there is justification for an object-
oriented approach. Each node will need to contain several values. It will need the pa-
rameter number (a value between 0 and M − 1) and the crossover value. These will be
stored as self.param and self.gamma. The node will also need to know which children
are connected to it. This is a binary tree and so the two possible branches are self.K1

376
and self.K2. The node will receive a list of sick and healthy vectors. These are stored as
self.G1 and self.G2.
The top node will be able to consider all of the parameters. The child node, however,
does not consider the parameters that were used by its ancestors. Thus, each node needs
a list of parameters that can be considered in creating the crossover value. This list of
indexes is stored in self.avail. Finally, the node will keep track of the identity of its
mother node as self.mom.
This class also has several functions but only the function names and returns are
shown in Code 25.15 due to the size of the program. The constructor initializes all of
the parameters. The Set function receives the two matrices and a list of possible indexes
which is usually list(range(M)). This function will then put the proper values into
the class variables.

Code 25.15 The variable and function names in the Node class.
1 # decidetree.py
2 class Node:
3     def __init__( self ):
4         self.param, self.gamma = -1, -1
5         self.K1, self.K2 = -1, -1
6         self.G1, self.G2 = [], []
7         self.avail = []
8         self.mom = -1
9     def Set( self, G1l, G2l, alist ):
10         ...
11     def Decide( self, G1vecs, G2vecs ):
12         ...
13     def Split( self, G1vecs, G2vecs ):
14         ...
15         return lg1, lg2, rg1, rg2
16     def __str__( self ):
17         ...
18         return s

The Decide function is used to determine which of the parameters from self.avail
best separates the given data. This function will set the variables self.param and
self.gamma.
The Split function will decide which data vectors will be sent to the left child or
the right child. This function returns four lists. The first two are the sick and healthy
patients that went to the left node and the other two are the sick and healthy patients
that went to the right node. These will be used in the construction of other nodes.
The final function is __str__ which is used by the print function to print information

377
about the node to the Python console.

25.7.2.4 The Tree

The decision tree is created from several linked nodes. Since it is possible that a real
problem could have several trees, a new class is created. The tree consists of a collection of
nodes which are stored as self.nodes. It also contains a list named self.next. When
a node is created it can create two children nodes which will have to be considered in
subsequent computations. As an example, the first node is nodes[0] and it creates two
children nodes[1] and nodes[2]. The program will then consider nodes[1] to compute
its crossover point and it will create nodes[3] and nodes[4] before nodes[2] has been
considered. So, this list contains the IDs of the nodes that have been created but have
not yet been processed to determine their crossover points. When a node is processed
to determine its internal values and children then it is removed from self.next. The
amount of data passed down to a child is about half of the data that the mother node
has. Eventually, the tree reaches a node that separates its subset of data perfectly and
no children are required for this node. Thus, the list self.next will grow as the initial
nodes are created and then shrink as the tree reaches the end nodes. When self.next is
empty the construction of the tree is complete.
There are several functions associated with the Tree class and the function names
are shown in Code 25.16. The SetDataVecs function receives the list of sick and healthy
data vectors. For this first node this is all of the data, but for the children nodes this
is only the data that is passed down from its mother. The Mother function determines
the parameters for the first node and returns the four lists that it will pass down to its
two children. The MakeKids function will make the two child nodes for a given mother.
It will determine the self.K1, self.K2, and self.mom but the other parameters will be
determined later. An example is shown in Code 25.17 which creates an instance of the
tree in line 2. It then provides the data that was generated and computes the mother
node.
Code 25.18 displays some of the information from this first node. Currently, it is not
connected to children nodes (lines 1 through 4) and the identities of the sick and healthy
patients are contained in lists. In this data set there are 11 sick and 9 healthy patients. It
has been determined that the second parameter (number 1) best separates the data and that
the crossover point is x = 0.623.
The Iterate function gets the next node ID from self.next and then proceeds to determine
its crossover and parameter values. It then separates the data for this node’s children
into the four lists. Finally, it calls MakeKids to make its children nodes. The function
MakeTree continually calls Iterate until the tree is completely built. Code 25.19 com-
putes the first set of children and now the algorithm has enough started to finish the tree
using the MakeTree function.

378
Code 25.16 The titles in the TreeClass.
1 # decidetree.py
2 class Tree:
3     def __init__( self ):
4         self.nodes = {}  # empty dictionary: ID = key
5         self.next = []   # list of who to consider next
6     def SetDataVecs( self, Tvecs, Fvecs ):
7         self.Tvecs = Tvecs + 0
8         self.Fvecs = Fvecs + 0
9     def Mother( self ):
10         ...
11         return lg1, lg2, rg1, rg2
12     def MakeKids( self, me, lg1, lg2, rg1, rg2 ):
13         ...
14     def Iterate( self ):
15         me = self.next.pop( 0 )
16         self.nodes[me].Decide( self.Tvecs, self.Fvecs )
17         lg1, lg2, rg1, rg2 = self.nodes[me].Split( self.Tvecs, self.Fvecs )
18         self.MakeKids( me, lg1, lg2, rg1, rg2 )
19     def MakeTree( self ):
20         while len( self.next ) > 0:
21             self.Iterate()
22     def Trace( self, query ):
23         ...
24         return trc, nodes

Code 25.17 Initializing the Tree.

1 >>> import decidetree as dte
2 >>> tree = dte.Tree()
3 >>> tree.SetDataVecs( Ts, Fs )
4 >>> lg1, lg2, rg1, rg2 = tree.Mother()

379
Code 25.18 The information of the mother node.
1 >>> print( tree.nodes[0].K1 )
2 -1
3 >>> print( tree.nodes[0].K2 )
4 -1
5 >>> print( tree.nodes[0].G1 )
6 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
7 >>> print( tree.nodes[0].G2 )
8 [0, 1, 2, 3, 4, 5, 6, 7, 8]
9 >>> print( tree.nodes[0].param )
10 1
11 >>> print( tree.nodes[0].gamma )
12 0.623049932763

Code 25.19 Making the tree.

1 >>> tree.MakeKids( 0, lg1, lg2, rg1, rg2 )
2 >>> tree.MakeTree()

25.7.2.5 A Trace

The Trace function is used after the tree is built. This function receives an input data
vector which could be one from the patients used in building the tree or a new patient not
yet seen before. This function will start with the top node and determine if this patient
should go to the left or the right child. The process iterates down the tree until it reaches
an end node. The input is classified as the type of patients in the final branch of the trace.
It returns information about the path that it took going down the tree and the nodes that
it used.
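The walk down the tree can be sketched with a minimal stand-in for the Node class. The attribute names K1, K2, param and gamma follow Code 25.15; the function name trace, the stand-in class, and the tiny example tree are illustrative assumptions, not the book's code.

```python
class Node:
    # Minimal stand-in with only the attributes a trace needs.
    def __init__(self, param=-1, gamma=-1, K1=-1, K2=-1):
        self.param, self.gamma = param, gamma
        self.K1, self.K2 = K1, K2

def trace(nodes, vec):
    # Walk from the top node (ID 0) to a terminal node, going right (K2)
    # when the value exceeds gamma and left (K1) otherwise.
    path, k = [], 0
    while nodes[k].param != -1:
        path.append(k)
        n = nodes[k]
        k = n.K2 if vec[n.param] > n.gamma else n.K1
    path.append(k)          # the terminal node reached
    return path

# A three-node tree: node 0 tests parameter 1 against 0.6.
nodes = {0: Node(param=1, gamma=0.6, K1=1, K2=2),
         1: Node(), 2: Node()}
print(trace(nodes, [0.2, 0.9]))   # [0, 2]
```

The book's Trace additionally reports which nodes were used; the path list here plays the same role.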
Consider the information from the first patient data[0]. Recall that this is a tuple
and that the third item is the patient’s data. This is the vector data[0][2]. The mother
node determined that the parameter to use was parameter 1. For this patient that
measurement was 0.717 as seen in Code 25.20. The top node in the tree is tree.nodes[0]
and it was determined that its crossover point was 0.623 (see Code 25.18). Since 0.717 >
0.623 the decision is to send this patient to the right child (K2). This is node number 2
and as seen this node uses parameter 2 with a crossover point of 0.189.
The process continues. At each node the decision is made as to whether to send the
patient to the left or right. Eventually the process comes to an end node. This is shown
in Code 25.21. In this case the decision from node 2 leads to node 6. This node does not
have any children as denoted by the -1 values for param and gamma. That means that this
node has perfectly separated the data that was given to it. Thus, the end of the tree has
been reached.

Code 25.20 Comparing the patient to the first node.

1 >>> data [0][2][1]


2 0.71709365399675173
3 >>> tree . nodes [0]. K2
4 2
5 >>> tree . nodes [2]. param , tree . nodes [2]. gamma
6 (2, 0.18944283607545931)

Code 25.21 The final node.


1 >>> data [0][2][2]
2 0.25715020298595992
3 >>> tree . nodes [2]. K2
4 6
5 >>> tree . nodes [6]. param , tree . nodes [6]. gamma
6 (-1 , -1)

The entire process is captured in the Trace function and the call is shown in Code
25.22. The input is the vector from the first patient. The trace shows that the decisions
were to go: right, right and right. In this case, the nodes were 0, 2 and 6. The classification
of the nodes[6] is used as the classification of the patient. Line 4 prints the information for
the last node in the trace. The list nodes[6].G1 has 11 entries but the list nodes[6].G2 is
empty. Thus, all of the data that reached this node were sick patients. The input patient
is classified as sick. In this case, the diagnosis of the patient is known. This is printed
in lines 10 and 11. The value of True indicates that the patient was sick and so the tree
classified the patient correctly.

Code 25.22 Running a trace.

1 >>> trc = tree . Trace ( data [0][2] )


2 >>> print ( trc )
3 ( ' RRR ' , [0 , 2 , 6])
4 >>> print ( tree . nodes [6] )
5 Kids -1 -1
6 Lists
7 [0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 9 , 10]
8 []
9 Avail [0 , 3 , 4 , 5 , 6 , 7 , 8 , 9]
10 >>> data [0][1]
11 True

At the end of the decidetree.py module there is a function named Example which

shows the steps for the entire process of generating the data, building the tree, and then
running a trace. The input is a seed for the random number generator. If the seed is 20236
then the above results are replicated. Other seed numbers will generate other patients.

Projects

1. Create a list that contains all of the sentences from the play Romeo and Juliet. Each
item in this list is one sentence from the play. Using a linked list, sort the sentences
from shortest to longest.

2. In this project a decision tree is created from the two bacterial genomes. For each
genome create a list of codon frequencies for all genes of sufficient length (more than
128 nucleotides). Declare one of the genomes to be class 1 and the other to be class
0. Using 90% of the vectors from each list create this decision tree. Use the other
10% for testing. Determine the percentage of the testing vectors that the tree can
correctly classify.

Chapter 26

Clustering

Measurements extracted from biological systems may be dependent on a large number of


variables in manners that are not yet understood. One method of analyzing such data
sets is to group data vectors that are similar. Once a group is collected then it can
be further analyzed to find the reasons for the similarity. Clustering algorithms are often
used to create these groups and one of the most common of these is the k-means clustering
algorithm. This chapter will focus on the development and use of the k-means and some
useful extensions.

26.1 Purpose of Clustering

Given a set of data vectors {X : ~x1, ~x2, ..., ~xN}, the objective is to group the vectors such
that each group contains only those vectors that are similar to each other. The measure of
similarity is defined by the user for each particular application. The number of clusters can
either be fixed or dynamic depending on the algorithm chosen. The result of the algorithm
will be a set of groups, and the constituents of each group are the set of self-similar vectors.
Code 26.1 creates a simple function named CData that generates random data for
clustering. Purely random data would be inappropriate for clustering so this algorithm
generates a small number of random seeds, and then it generates data vectors that are
random deviations from these seeds. In this fashion some of the vectors should be related
to each other through a common seed. These vectors, therefore, should find a reason to
cluster. The variable N is the number of vectors to be generated and the L is the length
of the vectors.
Code 26.2 presents a simple algorithm for comparing one vector to a set of vectors.
The comparison is performed by the absolute subtraction,

    s = Σ_i ‖~t − ~d_i‖,    (26.1)

Code 26.1 The CData function.
1 # clustering . py
2 def CData ( N , L , scale = 1 , K =-1 ) :
3 if K ==-1:
4 K = int ( np . random . rand () * N /20 + N /20)
5 seeds = np . random . ranf ( (K , L ) )
6 data = np . zeros ( (N , L ) , float )
7 for i in range ( N ) :
8 pick = int ( np . random . ranf () * K )
9 data [ i ] = seeds [ pick ] + scale *(0.5* np . random . ranf ( L )-
0.25)
10 return data
11

12 >>> np . random . seed ( 3996 )


13 >>> import clustering as clg
14 >>> data = clg . CData ( 100 ,10)

where the vector ~t is the target and ~d_i is the i-th data vector. In Code 26.2, diffs is a
matrix that contains the subtraction of the target vector from all of the vectors in vecs.
This command looks a bit odd in that the two arguments of the subtraction do not have
the same dimensions. NumPy understands this predicament (called broadcasting) and performs
the subtraction of the target vector with each row of vecs. The result is diffs, which has
the same dimensions as vecs. The sum command only sums along axis #1, which is the second
dimension in diffs.
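This broadcasting behavior can be checked on a tiny example; the arrays here are made up for illustration:

```python
import numpy as np

# A 1-D target subtracted from a 2-D matrix: NumPy broadcasts the target
# across every row, so diffs takes the shape of vecs, not target.
target = np.array([1.0, 2.0])
vecs = np.array([[1.0, 2.0],
                 [3.0, 1.0]])
diffs = abs(target - vecs)
print(diffs.shape)     # (2, 2)
print(diffs.sum(1))    # one score per data vector: [0. 3.]
```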

Code 26.2 The CompareVecs function.

1 # clustering . py
2 def CompareVecs ( target , vecs ) :
3 N = len ( vecs )
4 diffs = abs ( target - vecs )
5 scores = diffs . sum ( 1 )
6 return scores
7

8 >>> scores = clg . CompareVecs ( data [0] , data )

The executed command in Code 26.2 computes the comparison of the first vector
with the entire data set. A perfect match of a vector with the target will produce a score of
0. Code 26.3 sorts the scores and creates a plotting file that is shown in Figure 26.1. The
argsort function returns an array of indexes for the data sorted from lowest to highest.
Thus, scores[ag[0]] is the lowest score and scores[ag[-1]] is the highest score. The
ag is an array and it is used as an index in scores[ag]. This will extract the values of

scores according to the indexes of ag.
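A small made-up array shows the behavior:

```python
import numpy as np

# argsort returns the indexes that would sort the array from lowest to
# highest; indexing with that array returns the sorted values.
scores = np.array([3.0, 0.5, 2.0])
ag = scores.argsort()
print(ag)            # [1 2 0]
print(scores[ag])    # [0.5 2.  3. ]
```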

Code 26.3 Saving the data for GnuPlot.

1 >>> import gnu


2 >>> ag = scores . argsort ()
3 >>> gnu . Save ( ' plot . txt ' , scores [ ag ] )
4 gnuplot > plot ' plot . txt '

Figure 26.1: Sorted scores.

This plot is typical for this data even when a different vector is used as the target. There
are a few vectors that are similar to the target and many that are dissimilar. There is a
sharp differentiation between scores of 2 and 3. Thus, the threshold is chosen to be γ = 2.5
so that any score less than the threshold is considered to be a good match.
As a control experiment a simple greedy algorithm is created. One vector is chosen
as the target and all of the vectors that are close to it (scoring below the threshold value)
are collected as a single group. Vectors that belong to a group are not considered for
further grouping. This approach has an obvious problem in that a vector may not be placed
in the best group. Consider a case in which vector ~C is similar to vector ~A and very similar
to vector ~B. If ~A is chosen as the first target then ~C would be assigned to that group,
thus preventing ~C from joining the ~B group for which it was better suited. This
algorithm is merely a control to which better algorithms can be compared.
Code 26.4 displays the simple function CheapClustering for clustering data by
this greedy method. The data is converted to the list, work, to take advantage of some of

the properties of lists. The pop function removes a vector from the list and thus target
becomes this vector and it no longer exists in work. The nonzero function will return a
tuple containing the indexes of those scores that are less than the threshold, and the [::-1]
in Line 9 reverses the indexes so that the largest is first. The list group started in Line 10
collects the vectors that are deemed to be similar to the target in Lines 11 and 13. Once
the group is collected it is appended to the list of clusters in Line 16.

Code 26.4 The CheapClustering function.

1 # clustering.py
2 def CheapClustering( vecs, gamma ):
3     clusters = []   # collect the clusters here
4     ok = 1
5     work = list( vecs )   # copy of data that can be destroyed
6     while ok:
7         target = work.pop( 0 )
8         scores = CompareVecs( target, work )
9         nz = np.nonzero( np.less( scores, gamma ) )[0][::-1]
10         group = []
11         group.append( target )
12         for i in nz:
13             group.append( work.pop( i ) )
14         clusters.append( group )
15         if len( work )==1:
16             clusters.append( [ work.pop(0) ] )
17         if len( work )==0:
18             ok = 0
19     return clusters
20

21 >>> clusts = clg.CheapClustering( data, 2.5 )
22 >>> list( map( len, clusts ) )   # length of each cluster
23 [26, 21, 19, 11, 21, 2]

The ordering of nz from highest to lowest is necessary for the loop starting in Line 12.
Consider a case in which the ordering is from lowest to highest and in this example vectors
2 and 4 are deemed close to the target. The pop function on Line 13 would remove vector
2 from the list. In doing this, vector 4 would become vector 3 and in the next iteration the
removal of vector 4 (which would be the next item in nz) would remove the wrong vector.
By considering the vectors from highest to lowest this problem is averted.
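The index-shifting problem can be reproduced with a plain list; the element names here are made up:

```python
# Popping in ascending order: removing index 2 shifts every later item
# left, so a later pop(4) removes the wrong element.
work = ['v0', 'v1', 'v2', 'v3', 'v4', 'v5']
work.pop(2)                  # removes 'v2'
print(work.pop(4))           # intended 'v4' but prints 'v5'

# Popping in descending order leaves the earlier indexes valid.
work = ['v0', 'v1', 'v2', 'v3', 'v4', 'v5']
print(work.pop(4))           # 'v4'
print(work.pop(2))           # 'v2' -- both intended items removed
```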
In this particular experiment 6 clusters were created and they contained the following
number of members (26, 21, 19, 11, 21, and 2). These clusters will be compared to the
k-means clusters generated in the next section. A good cluster would collect vectors that

are similar and thus a single cluster should have a small cluster variance as measured by,

    ω_k = (1/N_k) Σ_j σ²_{k,j},    (26.2)

where σ²_{k,j} is the variance of the j-th element of the k-th cluster, and N_k is the number
of vectors in the k-th cluster. For each cluster the variances of the vector elements are
computed and summed. This scalar measures the variance of the vectors in a cluster. For
the example case the variances of the 6 clusters are shown in Code 26.5.

Code 26.5 The ClusterVar function.


1 # kmeans . py
2 def ClusterVar ( vecs ) :
3 a = vecs . std ( 0 )
4 a = ( a **2) . sum () / len ( vecs [0])
5 return a
6

7 >>> for i in range ( 6 ) :


8         print( i, "%.3f" % kmeans.ClusterVar( np.array( clusts[i] ) ) )
9 0 0.027
10 1 0.020
11 2 0.019
12 3 0.016
13 4 0.020
14 5 0.012

26.2 k-Means Clustering

The k-means clustering algorithm is an extremely popular and easy algorithm. The user
defines the number of clusters, K, and a method by which these clusters are seeded. The
algorithm will then perform several iterations until the clusters do not change. Each
iteration consists of two steps. The first is to assign each vector to a cluster thus creating
the cluster’s constituents. The second is to compute the average of each cluster. If a
vector is determined to belong to a different cluster then it changes the constituency of
the clusters and thus in the next iteration the averages will be different. If the averages are
different then other vectors may shift to new clusters. The process iterates until vectors
do not change from one cluster to another.
The steps are:

1. Initialize K clusters.
2. Iterate until there is no change

(a) Assign vectors to clusters
(b) Compute the average of each cluster
(c) Compare the previous clusters to the new clusters. If there is no change between
the two sets then set the STOP condition.

Each cluster is constructed from an initial seed vector. This vector can be a random
vector, one of the data vectors, or some other method as defined by the user. Usually, the
measure of similarity between a vector and a cluster average is a simple distance measure,
but again the user has the opportunity to alter this if an application needs a different
measure.
Code 26.6 displays two possible initiation functions. The function Init1 receives
the number of clusters and the length of vectors and just generates random vectors. The
problem with this approach is that there is no guarantee that a cluster will collect any
constituents. The function Init2 randomly selects one of the data vectors as a seed for
each cluster. It generates a list of indexes and shuffles them in a random order. The first
K indexes of this shuffled order are used as the seed vectors. In this function the take
function contains two arguments. The first is a list of indexes to be taken. The second
is the axis argument and this forces the take function to extract row vectors from data
instead of scalars.

Code 26.6 Initialization functions for k-means.


1 # kmeans . py
2 def Init1 ( K , L ) :
3 clusts = np.random.ranf( (K, L) )
4 return clusts
5

6 def Init2 ( K , data ) :


7 r = list ( range ( len ( data ) ) )
8 np . random . shuffle ( r )
9 clusts = data . take ( r [: K ] ,0 )
10 return clusts

Once an initial set of clusters is generated the next step is to assign each vector to
a cluster. This assignment is based on the closest Euclidean distance from the vector to
each cluster. Code 26.7 displays the AssignMembership function that computes these
assignments. In this function the list mmb is a list that collects the constituents for each
cluster and it contains K lists. The mmb[0] is a list of the members of the first cluster. This
list contains the vector identities; thus if mmb[0] = [0,4,7] then data[0], data[4] and
data[7] are the members of the first cluster. There are two for loops in this function.
The first initializes mmb and the second performs the comparisons and assigns each vector
to a cluster. In the second loop the score for each cluster is contained in the vector sc
and mn indicates which cluster has the best score.

Code 26.7 The AssignMembership function.

1 # kmeans.py
2 # Decide which cluster each vector belongs to
3 def AssignMembership( clusts, data ):
4     NC = len( clusts )
5     mmb = []
6     for i in range( NC ):
7         mmb.append( [] )
8     for i in range( len( data ) ):
9         sc = np.zeros( NC )
10         for j in range( NC ):
11             sc[j] = np.sqrt( (( clusts[j]-data[i] )**2).sum() )
12         mn = sc.argmin()
13         mmb[mn].append( i )
14     return mmb

The next major step is that each cluster needs to be recomputed as the average of
all of its constituents. Thus, if mmb[0] = [0,4,7] then clust[0] will become the average
of the three vectors data[0], data[4] and data[7]. Code 26.8
displays this function as ClusterAverage. On line 7 vecs is the set of vectors for the
i-th cluster. Recall that vecs is actually a matrix where the rows are the data vectors.
Thus, the k-th element of the average vector is the average of the k-th column of vecs.
The mean function on line 8 uses the 0 as an argument to compute the average of the
columns of the matrix.
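The axis arguments of take and mean can be checked on a small made-up matrix:

```python
import numpy as np

vecs = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])
# take with axis 0 extracts whole rows rather than scalars,
sub = vecs.take([0, 2], 0)
# and mean with axis 0 averages down the columns: one value per element.
print(sub.mean(0))    # [3. 4.]
print(vecs.mean(1))   # averaging across rows instead: [1.5 3.5 5.5]
```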

Code 26.8 The ClusterAverage function.

1 # kmeans . py
2 def ClusterAverage ( mmb , data ) :
3 K = len ( mmb )
4 N = len ( data [0] )
5 clusts = np.zeros( (K, N), float )
6 for i in range ( K ) :
7 vecs = data . take ( mmb [ i ] ,0 )
8 clusts [ i ] = vecs . mean (0)
9 return clusts

These are the major functions necessary for k-means clustering. The next step is to
create the iterations. Code 26.9 demonstrates the entire k-means algorithm. The initial
cluster clust1 is created on line 4. The ok flag set in line 5 is used to control the loop
in line 6. When ok is False then the loop will terminate. Line 7 places each vector in a
cluster and Line 8 computes the average of the clusters. Line 9 computes the difference
between the current cluster and the previous cluster. If there is no difference then line 11
will set the ok flag to False. Line 13 replaces the old cluster with the current cluster in

preparation for the next iteration or the return statement in line 14.

Code 26.9 The KMeans function.


1 # kmeans.py
2 # typical driver
3 def KMeans( K, data ):
4     clust1 = Init2( K, data )
5     ok = True
6     while ok:
7         mmb = AssignMembership( clust1, data )
8         clust2 = ClusterAverage( mmb, data )
9         diff = ( abs( clust1.ravel() - clust2.ravel() ) ).sum()
10         if diff==0:
11             ok = False
12         print( 'Difference', diff )
13         clust1 = clust2 + 0
14     return clust1, mmb

Code 26.10 displays an example using the same data and same number of clusters
from the previous section. The variances of these clusters are printed in lines 11 through 16.
These variances are on the whole smaller than those from the greedy algorithm indicating
that the members of these clusters are more closely related than in the previous case.
One of the clusters does have a higher variance than the other clusters. In the
k-means algorithm every vector will be assigned to a cluster. Even a vector that is not
similar to any other vector must be assigned to a cluster. Often this algorithm will end
up with one cluster that collects outliers and has a higher variance. The solution to this
is discussed in Section 26.4. However, it is important to first discuss how to solve more
difficult problems in Section 26.3.

26.3 More Difficult Problems

The Swiss roll problem is one in which data is organized in a spiral. One thousand data
points are shown in Figure 26.2. The data is created by the MakeRoll function, shown
with the commands that create the data in Code 26.11.
Using ordinary k-means it is possible to cluster the data. Code 26.12 shows the
RunKMeans function which clusters the data using the k-means algorithm. The clusters
are initialized in line 3 and in line 5 through 7 the standard k-means protocol is followed.
Code 26.13 uses the GnuPlotFiles function which will create plot files suitable for
GnuPlot or a spreadsheet.
The results of the k-means clustering are shown in Figure 26.3. Each colored region

Code 26.10 A typical run of the k-means clustering algorithm.

1 >>> np . random . seed ( 8193 )


2 >>> clust1 , mmb = kmeans . KMeans ( 6 , data )
3 Difference 7.41782845541
4 Difference 2.70456785889
5 Difference 0.180388645499
6 Difference 0.0
7

8 >>> for i in range ( 6 ) :


9         print( i, "%.3f" % kmeans.ClusterVar( data[mmb[i]] ) )
10

11 0 0.014
12 1 0.020
13 2 0.018
14 3 0.021
15 4 0.019
16 5 0.018

Figure 26.2: The Swiss roll data.

Code 26.11 The MakeRoll function.
1 # swissroll . py
2 def MakeRoll ( N =1000 ) :
3 data = np . zeros ( (N ,2) , float )
4 for i in range ( N ) :
5 r = np . random . rand ( 2 )
6 theta = 720* r [0] * np . pi /180
7 radius = r [0] + ( r [1]-0.5) *0.2
8 x = radius * np . cos ( theta )
9 y = radius * np . sin ( theta )
10 data [ i ] = x , y
11 return data
12

13 >>> np . random . seed ( 284554 )


14 >>> import swissroll as sss
15 >>> data = sss . MakeRoll ()

Code 26.12 The RunKMeans function.


1 # swissroll . py
2 def RunKMeans ( data , K =4 ) :
3 clust1 = kmeans . Init2 ( K , data )
4 dff = 1
5 while dff > 0:
6 mmb = kmeans . AssignMembership ( clust1 , data )
7 clust2 = kmeans . ClusterAverage ( mmb , data )
8 dff = ( abs ( clust1 . ravel ()-clust2 . ravel () ) ) . sum ()
9         print( dff )
10 clust1 = clust2 + 0
11 return clust1 , mmb

Code 26.13 The GnuPlotFiles function.


1 >>> clust , mmb = sss . RunKMeans ( data , 4 )
2 >>> sss . GnuPlotFiles ( mmb , data , ' mp ' )
3 gnuplot > plot 'mp0.txt', 'mp1.txt', 'mp2.txt', 'mp3.txt'

represents the members of a cluster. As seen, members of one cluster are on two different
parts of the spiral arm. In these results the vectors that represent the clusters are not on the
bands. For example, the average of the first cluster is located at (0.58, -0.27). This is in
between the two sections of points denoted by the red diamonds.

Figure 26.3: Clustering after k-means.

This example illustrates one of the main problems that users encounter with applying
a machine learning algorithm to data. It is essential to understand the nature of the
problem so that the algorithm can be used properly. If, in this case, the user wishes to
have clusters restricted to a single arm of the spiral then it is necessary to adjust the
algorithm. There are two possible avenues in which this can be accomplished. The first
is to represent the data in a different coordinate system, and the second is to modify the
k-means algorithm.

26.3.1 New Coordinate System

Knowing that the data is in some sort of spiral is evidence that a different representation
of the data is warranted. Since the data is in a spiral, polar coordinates are a natural choice. In
other applications the data may need to be transformed by more involved mathematics.
Code 26.14 shows the function GoPolar which performs this translation via
    r = √(x² + y²),    (26.3)

and
    θ = tan⁻¹(y/x).    (26.4)

Code 26.14 The GoPolar function.


1 # swissroll.py
2 def GoPolar( data ):
3     N = len( data )
4     pdata = np.zeros( (N,2), float )
5     for i in range( N ):
6         x, y = data[i]
7         r = np.sqrt( x*x + y*y )
8         theta = np.arctan2( y, x )
9         pdata[i] = r, theta
10     pdata[:,0] *= 10
11     return pdata
12

13 >>> pdata = sss.GoPolar( data )

In this program the function np.arctan2 is used instead of np.arctan because arctan2 is
sensitive to quadrants. Its answer has a range of 360 degrees, whereas the arctan function
has a range of only 180 degrees. The result is that each pdata[k] is the polar coordinates
of each data[k]. This function makes one small adjustment in that it multiplies the radius
by a factor of 10 which puts the radial and the angular values on the same scale.
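The quadrant sensitivity is easy to demonstrate: the points (1, 1) and (-1, -1) lie in opposite quadrants but have the same ratio y/x.

```python
import numpy as np

# arctan sees only the ratio y/x, so opposite quadrants collapse:
print(np.arctan(1.0 / 1.0))      # 0.785... (45 degrees)
print(np.arctan(-1.0 / -1.0))    # also 0.785... -- quadrant lost
# arctan2 takes y and x separately and keeps the quadrant:
print(np.arctan2(1.0, 1.0))      # 0.785... (45 degrees)
print(np.arctan2(-1.0, -1.0))    # -2.356... (-135 degrees)
```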
The converted data is now clustered by the same k-means algorithm as shown in
Code 26.15. Note that the data sent to GnuPlotFiles is the Cartesian data and not the
polar data. This is necessary since the plot is in Cartesian coordinates. However, the
clusters are defined from the polar data. The results are shown in Figure 26.4. By simply
casting the data into a different coordinate space the clustering is significantly different
and in this case produces the desired result.

Code 26.15 Calling the k-means function.

1 >>> clust, mmb = sss.RunKMeans( pdata, 4 )
2 >>> sss.GnuPlotFiles( mmb, data, 'mp' )
3 gnuplot > plot 'mp0.txt', 'mp1.txt', 'mp2.txt', 'mp3.txt'

26.3.2 Modification of k-means

Another approach is to realize that in this case the Euclidean distance between data points
is not the desired metric of similarity. The clusters should follow the trend of the data

Figure 26.4: Clustering after converting data to radial polar coordinates.

which is defined by the proximity of data points. Readers will see a spiral but this is merely
an illusion created by the density of data points. Thus, for this case, a better metric is
to measure the geodesic distances between data points. Two points that are neighbors
have a distance measured by the Euclidean distance, but two points that are farther apart
measure their distance as the shortest distance that connects through intermediate points.
Thus, if there are three points A, B, and C the distance between A and C is the distance
from A to B and then B to C. The geodesic distance is the shortest path that connects
data points.
In order to accomplish this modification it is necessary to compute the shortest
distance between all possible pairs of points. The Floyd-Warshall[Cormen et al., 2000]
algorithm performs this task in very few steps. The algorithm contains three nested
for-loops which in Python would run very slowly. So, the Python algorithm uses an outer-addition
algorithm that contains two of the for-loops. This function performs,

    M_{i,j} = a_i + b_j,  ∀ i, j.    (26.5)
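Equation (26.5) is exactly what np.add.outer computes; a small made-up example:

```python
import numpy as np

# np.add.outer builds M[i, j] = a[i] + b[j] in one call, replacing the
# two innermost for-loops of the classic Floyd-Warshall update.
a = np.array([1.0, 2.0])
b = np.array([10.0, 20.0, 30.0])
M = np.add.outer(a, b)
print(M)
# [[11. 21. 31.]
#  [12. 22. 32.]]
```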

The FastFloyd function in Code 26.16 computes the shortest geodesic distance to
all pairs of points. Even this more efficient version of the Floyd-Warshall algorithm can
take a bit of time and the print statement is merely to show the user progress of the
algorithm.
The input to FastFloyd is a matrix of all the Euclidean distances for all pairs

Code 26.16 The FastFloyd function.

1 # swissroll . py
2 def FastFloyd ( w ) :
3 d = w + 0
4 N = len ( d )
5 oldd = d + 0
6 for k in range ( N ) :
7 print ( str ( k ) + ' ' , end = ' ' )
8 newd = np . add . outer ( oldd [: , k ] , oldd [ k ] )
9 m = newd >700
10 newd = (1-m ) * newd + m * oldd
11 mask = newd < oldd
12 mmask = 1-mask
13 g = mask * newd + mmask * oldd
14 oldd = g + 0
15 return g

of points. The Floyd-Warshall algorithm will then search for shorter distances using
combinations of intermediate data points. Code 26.17 shows the Neighbors function,
which converts the data to Euclidean distances and then calls FastFloyd. The
result is a matrix that contains the geodesic distances for all possible pairs of points.
Finally, the k-means algorithm is modified. In the original version the vectors were
assigned to the cluster that was closest to the vector in a Euclidean sense. In this new
version the vector is assigned to the cluster that is closest in a geodesic sense. So, the
AssignMembership algorithm is modified. It first finds the data point that is closest to
each cluster. Then, it adds that distance to the geodesic distance of each data point to
this closest point. This is the distance from the cluster to all of the data points. These
distances are computed for all clusters. The last for-loop considers each data point,
finds the cluster that is closest, and assigns the data point to that cluster.
Code 26.18 displays the new AssignMembership function. Following it are the
Python commands to run the new k-means algorithm. Note that the ClusterAverage
function comes from the k-means module whereas the AssignMembership function uses
the newly defined function. Figure 26.5 displays the results from this modification. The
results show that the clusters tend to capture points along the spiral arm which is the
desired result.

26.4 Dynamic k-means

The number of clusters in the k-means algorithm is established by the user and usually
with very little information. If too few clusters are created then the variance in the clusters

Code 26.17 The Neighbors function.

1 # swissroll . py
2 def Neighbors ( data ) :
3 ND = len ( data )
4 d = np . zeros ( ( ND , ND ) , float )
5 for i in range ( ND ) :
6 for j in range ( i ) :
7 a = data [ i ] - data [ j ]
8 a = np . sqrt ( ( a * a ) . sum () )
9 d [i , j ] = d [j , i ] = a
10 return d
11

12 >>> dists = sss . Neighbors ( data )


13 >>> floyd = sss . FastFloyd ( dists )
14 >>> f = floyd **2

Figure 26.5: Clustering after modifying the k-means algorithm.

Code 26.18 The AssignMembership function.

1 # swissroll . py
2 def AssignMembership ( clusts , data , floyd ) :
3 mmb = []
4 NC = len ( clusts )
5 ND = len ( data )
6 for i in range ( NC ) :
7 mmb . append ( [] )
8 dists = np . zeros ( ( NC , ND ) , float )
9 for i in range ( NC ) :
10 d = np . zeros ( ND , float )
11 for j in range ( ND ) :
12 t = clusts [ i ] - data [ j ]
13 d [ j ] = np . sqrt ( sum ( t * t ) )
14 mn = d . argmin ()
15 mndist = d [ mn ]
16 dists [ i ] = mndist + floyd [ mn ]
17 for i in range ( ND ) :
18 mn = dists [: , i ]. argmin ()
19 mmb [ mn ]. append ( i )
20 return mmb
21

22 >>> import kmeans


23 >>> diff = 1
24 >>> c1 = kmeans.Init2( 5, data )
25 >>> while diff > 0:
26         mmb = sss.AssignMembership( c1, data, f )
27         c2 = kmeans.ClusterAverage( mmb, data )
28         diff = ( abs( c1.ravel() - c2.ravel() ) ).sum()
29         print( diff )
30         c1 = c2 + 0
31 >>> sss.GnuPlotFiles( mmb, data, 'mp' )

becomes large. This means that some clusters are collecting vectors that are not self-similar.
If there are too many clusters then some clusters are very similar to others. One method
of approaching this problem is to dynamically change the number of clusters. The system
needs to detect when there are too many or too few clusters and make the appropriate
adjustments.
The variance is measured by Equation (26.2) and remains small as long as the cluster
contains similar constituents. Dissimilar vectors will increase the variance, but Equation
(26.2) does not indicate which vector is the culprit. This can actually be determined, but
if there is more than one outlier then isolating the outliers does not necessarily
indicate how many new clusters are needed. Thus, a simple approach
is to detect that a cluster has a high variance and randomly split its vectors into two new
clusters and then allow the k-means iterations to sort it all out.
To detect if two clusters are similar the cluster average vectors are compared to one
another. If they are similar then the constituents of the two clusters can be combined into
a single cluster. This is also a very simple, but effective approach.
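The two detection tests can be sketched together. FlagClusters and its thresholds wmax and dmin are hypothetical, not part of the book's kmeans.py; the variance test mirrors ClusterVar and the distance test compares cluster averages.

```python
import numpy as np

def FlagClusters(clusts, mmb, data, wmax=0.01, dmin=0.2):
    # Clusters whose per-element variance (as in ClusterVar) is too
    # large are flagged for splitting.
    split = [k for k in range(len(mmb))
             if ((data[mmb[k]].std(0) ** 2).sum() / data.shape[1]) > wmax]
    # Pairs of cluster averages that are too close are flagged to join.
    join = []
    for i in range(len(clusts)):
        for j in range(i):
            if np.sqrt(((clusts[i] - clusts[j]) ** 2).sum()) < dmin:
                join.append((i, j))
    return split, join

# Two tight, overlapping clusters and one loose cluster.
data = np.array([[0., 0.], [0.1, 0.], [0., 0.1], [5., 5.], [9., 9.]])
clusts = np.array([[0.05, 0.05], [0.0, 0.05], [7., 7.]])
mmb = [[0, 1], [2], [3, 4]]
print(FlagClusters(clusts, mmb, data))   # ([2], [(1, 0)])
```

The flagged clusters would then be handed to the split and join steps described below.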
Code 26.19 generates a set of data with six seeds. The data is shown in Figure 26.6.
In this case two of the clusters overlap thus there are five blocks of data.

Code 26.19 A new problem.

1 >>> np.random.seed( 234 )


2 >>> data = clg . CData ( 1000 , 2 , 0.3 , 5 )
3 >>> c1 = kmeans . Init2 ( 5 , data )
4 >>> clust , mmb = kmeans . KMeans ( 5 , data )
5 >>> sss . GnuPlotFiles ( mmb , data , ' mp ' )
6 gnuplot > plot ' mp0 . txt ' , ' mp1 . txt ' , ' mp2 . txt ' , ' mp3 . txt ' , '
mp4 . txt '

Even knowing that there are five clusters does not guarantee that the k-means will
cluster correctly. The results from Code 26.19 are shown in Figure 26.6. The first cluster
is marked by crosses and shares a block of data points with two other clusters. The
second cluster is marked with xs and includes two blocks of data. Even though the data
is inherently well separated the clustering did not perform as expected.
Code 26.20 shows the intra-cluster variances and then the inter-cluster differences.
In the latter case the cluster numbers are printed before the difference between them.
The intra-cluster variance is used to measure the similarity within a cluster, and if it gets
too large then the cluster should be split. In the example case it is evident that
clusters 1 and 4 should be split. Thus a threshold between 0.007 and 0.010 is needed to define the
clusters that need to be split.
The inter-cluster difference measures the similarity between cluster average vectors.
If this is too small then the cluster averages are close together and the clusters should be
joined. It is obvious from Figure 26.6 that clusters 0, 2, and 3 should be combined. In Code

Figure 26.6: Five clusters data.

Code 26.20 Cluster variances.


1 >>> for i in range( 5 ):
2         print( "%.6f" % kmeans.ClusterVar( data[mmb[i]] ), end=' ' )
3

4 0.000749 0.015767 0.000763 0.000518 0.012182


5

6 >>> for i in range ( 5 ) :


7 for j in range ( i ) :
8 a = clust [ i ] - clust [ j ]
9 d = np . sqrt ( ( a * a ) . sum () )
10 print ( str ( i ) + ' ' + str ( j ) + ' ' + " %.3 f " % d )
11

12 1 0 1.023
13 2 0 0.080
14 2 1 1.046
15 3 0 0.083
16 3 1 0.963
17 3 2 0.084
18 4 0 0.490
19 4 1 0.567
20 4 2 0.495
21 4 3 0.417

26.20 it is seen that the differences among these cluster average vectors are less than 0.1 while
all other vector pairs have a distance greater than 0.4.
Dynamic clustering would then split clusters mmb[1] and mmb[4] each into two clusters
and combine clusters mmb[0], mmb[2], and mmb[3] into one. The splitting of a cluster is
performed randomly. Recall that mmb is a list and inside of it is a list for each cluster.
Randomly splitting a list involves creating two new lists and placing the constituents in
either one. In Code 26.21 m1 and m2 are the split of mmb[1]. Likewise m3 and m4 are
the split from mmb[4]. The m5 is the combination of the other three clusters. Figure 26.7
shows the results. The combination works well but the splitting was done randomly and
so vectors from both groups are in both clusters.
Code 26.21 The Split function.

1 # kmeans.py
2 def Split(mmbi):
3     m1, m2 = [], []
4     N = len(mmbi)
5     for i in range(N):
6         r = random.rand()
7         if r < 0.5:
8             m1.append(mmbi[i])
9         else:
10            m2.append(mmbi[i])
11    return m1, m2
12
13 >>> m1, m2 = kmeans.Split(mmb[1])
14 >>> m3, m4 = kmeans.Split(mmb[4])
15 >>> m5 = mmb[0] + mmb[2] + mmb[3]
16 >>> mmb = [m1, m2, m3, m4, m5]
17 >>> sss.GnuPlotFiles(mmb, data, 'mp')
The final step is to run the k-means as shown in Code 26.22. Figure 26.8 shows the
results, which are more in line with expectations.

26.5 Comments on k-means

As shown in the previous example k-means may not solve the simplest cases without some
aid. Or did it? The final solution shown in Figure 26.6 is better suited for the application.
In reality, the interpretation of the final results is completely up to the user. The danger
of using k-means (or any clustering algorithm) is to trust the results without testing.
Sometimes a different initialization will produce very different clusters. So, in designing a
problem that will be solved by k-means it is necessary to also design a test to see if the

Figure 26.7: New clusters after splitting and combining.
Code 26.22 The final clustering.

1 >>> c2 = kmeans.ClusterAverage(mmb, data)
2 >>> c1 = c2 + 0
3 >>> diff = 1
4 >>> while diff > 0:
5         mmb = kmeans.AssignMembership(c1, data)
6         c2 = kmeans.ClusterAverage(mmb, data)
7         diff = abs(c1 - c2).sum()
8         print(diff)
9         c1 = c2 + 0
10 >>> sss.GnuPlotFiles(mmb, data, 'mp')

Figure 26.8: Clusters after running Code 26.22.

clusters are as desired. It may be necessary to compute new clusters, change the data,
change the algorithms, or split and combine clusters.
Finally, large problems may consume too much computer time and so a process
of hierarchical clustering can be employed. Basically, the data is clustered into a small
number of clusters (thus keeping computations to a minimum). Once those clusters are
computed the data in each can be clustered again into smaller sub-clusters.
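A hierarchical pass can be sketched with a small self-contained k-means. This toy implementation and the generated data stand in for the book's kmeans module; it is only meant to show the coarse-then-fine pattern.

```python
import numpy as np

def tiny_kmeans(data, k, iters=25, seed=0):
    # Toy k-means: random initial centers, then alternate between
    # assigning members and recomputing the cluster averages.
    rng = np.random.default_rng(seed)
    cents = data[rng.choice(len(data), k, replace=False)].astype(float)
    labels = np.zeros(len(data), int)
    for _ in range(iters):
        dists = ((data[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            if (labels == c).any():
                cents[c] = data[labels == c].mean(0)
    return labels

# Hierarchical use: one coarse K=2 pass over everything, then each
# half is clustered again, keeping every individual run small.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(m, 0.05, (20, 2))
                  for m in (0.0, 1.0, 5.0, 6.0)])
top = tiny_kmeans(data, 2)
subs = [tiny_kmeans(data[top == c], 2) for c in (0, 1)]
```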

26.6 Summary

Clustering is a generic class of algorithms that attempts to organize the data in terms
of self-similarity. This is a difficult task as similarity measures may be inadequate. One
of the most popular methods of clustering is k-means which requires the user to define
the number of desired clusters and the similarity metric. The algorithm iterates between
defining clusters and moving data vectors between clusters. It is a very easy algorithm to
implement and can often provide sufficient results.
However, more complicated problems will require modifications to the algorithm.
This will require the user to understand the nature of the data and to define data conver-
sions to improve performance.
Users should be very aware that there is no magic clustering algorithm. It is neces-
sary to understand the problem, the source and nature of the data, and to have expecta-
tions of results. Clustering results should be tested to determine if the clusters have the

desired properties as well.

Problems

1. Create a set of vectors of the form cos(0.1x + r). Each vector should be N in length.
The x is the index (0, 1, 2, . . . , N−1) and r is a random number. Cluster these vectors
using k-means. Plot all of the vectors in a single cluster in a single graph. Repeat
for all clusters. Using these plots show that the k-means clustered the vectors correctly.

2. Repeat Problem 1 using cos(0.1x) + r. Compute the clusters using k-means and
plot. Explain what the clustering algorithm did.

3. Modify k-means such that the measure of similarity between two vectors is not the
distance but the inner product.

4. Using Problem 3 repeat Problem 2.

5. Modify k-means so that it will cluster strings instead of vectors. Create many random
DNA strings of length N . Cluster these strings. Each cluster should have a set of
strings in which some of the elements are common. In other words, the first cluster
contains a set of strings in which all of the m-th elements are 'T'. For each cluster find
the positions in the strings that have the same letter.

6. Repeat Problem 5 but for each cluster find the positions in the strings that have
common letters. For example 75% of the m-th element in the strings in the n-th
cluster were ’A’.

7. Hierarchical clustering. Generate data similar to Figure 26.6. Run k-means for
K = 2. Treat each of the clusters as a new data set. Run k-means on each of the
new data sets. Plot the results in a fashion similar to Figure 26.8.

Chapter 27

Text Mining

Biological information tends to be more qualitative than quantitative. The result is that
a lot of the information is presented as textual descriptions rather than equations and
numbers. Thus, a field of mining biological texts for information is emerging. Like many
topics in this book this field is large in scope and evolving. Thus, only a few introductory
topics are presented here and readers desiring more information should consider
resources dedicated solely to this topic.

27.1 Introduction

The goal of text mining in this chapter is to extract information from written documents.
While that sounds fairly straightforward, it is in fact a difficult task. A scientific document
presents information in many different forms: text, equations, tables, figures, images, etc.
Each of these requires a separate method of extracting and understanding the information
from the text. For this chapter the concern will be limited to only the text.
Even if the text is extracted and statistically analyzed, there is no direct path to
grasping the content contained within the document. The document offers text but there is
still the desire to extract an understanding of the ideas therein. This is a most difficult
task that has kept researchers busy for several decades, and will continue to do so. This
chapter will consider simple methods of comparing documents and thereby associating
documents. This is only the basics of a burgeoning field.

27.2 Data

The data set starts with written texts which are now abundantly available from web
resources such as CiteSeer. These are commonly provided as PDF documents which need
to be converted to text files so they can be loaded into Python routines. Some PDF files

allow the user to save the file as a text and some will allow the user to copy and paste the
text into a simple text editor. There are also programs available that will convert PDF
files into text files. Programs such as pyPdf [Fenniak, 2011] can be employed to read PDF
files directly into a Python program.
The text file will contain more than just the text. Symbols will appear where
the original text had equations or images. Furthermore, the text contains punctuation,
capitalizations, and non-alphabetic characters. Since the purpose is to associate text
between documents it is necessary to remove many of these spurious characters.
Code 27.1 shows the Hoover function which cleans up the text string. Line 3
converts all letters to lower case and line 4 converts all newline characters to spaces. Each
letter has an ASCII integer equivalent. The space character is 32 and ’a-z’ is 97-122. The
chr function converts the integer into a character. This function replaces all characters
that do not have a valid ASCII code with an empty string, effectively removing these
characters. This step can easily remove more than 10% of the characters from the original
string.
Code 27.1 The Hoover function.

1 # miner.py
2 def Hoover(txt):
3     work = txt.lower()
4     work = work.replace('\n', ' ')
5     valid = [32] + list(range(97, 123))
6     for i in range(256):
7         if i not in valid:
8             work = work.replace(chr(i), '')
9     return work
10
11 >>> txt = open('pdf.txt').read()
12 >>> clean = miner.Hoover(txt)
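The loop in Code 27.1 calls replace once per removed character, so it scans the string many times. Python's str.translate can do the same cleanup in a single pass; this equivalent sketch is not from the book.

```python
# Build one translation table that maps every character outside
# 'a'-'z' and the space to None (i.e., delete it).
keep = {32} | set(range(97, 123))
table = {c: None for c in range(256) if c not in keep}

def hoover_translate(txt):
    # Same result as Hoover: lower-case, newlines to spaces, then
    # strip everything that is not a lower-case letter or a space.
    return txt.lower().replace('\n', ' ').translate(table)

print(hoover_translate('Hello, World!\n42'))  # 'hello world '
```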

27.3 Creating Dictionaries

Python offers several tools that can manipulate long strings of data and the fastest is the
dictionary. For example it may be desired to know the location of every word in the text.
Each word is used as a key and the data for each key is a list of the locations of that
word. The function AllWordDict in Code 27.2 creates a dictionary dct in line 3 which
considers each word in the list work. If the word is not in the dictionary then an entry
is created in line 10 using the word as the key and a list containing the variable i as the
data. If the word is already in the dictionary then the list is appended with the value i
in line 8.

Code 27.2 The AllWordDict function.

1 # miner.py
2 def AllWordDict(txt):
3     dct = {}
4     work = txt.split()
5     for i in range(len(work)):
6         wd = work[i]
7         if wd in dct:
8             dct[wd].append(i)
9         else:
10            dct[wd] = [i]
11    return dct
12
13 >>> dct = miner.AllWordDict(clean)
14 >>> len(dct)
15 745

It should be noted that the variable i is the location in the list work and not a
location in the string. For the example text used, the word 'primitives' appeared in three
locations. In Code 27.3 the first 100 characters of the text are shown and the entry from
the dictionary for the word 'primitives' is also shown. As can be seen, the first returned
value is 1 which corresponds to the second word in the text and not a position in the string.
In many text mining procedures the distance between two words a and b is measured by
the number of words between them instead of the number of characters between them.

Code 27.3 A list of cleaned words.

1 >>> clean[:100]
2 'image primitives jason m kinser bioinformatics and computational biology george mason university man'
3 >>> dct['primitives']
4 [1, 2098, 2509]
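Because the stored positions are word indices, word-to-word distances are simple index differences. A toy dictionary with made-up positions illustrates the idea:

```python
# Positions are indices into the split word list, not character offsets.
dct = {'image': [0, 57], 'primitives': [1, 2098, 2509]}

def word_distance(dct, a, b):
    # Smallest number of index steps between any occurrences of a and b.
    return min(abs(i - j) for i in dct[a] for j in dct[b])

print(word_distance(dct, 'image', 'primitives'))  # 1 -- adjacent words
```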

In this example there are 745 individual words. However, many of them are simple
words such as ‘and’, ‘of’, ‘the’, etc. which are not useful. Another concern is that some
words are similar except for their ending: ‘computations’, ‘computational’, etc. Dealing
with these issues is rather involved and for the current discussion a simple approach is
used which can be replaced later. In this simple approach only the first five letters of the
words are used. Words that are shorter than five letters are discarded and words with the
same first five letters are considered to be the same word. This is horrendously simple and
certainly not the approach that a professional system would use. However, this chapter is
designed to demonstrate methods of relating documents and not as concerned with word

stemming. Thus, the simple method, which does perform well enough, is favored over a
more involved but significantly better method.
Code 27.4 shows the function FiveLetterDict which modifies AllWordDict to
include only words of five letters or more and to only consider the first five letters. The
number of entries in this dictionary used in this example are nearly half that of the previous
dictionary.
Code 27.4 The FiveLetterDict function.

1 # miner.py
2 def FiveLetterDict(txt):
3     dct = {}
4     work = txt.split()
5     for i in range(len(work)):
6         wd = work[i]
7         if len(wd) >= 5:
8             wd5 = wd[:5]
9             if wd5 in dct:
10                dct[wd5].append(i)
11            else:
12                dct[wd5] = [i]
13    return dct
14
15 >>> dct = miner.FiveLetterDict(clean)
16 >>> len(dct)
17 425

27.4 Methods of Finding Root Words

The use of the first five letters is a very simple (and poorly performing) solution to a
complicated problem. This section presents a few other approaches that could be used.

27.4.1 Porter Stemming

Porter Stemming[Porter, 2011] is a method that attempts to remove suffixes from English
words. This procedure attempts to remove or replace common suffixes such as -ing, -ed,
-ize, -ance, etc. This is not an easy task as the rules do not remain constant. For example,
the word ‘petting’ should be reduced to ‘pet’ whereas ‘billing’ reduces to ‘bill’. In one
case one of the double consonants is removed and in the other case it is not. Still more
confounding are words that have one of the target suffixes but it is part of the root word,
such as ‘string’ which ends with ‘ing’.

Computer code for almost any language, including Python, is found at [Porter, 2011].
While this program works well for many different words it is not perfect. Code
27.5 shows some of the more disappointing results. These are not shown to belittle
Porter Stemming but rather to demonstrate that algorithms do not perform perfectly and
the reader should be aware of performance issues in the programs that they use. Many more
example words were properly stemmed and this example merely shows that stemming is
a very difficult task.
Code 27.5 A few examples that failed in Porter Stemming.

1 >>> import porter
2 >>> ps = porter.PorterStemmer()
3
4 >>> w = 'running'
5 >>> ps.stem(w, 0, len(w)-1)
6 'run'
7 >>> w = 'gassing'
8 >>> ps.stem(w, 0, len(w)-1)
9 'gass'
10
11 >>> ps.stem('conditioning', 0, 11)
12 'condit'
13 >>> w = 'conditioner'
14 >>> ps.stem(w, 0, len(w)-1)
15 'condition'
16
17 >>> w = 'runnable'
18 >>> ps.stem(w, 0, len(w)-1)
19 'runnabl'
20 >>> w = 'doable'
21 >>> ps.stem(w, 0, len(w)-1)
22 'doabl'
23 >>> w = 'excitable'
24 >>> ps.stem(w, 0, len(w)-1)
25 'excit'
26
27 >>> w = 'atomizer'
28 >>> ps.stem(w, 0, len(w)-1)
29 'atom'
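These failures come from the fact that suffix rules are not uniform. A deliberately naive suffix stripper (not Porter's algorithm; the rule list is invented for illustration) shows how quickly simple rules break down:

```python
# Toy rules: strip a suffix only if at least three letters remain.
RULES = [('ies', 'y'), ('ing', ''), ('ed', ''), ('s', '')]

def naive_stem(word):
    for suffix, repl in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)] + repl
    return word

print(naive_stem('billing'))  # 'bill'
print(naive_stem('string'))   # 'str' -- wrong: 'ing' is part of the root
```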

27.4.2 Suffix Trees

A suffix tree is a common tool for organizing string information. There are several flavors
of suffix trees and so the one used here is designed for identifying suffixes that can be
removed. Given a string of letters a suffix tree builds branches at locations in which words
begin to differ. An example is (‘battle’, ‘batter’, ‘bats’) in Figure 27.1. In this case the
first node in the tree is ‘bat’ because all words in the list begin with ‘bat’. At that point
there is a split in the tree as some words have a ‘t’ for the fourth letter and another has
an ‘s’. Along the ‘t’ branch there is another split at the next position. The goal would
be to identify groups of nodes that commonly following a stem. In this case, three of the
four subsequent nodes are common suffixes and the other node (‘t’) is a common addition
before some stems.

Figure 27.1: A simple suffix tree.

In Porter Stemming an attempt was made to identify a suffix by examining a
single word. The suffix tree instead takes into consideration the other types of
suffixes associated with the same root.
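The grouping in Figure 27.1 can be approximated with a longest-common-prefix computation. This small sketch is not a full suffix tree, but it recovers the shared stem and the branches hanging off it:

```python
import os

# 'battle', 'batter' and 'bats' share the stem 'bat'; the remainders
# are the branches that hang off that node in Figure 27.1.
words = ['battle', 'batter', 'bats']
stem = os.path.commonprefix(words)
branches = sorted(w[len(stem):] for w in words)
print(stem, branches)  # bat ['s', 'ter', 'tle']
```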

27.5 Document Analysis

In this section the simple task of comparing documents according to word frequencies
is considered. Certainly, document analysis is a far more complicated topic and readers
interested in this topic are encouraged to examine research that exceeds the scope of this
text.
The tasks to be accomplished here are to extract the frequencies of words, to find
words that are seen more (or less) frequently than normal, and to isolate words that are
indicative of the topic.

27.5.1 Data

Data consists of documents concerning at least two different topics. In this example the
topics are actin fibers and CBIR (content based image retrieval). These are very different
topics and so there should be a set of words that strongly indicate which topic a document
discusses.
The phrase positive documents is applied to those documents concerning a target
topic. In this case, actin fibers are considered to be the positive topic and thus all documents
concerning this topic are considered to be positive. All other documents are considered
to be negative. In this case there is only one other topic, but usually there are multiple
topics and so the negative documents come from all topics that are not actin fibers.
To facilitate organization, documents for a single topic are located in a single directory.
Thus, it is easy to load all documents from a single topic. Each directory should have
several documents and the function AllDcts shown in Code 27.6 gathers the dictionaries
for all documents in a single directory. It receives two arguments of which the first is a list
of dictionaries. Initially this is an empty list, but as more topics are considered it grows. The
second argument is a specific directory. The function is called in line 11 with an empty
list as the input. This creates all of the dictionaries for the actin topic. As seen,
there are 23 documents. The second call to AllDcts processes the CBIR documents and
the list dcts grows to 48. This process can continue for each topic.
Code 27.6 The AllDcts function.

1 # miner.py
2 def AllDcts(dcts, indir):
3     nmlist = GetList(indir)
4     for i in range(len(nmlist)):
5         fname = indir + '/' + nmlist[i]
6         txt = open(fname, encoding="ascii", errors="surrogateescape").read()
7         clean = Hoover(txt)
8         dcts.append(FiveLetterDict(clean))
9
10 >>> dcts = []
11 >>> miner.AllDcts(dcts, 'data/mining/actin')
12 >>> len(dcts)
13 23
14 >>> miner.AllDcts(dcts, 'data/mining/cbir')
15 >>> len(dcts)
16 48

The final result is a list of dictionaries. In this case it is known that the first 23
dictionaries are related to actin documents and the next 25 are related to CBIR documents.

27.5.2 Word Frequency

The word frequency matrix wfm will contain the frequency of each word in each document
with wfm[i,j] equal to the j-th word in the i-th document. The construction of wfm
begins with the word count matrix wcm which collects the number of times the j-th word
is seen in the i-th document. However, each document has a different set of words and
so it is prudent to collect the list of words from all documents before allocating space for
wcm.
The word list is created from GoodWords shown in Code 27.7. This program loops
through the individual dictionaries and collects all of the words into gw. Since words can
appear in more than one document the set function is used to pare the list down to one
copy of each individual word. The list function is used to convert the set back to a
list for processing in subsequent functions. In all of the documents that were considered
there were 8028 unique words that had five or more letters.
Code 27.7 The GoodWords function.

1 # miner.py
2 def GoodWords(dcts):
3     ND = len(dcts)
4     gw = []
5     for i in range(ND):
6         gw = gw + list(dcts[i].keys())
7     gw = set(gw)
8     gw = list(gw)
9     return gw
10
11 >>> gw = miner.GoodWords(dcts)
12 >>> len(gw)
13 8028

The dimensions of wcm are ND × NW where ND is the number of documents and NW


is the number of unique words. Code 27.8 shows the function WordCountMat which
determines the values of ND and NW and then flows into a nested loop. The loop starting
on line 6 considers each document and the loop starting on line 7 considers each word.
Recall that the entry for a dictionary is a list containing the locations of the word in the
text. So the number of times that a word appears in the text is simply the length of the
list for the dictionary entry. The example shows the first twelve words in the dictionary
and the number of times that each appears in the individual documents. Recall also that
the dictionary will rearrange its contents and so the words are not in alphabetical order.
The first word is ‘yanag’ which is actually a person’s name. It is seen just once and
that is in document[2]. Code 27.9 shows that the most common word appears 4437 times.
Now, this list of words excludes words that are less than 5 letters and so very common

Code 27.8 The WordCountMat function.

1 # miner.py
2 def WordCountMat(dcts, goodwords):
3     ND = len(dcts)
4     LW = len(goodwords)
5     wcmat = np.zeros((ND, LW), int)
6     for i in range(ND):
7         for j in range(LW):
8             if goodwords[j] in dcts[i]:
9                 wcmat[i, j] = len(dcts[i][goodwords[j]])
10    return wcmat
11
12 >>> wcmat = miner.WordCountMat(dcts, gw)
13 >>> wcmat.shape
14 (48, 8028)
15 >>> wcmat[:10, :12]
16 array([[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0],
17        [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
18        [1, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0],
19        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
20        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
21        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
22        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
23        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
24        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
25        [0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0]])
26 >>> gw[:12]
27 ['yanag', 'signs', 'stroh', 'scita', 'passe', 'dryin', 'empha',
   'loose', 'loren', 'miche', 'among', 'jerky']
words such as ‘the’ and ‘and’ are not included. The location of this word is at position
6380 and as seen the most common word is ‘image’. This makes sense since the second
topic is on a type of image analysis, and ‘image’ is a common word that could easily appear
in the actin documents.
Code 27.9 A few statistics.

1 >>> wcmat.sum(0).max()
2 4437
3 >>> wcmat.sum(0).argmax()
4 6380
5 >>> gw[6380]
6 'image'
The frequency of a word is the number of times a word is seen divided by the total
number of words. This is defined as the probability,

P(W_{i,j}) = \frac{C_{i,j}}{\sum_j C_{i,j}},          (27.1)

where Ci,j is the i, j component of the word count matrix represented by wcm in the Codes.
Code 27.10 shows the WordFreqMatrix function which converts the wcm to the word
frequency matrix wfm by performing the first order normalization from Equation (27.1)
on each row of the matrix. This shows that document[9] contains almost 75% of the
occurrences of ‘wasps’.
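Equation (27.1) is a normalization over one axis of the count matrix; on a tiny hypothetical count matrix the row version looks like this:

```python
import numpy as np

# Each row (document) is divided by that row's total count,
# turning raw counts into per-document frequencies.
wcm = np.array([[2., 1., 1.],
                [0., 3., 1.]])
wfm = wcm / wcm.sum(axis=1, keepdims=True)
print(wfm[0])  # [0.5  0.25 0.25]
```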
The next step determines if this frequency is above or below the normal which
requires the average of each column be computed. The probability of a word occurring in
any document is computed by,
P(W_j) = \frac{\sum_i F_{i,j}}{\sum_{i,j} F_{i,j}},          (27.2)

where Fi,j is the word frequency matrix represented by wfm in Code 27.10.
The WordProb function in Code 27.11 normalizes each column and computes the
probability of each word occurring in any document. The results for three of the words are
printed and the word ‘wasps’ has a probability of appearing in any document of 3.54×10−3 .
With normalized data in hand it is possible to relate one word to another and
therefore relate sets of documents to each other.
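The normalization in Equation (27.2) can be sketched on a toy matrix (WordProb in Code 27.11 applies the same column-sum-over-grand-total idea to wcmat); the numbers here are invented:

```python
import numpy as np

# Column totals divided by the grand total give each word's
# probability of occurring anywhere in the collection.
wcm = np.array([[2, 1, 1],
                [0, 3, 1]])
pvec = wcm.sum(0) / wcm.sum()
print(pvec)  # [0.25 0.5  0.25]
```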

27.5.3 Indicative Words

The overall goal is to classify documents based on their word frequency. Thus, the search is
for documents that have a set of words that are seen frequently in the positive documents

Code 27.10 The WordFreqMatrix function.

1 # miner.py
2 def WordFreqMatrix(wcmat):
3     V = len(wcmat)
4     pmat = np.zeros(wcmat.shape, float)
5     for i in range(V):
6         pmat[i] = wcmat[i] / float(wcmat[i].sum())
7     return pmat
8
9 >>> np.set_printoptions(precision=4)
10 >>> wfmat = miner.WordFreqMatrix(wcmat)
11 >>> wfmat[:10, :5]
12 array([[ 0.    ,  0.    ,  0.    ,  1.    ,  0.    ],
13        [ 0.    ,  0.0502,  0.    ,  0.    ,  0.    ],
14        [ 1.    ,  0.    ,  0.    ,  0.    ,  0.    ],
15        [ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ],
16        [ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ],
17        [ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ],
18        [ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ],
19        [ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ],
20        [ 0.    ,  0.    ,  0.    ,  0.    ,  0.6561],
21        [ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ]])

Code 27.11 The WordProb function.

1 # miner.py
2 def WordProb(wcmat):
3     vsum = wcmat.sum(0)
4     tot = vsum.sum()
5     pvec = vsum / float(tot)
6     return pvec
7
8 >>> wpr = miner.WordProb(wcmat)
9 >>> wpr[:3]
10 array([ 3.5483e-04, 1.2673e-05, 2.5345e-05])

and quite rarely in the negative documents. Furthermore, it is desired that the positive
words appear in a large number of positive documents.
Using the normalized data this process is performed by the IndicWords function
shown in Code 27.12. It computes the word frequency matrix and the probability vector
and then creates two new vectors. The first is pscores which are the scores of the positive
documents. These scores will be high if the words appear mostly in the positive documents
and in several positive documents. The counter to this is nscores which is the same score
for the negative documents. The final score is the ratio of the two with a little value p
included to prevent divide by zero errors. The output is a vector of scores. The highest
score is the most indicative word.

Code 27.12 The IndicWords function.

1 # miner.py
2 def IndicWords(wcm, pdoc, ndoc, mincount=5):
3     wfmat = WordFreqMatrix(wcm)
4     wpr = WordProb(wcm)
5     mask = (wcm.sum(0) > mincount).astype(int)
6     vals = wfmat * wpr * mask
7     pscores = vals[pdoc].sum(0)
8     nscores = vals[ndoc].sum(0)
9     p = 0.001 * nscores.max()
10    scores = pscores / (nscores + p)
11    return scores
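The effect of the ratio score can be seen on a toy frequency table; the values below are invented for illustration:

```python
import numpy as np

# Rows are documents, columns are words; docs 0-1 are positive.
vals = np.array([[0.4, 0.0],
                 [0.5, 0.1],
                 [0.0, 0.6]])
pscores = vals[[0, 1]].sum(0)    # word totals over the positive docs
nscores = vals[[2]].sum(0)       # word totals over the negative docs
p = 0.001 * nscores.max()        # small value guards against divide-by-zero
scores = pscores / (nscores + p)
print(scores.argmax())  # 0 -- the first word marks the positive topic
```

A word frequent in positive documents and rare in negative ones receives a large ratio, exactly the property IndicWords exploits.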

The function is called in Code 27.13 and the highest score is shown to be 603. The
scores are sorted in reversed order so that the highest scores are first in the list. Thus,
the word with the highest score is word 7125 which is shown to be the word ‘actin’. This
makes sense since this is the topic of the positive documents and it is not a word that
should coincidentally appear in the CBIR documents.

Code 27.13 Using the IndicWords function.

1 >>> scores = miner.IndicWords(wcmat, pdocs, ndocs, 5)
2 >>> scores.max()
3 603.21153620299344
4 >>> ag = scores.argsort()[::-1]
5 >>> ag[0]
6 7125
7 >>> gw[7125]
8 'actin'

27.5.4 Document Classification

The final step is to classify a document. This step should be applied to a new document
rather than the training documents. That is left as an exercise for the readers. Instead
the scores of the training documents are computed.
The process is simply to accumulate the scores of the words that are in a document.
This is a very simple approach and certainly more involved scoring algorithms can be
created. This method, for example, does not consider how many times a word is in a
document. This process also does not consider the length of the document. Code 27.14
shows the process. The value sc is the score and is set to 0. Starting in line 2 the dictionary
from the first document is considered. The index ndx of each word in the dictionary is
obtained and the score for that index is accumulated. The score for this document is
almost 4883. This is a positive document.

Code 27.14 Scoring documents.

1 >>> sc = 0
2 >>> for k in dcts[0].keys():
3         ndx = gw.index(k)
4         sc += scores[ndx]
5 >>> sc
6 4882.9162008938792
7
8 >>> sc = 0
9 >>> for k in dcts[25].keys():
10        ndx = gw.index(k)
11        sc += scores[ndx]
12 >>> sc
13 473.6345735281418

A negative document is considered in the second half of Code 27.14. This is a CBIR
document and as seen the score is quite low.
The real goal is to consider a new document that has not been classified by the
reader. The document is cleaned and its dictionary is created using the same steps as
above. Then the score is computed. If the score is high then it is considered to be a
positive document.
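Scoring an unseen document can be sketched as a function, with a lookup dict replacing the repeated gw.index calls (which are O(N) each). The words and scores below are invented for illustration:

```python
def score_document(dct, gw, scores):
    # Sum the scores of every known word in the document's dictionary;
    # words outside the training vocabulary are simply skipped.
    lookup = {w: i for i, w in enumerate(gw)}
    return sum(scores[lookup[w]] for w in dct if w in lookup)

gw = ['actin', 'image', 'fiber']
scores = [10.0, 0.5, 8.0]
doc = {'actin': [4], 'fiber': [9], 'novel': [17]}
print(score_document(doc, gw, scores))  # 18.0
```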

27.6 Summary

The process outlined above was simple compared to more professional approaches. In
its simplicity there were several shortcuts that were taken, yet the process still shows a
great potential for crudely classifying documents. Improvements can include stemming

and better scoring algorithms.
This process only considered word frequency and did not consider the relationships
between words. Is there more information if common pairings of words are considered?
Obviously, syntax and meaning were not even discussed but could be valuable tools in
text mining.

Problems

1. The GoodWords function reduced the number of words. Compute the percentage
of words that were kept by this process compared to the original number of unique
words.

2. Gather documents for a single topic. What is the most common word in those
documents?

3. Do the frequencies of simple words (’a’, ’the’, ’and’, etc.) change for documents
concerning different topics?

4. Repeat the process that ended in Code 27.14 but use the CBIR documents as the
positive documents. Compute the scores for the same two documents that were
scored in 27.14.

5. Consider two similar topics. Collect documents on two topics ‘gene sequencing’ and
‘gene functionality’. Determine the score (such as in Code 27.14) for each document
and declare if this method can classify closely related documents.

6. Consider two similar topics. Collect documents on two topics ‘gene sequencing’
and ‘gene functionality’. What are the strongly indicative words that can separate
documents for this case? Are these words indicative of the topics?

Part IV

Database

Chapter 28

Spreadsheet and Databases

Data sets often have multiple entities and efficient queries of that data require that the
data be organized. Consider a biology case in which the data from a species includes
several genomes, each with thousands of genes, each with similarities to other genes, and
a list of publications that deal with the genes. To complicate matters, the publications
could cover a small subset of genes from different genomes. The data set is complicated
and a myriad of queries can be conceived.
This data set contains different data types and various connections between the
data. It is possible to store the data in a flat file which is basically all of the data in one
table, but that would be very inefficient and highly prone to errors. A more organized
approach would be to have tables of data dedicated to individual data types or connections.
For example, one table (or page in a spreadsheet) would contain information about the
genomes. Another table would contain information about the individual genes. Both of
these tables are dedicated to storing information about a unique type of data. A third
table would then be used to contain information about which gene is in each genome. This
table is dedicated to the connections between data types.
Certainly, a spreadsheet could archive several tables of various data types as long
as the amount of data does not exceed the limits of the spreadsheet program. It is also
possible to ask questions of the data using the spreadsheet. However, as the questions
become complicated the inherent limitations of a spreadsheet become apparent. Thus
comes the need for a DBMS (database management system). This chapter will explore
the use of a spreadsheet in pursuing some queries and the following chapters will explore
the same issues with the use of a DBMS.
A DBMS offers far more utility than just the pursuit of a complicated query. There
are several issues that are raised when dealing with large data, dynamic data, several users
and so on. A spreadsheet program has strict limits when dealing with some of these issues.
These are:

1. Data Redundancy: It is certainly possible to place the same data in multiple

locations in the tables. Such a case indicates that the database is poorly designed.
The main problem is that data changes and if it changes in one location but not
another then the data disagrees with itself, rendering the results unreliable.

2. Difficulty in Accession: If person A has the data how can person B get to it?

3. Data Isolation: Different masters store data in different manners. It is possible


that if person A stores data in a weird manner then person B may not be able to
access it at all.

4. Integrity: Some types of data must stay within bounds. A zip code should have
five digits.

5. Atomicity: Account A sends $25 to account B. However, during the transfer there
is a fault in one of the computer systems. The money is subtracted from account
A but never added to account B. Such data transfers should be atomic. Either it
completely works or it is aborted.

6. Concurrent Access: Consider a case in which person A and person B both have
access to the same bank account. The account has $100 and both try to withdraw
$60 at the same time. The system should not allow both withdrawals.

7. Security: Who is allowed to see which data? Who is allowed to alter which data?

28.1 The Movie Database

In order to demonstrate the functionality of the different methods of query a simple data
set is employed. The use of scientific databases often comes with two different issues. The
first is the question of how to use the query systems and the second is to understand the
contents of the data. As the second concern is not important to the following chapters a
much simpler database will be used to demonstrate the different query commands.
This database is an extremely abridged movie database that contains only a few
movies and only a few actors (or actresses) from those movies. While it is very incomplete
it is sufficient to demonstrate the query processes.
Whether in a spreadsheet or a database, the data is stored in a set of tables. In a
spreadsheet this is a collection of pages in the file. The movie database has seven tables
in two categories. The first category is the collection of tables that contain collected data.
These are:

• The movie table contains the name of the movies, the year released and a quality
grade.

• The actor table contains the first and last names of the actors.

• The country table contains the list of countries in which the movies were filmed.

• The lang table contains the list of languages in the movies.

The second category are the tables that connect the previous tables together. These
are:

• The isin table connects movies to actors.

• The inlang table connects languages to movies.

• The incountry table connects countries of filming to movies.

In a spreadsheet each row has a unique identifier which is the row number. The
same uniqueness applies to databases in which each table must have a primary key. So,
one column of data is designated as this key and it must contain unique data. In the case
of movies none of the data fields qualify. There are movies with the same name, movies
that are released in the same year and movies that have the same grade. Therefore it is
necessary to add a new field of unique integers to serve as the primary key. In a
spreadsheet this field looks redundant with the row numbers, but as this data will be
migrated to a database in later chapters the primary key is included in all tables.
The beginning of the movie table is shown in Figure 28.1. There are four fields or
columns. The first is the primary key, mid, which is just incrementing integers. The other
three columns are the name, year and grade. As seen, not all of the fields have values.
While the data is available, some entries are omitted to simulate the missing values
that are common in real data collection.

Figure 28.1: The movie data.

Figure 28.2 shows the beginning of the actor data. There are three columns which
are the aid (actor ID), first and last names. The data is not sorted in any particular
order. Sorting will be performed during the query. As actors are added to the database
they will be appended to the bottom of the list. It is important that the association of
actors with their aid never change, which means that the data is usually stored in the
order in which it was collected rather than in a sorted order.
It is possible to store this data in a single table. For example a table could be created
that has the movie name, year, grade, and several columns for the names of the actors.
Such a design causes issues. The first is that the number of actors is not a set number and
in fact does not have a maximum value. The next movie recorded could have more actors
than any other movie to date. The second problem is that actors appear in many movies
and so their names would appear in multiple rows. It is possible that one entry could be

Figure 28.2: The actor data.

misspelled and then the actor has two different names in the database. One rule of thumb
for designing a database is that the data should not be duplicated. So, the actor’s name
should appear only once as it does in the current design. The third problem involves the
design of the queries. In the proposed flat file it would be easy to find the names of the
actors in a single movie, but it would be cumbersome to find all of the movies from a
single actor.
The proper design then creates one table that contains information about individual
movies. The data contained there have single entries in each field. In other words, this
includes information such as name, year and grade of which each movie only has one value.
Information that has multiple entries such as the list of actors, countries used in filming
and languages are then placed in other tables. The actor table contains information that
is unique to each actor. In this case, that is their names, but a more extensive database
would contain a birth date, birth location, and other information of which each actor has
only one entry.
The connection of the movies and actors is contained in the isin table shown in
Figure 28.3. This table contains three columns. The first is iid which is the primary
key and merely incrementing values. The other two are the mid and aid which relate the
movie ID to the actor ID. The first entry has mid = 3 and aid = 4. In the movie table
the mid of 3 is the movie A Touch of Evil and the actor with aid = 4 is Orson Welles.
In this manner, the isin table connects the actors and movies. It is the same amount of
work to collect all of the actors in a given movie as it is to find all of the movies of a given
actor.

Figure 28.3: The connection between actors and movies.

This database is very small and very incomplete. Here only a few actors are listed
for each movie and some movies have no actors in the database. Furthermore, readers
may disagree with the grade of some movies as this is merely an opinion garnered from
one of several sources.
The fields in each table are shown in Figure 31.1. Each block is a table and the first

entry is the primary key, which in this database is always an integer. The rest of the fields
are self-explanatory.
The isin table connects the movies and actors. In a similar fashion the inlang table
connects movies and languages through the mid and lid, and the incountry table connects
movies and countries through a cid. Now it is possible to answer questions such as: In
which countries were the movies starring Daniel Radcliffe filmed? The query would start
with the name of the actor, fetch his aid, then his movie mid values, and from there the
countries of those movies.

28.2 The Query List

Now that a database is in place it is possible to ask a series of questions or queries. These
queries will be used both in the spreadsheet and database chapters. The goal is to show
how such queries are approached and that spreadsheets are limited in their ability to
retrieve answers to queries.
The list is:

1. What is the name of the movie with mid = 200?

2. List the name of the movies that were released in 1955.

3. List the movies (name, grade, and mid) with a grade of 9 or 10.

4. List the name of the movies that were released in the 1950’s.

5. List the years of the movies that had a grade of 1, but list each year only once.

6. Return the number of actors in the movie with mid = 200.

7. Compute the average grade of the movies from the 1950’s.

8. Compute the average and standard deviation of the length of the movie names.

9. List the first names of the actors whose last name is Keaton.

10. List the first and last names of the actors with the letters “John” in the first name.

11. List the first and last names of the actors that have two parts in the first name field.

12. List the actors with matching initials.

13. List the last names in alphabetical order of all of the actors that have the first name
of John.

14. List the five movies with the longest title.

15. List the actors that have the letters “as” in the first name and sort by the location
of those letters.

16. List the average grade for each year.

17. Compute the average grade for each year and sort by the average grade.

18. Compute the average grade for each year and sort by the average grade but the year
must have more than five movies.

19. Return the names of all of the movies that had the actor with aid = 281.

20. Return the names of the movies which had John Goodman as an actor.

21. Compute the average grade of the John Goodman movies.

22. List the titles of all of the movies that are in French.

23. Without duplication, list the languages of the Peter Falk movies.

24. List the movies that have both Daniel Radcliffe and Maggie Smith.

25. List the other actors that are in the movies with Daniel Radcliffe and Maggie Smith.

26. List the mid and title of each movie that has the word “under” in the title along
with the aid of each actor in that movie. Thus, if there are five actors in the movie
then there will be five returned answers, each with that same movie and a different
actor.

27. Return the names of the five actors that are listed as having been in the most movies.

28. Return the names and average grade of the five best actors (those with the highest
average grade) that have been in at least five movies.

29. Compute the average grade for each decade.

30. Using the Kevin Bacon Effect find the two actors that are the farthest apart.

28.3 Answering the Queries in a Spreadsheet

Many of the queries can be answered through manual manipulation of the data in a
spreadsheet. Some of the queries are very difficult to accomplish in this manner. This
section will show how many of the above queries can be answered within the realm of a
spreadsheet. Some of the methods require human intervention which could easily become
untenable if the data set became large.
Query Q1 asks for the name of the movie with mid = 200. This is easily accom-
plished by just scrolling down the movie page until this mid is visible. The answer is Once
Upon a Time in the West.

Query Q2 seeks the movies that were released in 1955. There is more than one good
solution to this problem. One method would be to sort the data by the year and then
scroll to the entries from the desired year.
A second method is to use the filter function of the spreadsheet. The filter hides
the rows that do not pass the filter condition. In this case it is possible to set the filter
to show only the rows that have the year 1955. The other rows are not removed; they
are merely hidden from view. Figure 28.4 shows the filter dialog in LibreOffice which is
obtained by the menu choices Data:More Filters:Standard Filter. In this case the user
sets the condition that column C must be equal to 1955. The result is shown in Figure 28.5
which shows only the rows where that condition is true. The other rows still exist but they
are hidden from view.

Figure 28.4: The filter popup dialog.

Figure 28.5: The filter results.

Query Q3 seeks the movies with the grade of 9 or better. This can also be ac-
complished by simply sorting or using the filter methods. However, this query requests
that only part of the information be shown and in a certain order. Once the rows have
been isolated by either method the user can manually rearrange the results by cutting
and pasting the columns in the desired order. While this is simple enough to do, it does
require that the user intervene in the query process. In other words, a partial result is
obtained and then the user performs more steps to get to the desired result. The process
is not fully automated.

Query Q4 seeks the names of the movies from an entire decade. Again this can be
accomplished by sorting the data on the year or using the filter feature of the spreadsheet.
Query Q5 pursues the movies that have been assigned the lowly grade of 1. It is
possible that some years have multiple movies that have this grade and the query asks
that each year be listed only once. The advanced features of the filter are obtained by
selecting the Options box in the lower left of the filter dialog. This reveals a few options
of which one is to remove duplicates. For this query only the data in columns C and D are
used. These are copied to a new spreadsheet page and the filter shown in Figure 28.6 is
applied. The result is a few rows that show the years in which there is a movie with a
grade of 1, with each year shown only once.

Figure 28.6: Using the advanced features of the filter to remove duplicates.

Query Q6 is to return the number of actors in the movie with mid = 200. This
information is obtained from the isin table. In this table there are entries from rows 2 to
2364 and the mid values are in column B. So the formula =COUNTIF(B2:B2364,200) will
count the number of rows in the table that have mid = 200. In this case there are 5.
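The same count is a one-line generator expression in Python. The small isin list below is an illustrative stand-in for the real table, not the actual data:

```python
# A stand-in for the isin table: rows of (iid, mid, aid).
isin = [(1, 3, 4), (2, 200, 7), (3, 200, 9), (4, 17, 7), (5, 200, 12)]

# Equivalent of =COUNTIF(B2:B2364,200): count the rows whose mid is 200.
n_actors = sum(1 for iid, mid, aid in isin if mid == 200)
```

With the full table loaded into such a list, the same expression returns the spreadsheet's answer of 5.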
To obtain the average grade of the movies from the 1950’s (Query Q7) the data in
the movie table can be sorted by year and then the user can select the rows from the
desired decade and use the AVERAGE function to compute the average over the grades
from just the selected years. Once again, the user performs one step to manipulate the
data and then performs a second step to get the final result. The user intervenes in the
process to obtain the desired answer.
Query Q8 seeks the average and standard deviation of the length of the movie names.
The length of a string in a cell is computed by the LEN command. Figure 28.7 shows the
use of the command in which the length of cell A2 is placed in B2. This formula is copied

downwards so that the lengths of all of the movie names have been put into column B.
Now the AVERAGE and STDEV functions are used to calculate the results for the values in
this column. The average length is just above 15 characters with a standard deviation
just over 8.

Figure 28.7: The length of a string in a cell.
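The LEN/AVERAGE/STDEV computation is a few lines of Python with the statistics module. The titles below are illustrative stand-ins for the names in column A:

```python
from statistics import mean, stdev

# Illustrative stand-ins for the movie titles in column A.
titles = ["A Touch of Evil", "Casablanca", "Once Upon a Time in the West"]

lengths = [len(t) for t in titles]   # the LEN column
avg = mean(lengths)                  # AVERAGE
sd = stdev(lengths)                  # STDEV (sample standard deviation)
```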

Query Q9 seeks the first names of the actors who have the last name of Keaton.
In a spreadsheet this can be accomplished by sorting actors on their last names and then
finding the Keatons or once again using the filter tool. In this database there are three
actors that fit this description: Michael, Diane and Buster. Query Q10 seeks actors that
have the letters “John” in their first name. This query is a bit different in that the first
name could also be Johnny or John Q. The filter tool does have the option of a cell
containing a particular value in the Condition pull down menu (see Figure 28.4). There
are, in fact, two actors that are named Johnny and one John C.
Query Q11 asks for actors that have two parts to their first name. This will include
people that have two names, one name and an initial, or two initials. In all cases, the
two parts are separated by a space and so the search is for first names that contain a
space character. Once again, the filter tool is useful as it can search for a first name that
contains a space character. However, this will also return a few people that have only one
part to their first name. There are a few actors that have a space after their single name
and these are also returned. Of course, the best solution is that these spaces be removed
from the database. Reality, though, is that data can come from sources other than the
database users and the format of the data may not be to the user’s preference. So, as
an academic exercise, the spaces remain and it is up to the user to define a query that
excludes these individuals. The filter tool allows the user to search on multiple criteria as
shown in Figure 28.8. All three of the Value fields have a single blank space in them.
Query Q12 returns the actors with matching initials in their first and last names.
This requires that the first letter of each name be isolated. The function LEFT
grabs the leftmost characters in a string. To get the first letter in cell B2 the formula
is =LEFT(B2,1). Figure 28.9 shows the solution where cell D2 is the first initial of the
first name and cell E2 is the first initial of the last name. The formula in cell F2 is
=IF(D2=E2,1,0) which places a value of 1 in the cell if the initials match. Once this is
accomplished the data can be sorted on column F to collect the people with the matching
initials. It should be noted that in this method the user had to intervene in the middle of
the process. The sorting stage is applied after the column F is computed.
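In Python the LEFT and IF columns collapse into a single list comprehension with no intermediate sorting step. The names below are made-up stand-ins for the actor table:

```python
# Made-up (first, last) pairs standing in for the actor table.
actors = [("Orson", "Welles"), ("Marilyn", "Monroe"),
          ("Greta", "Garbo"), ("Diane", "Keaton")]

# Keep the actors whose first and last names begin with the same letter.
# Slicing with [:1] avoids an error on an empty name field.
matching = [(f, l) for f, l in actors if f[:1] == l[:1]]
```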
Query Q13 will list the last names of the actors with the first name of John. This

Figure 28.8: Finding individuals with two parts to their first names. Each of the Value fields
contains a single blank space.

Figure 28.9: Finding individuals with the same initials.

listing is to be in alphabetical order. Figure 28.10 shows the sorting dialog in LibreOffice
that sorts on two conditions. The first is the sort on the first names which will collect all
of the Johns together and the second is a sort on the last name which alphabetizes the
Johns (as well as other first names) by their last names. The user then needs to find the
set of John actors and extract the results.

Figure 28.10: Sorting on two criteria.

The LEN function is useful for Query Q14 which seeks the movies with the longest
title. In cell E2 on the movie page the formula to get the length of the name is =LEN(B2).
This formula is copied down for all 800 movies, and then the user can sort on this new
column. Once again, the user must intervene with the process to complete the query.
Query Q15 is similar to previous queries in that the strings in a field are searched for
a particular letter combination. In this case that combination is “as”. However, it needs
to sort the results according to the position of this substring. This is accomplished with
the FIND function. In the actors table, the formula for cell D2 is =FIND("as",B2). In this
case an error code is returned because the name in B2 does not contain the target
letters. This formula is copied down for all rows, and for the few rows that contain
an actor's name with the target letters a value appears. This value is the location of
the first letter of the target. Thus, for Nicholas Cage, the value in the D column becomes
7. The user can then sort on the D column.
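The FIND-and-sort idea translates directly to Python's str.find, which is zero-based and returns -1 (rather than an error code) when the target is absent. The names below are illustrative:

```python
# Illustrative first names standing in for column B of the actor table.
firsts = ["Nicholas", "Jason", "Bastian", "Mary"]

# Pair each matching name with the (zero-based) position of "as",
# then sort on that position as the spreadsheet sort does.
hits = [(name, name.find("as")) for name in firsts if "as" in name]
hits.sort(key=lambda pair: pair[1])
```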
Query Q16 seeks the average grade for each year. Certainly, the movie data can
be sorted by year. It is also possible for the user to compute the average
over a range of movies for a certain year. However, there are about 90 different years in
the database and the number of movies is different for each year. Thus, the user would
then have to write the equation to compute the average for each year as shown in Figure
28.11. That is too tedious, and not a good solution for cases that would have thousands
of segments rather than the 90 in this case.
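This per-group averaging is exactly where a program beats manual spreadsheet work. A sketch of the grouping logic in Python, using illustrative (year, grade) pairs rather than the real movie data:

```python
from collections import defaultdict
from statistics import mean

# Illustrative (year, grade) pairs standing in for the movie table.
movies = [(1955, 8), (1955, 6), (1968, 9), (1968, 7), (1968, 8)]

# Group the grades by year, then average each group.
by_year = defaultdict(list)
for year, grade in movies:
    by_year[year].append(grade)

avg_by_year = {year: mean(grades) for year, grades in by_year.items()}
```

The loop handles 90 years or 9,000 with the same four lines, which is the scaling argument made above.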
Query Q17 builds on the previous query so that the data is sorted by the average
grade. If the user slogged through the process of the previous query then they could sort
on the average values that were computed. However, there is a catch. If the data is sorted
on a column that contains formulas then the cells that the formulas used will also be
changed. So, before the user sorts the data on the average grade those values will need to
be copied and pasted as values instead of formulas using Paste Special. This will convert
the formulas to static values and then sort can continue. This query is possible to do but
employs more than one instance of user intervention.

Figure 28.11: A portion of the window that shows the average for each year.

Query Q18 is the same as Q17 except that years that have fewer than five movies are
to be excluded from the results. The user can start with the spreadsheet used in Query
Q16 and simply eliminate those average calculations for years that have fewer than five
movies. This is doable for this example, but if the query had a thousand segments and
the minimum number was much larger than five then it would be a very tedious task for
the user. Furthermore, the user must be actively involved in seeking the answer.
Query Q19 starts a series of queries that use multiple tables. In this case the query
starts with the actor’s aid and seeks the name of the movies for this actor. This is a two
step process in which the aid is used to fetch the mid values from the isin table, and then
the mid values are used to fetch the movie titles from the movie table.
This query is still possible to do as shown in Figure 28.12. The data in the isin
table is processed by a filter that keeps only the rows in which aid = 281 and as seen
there are four. Column B contains the mid values and these need to be converted into
movie names. Cell D469 contains the formula =OFFSET(movie.B$1,B469,0) which relates
the mid to the movie name as long as the movie data is sorted by the mid. The OFFSET
command starts at cell B1 and then moves downwards by the number of rows
specified by the value in cell B469. The third argument of 0 indicates that there is no
horizontal shift. If this value were 1 then the information shown in the cell would be from

the next column to the right which is the year of the movie.

Figure 28.12: The movies of aid = 281.
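The OFFSET arithmetic is what a Python dictionary does naturally: mapping a mid directly to a title removes the need for sorted data. The tables below are illustrative stand-ins, not the real database:

```python
# Illustrative stand-ins for the movie and isin tables.
movie = {3: "A Touch of Evil", 7: "Casablanca", 12: "The Third Man"}
isin = [(1, 3, 281), (2, 7, 281), (3, 12, 9), (4, 7, 9)]  # (iid, mid, aid)

# Titles of the movies that include the actor with aid = 281.
titles = [movie[mid] for iid, mid, aid in isin if aid == 281]
```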

Query Q20 takes this idea a step further by starting with the actor’s name. The
name is converted to the aid values in the actors table and then this information is
converted to movie titles as in Q19. The user is heavily involved in the steps of this query
as now two levels of OFFSET are needed. Query Q21 is similar except that there is one
more layer: once the movies are collected, the average grade must be computed. In
this query, there is only one actor and therefore only one aid, so the difficulty is not really
elevated compared to Q19. In a case in which the combined average score of 100 actors is
requested, the level of complexity is increased as the transition from actor’s name to aid
needs to be automated.
Query Q22 is similar to Q20 except that the query starts with a language and goes
through the lang and inlang tables rather than the actor and isin tables.
Query Q23 starts with the actor's name and ends with the languages of that actor's
movies. Thus, it uses in order the actor, isin, inlang and lang tables. There is also a caveat that
the languages be listed only once which can be accomplished with the filter tool using the
option to remove duplicates as in Q5.
The logic changes somewhat with Query Q24 which seeks the names of the movies
that star two individuals. The logic flow is shown in Figure 28.13. The box named actor1
starts with the first and last names for the first actor (Maggie Smith) which then converts
this to her aid. The containing box represents the information that is available in the
actor table and the use of the integer in the table merely separates it from the second
use of the table shown directly below. The actor2 table follows the same logic but for
the second actor Daniel Radcliffe. The aid of each actor is converted to the mid values
of their movies through separate uses of the isin table. The intersection of the two sets of
mid values is then obtained and sent to the movie table to get the names of the movies.
Query Q19 demonstrated how an actor's aid is converted to mid values, and that
process is used twice in Q24. Finding the common mid values is shown in Figure 28.14.
The first two columns are the mid values from their movies. The formula in cell C2 is
=MATCH(B2,A$2:A$6,0) which returns the location of the match for cell B2. In other
words, the value in B2 is the first item in the list in column A. The next two movies also
find matches, but the rest do not which is indicated by #N/A. The formula in cell D2 is
=OFFSET(A$2,C2-1,0) and this returns the value of the match. Thus, all of the values in
column D are the mid values of the movies that both actors are in. The spreadsheet filter

Figure 28.13: The logic flow for obtaining the name of a movie from two actors.

can be used to isolate those from the #N/A entries. Now, that the common movies are
found the process of Query Q1 can be used to extract the names of the movies.

Figure 28.14: Finding the common elements in two lists.
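In Python the MATCH/OFFSET machinery reduces to a set intersection. The mid values below are illustrative:

```python
# Illustrative mid sets for the two actors' movies.
mids_actor1 = {11, 42, 57, 63}
mids_actor2 = {42, 57, 88}

# The movies that both actors appear in.
common = sorted(mids_actor1 & mids_actor2)
```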

Query Q25 extends this and instead of retrieving the names of the movies, the mid
values would be used to get the actor aid values and then their names. While this query
can be accomplished in a spreadsheet, there are several parts of the query that require
user intervention.
Query Q26 seeks movies with the letters “under” in the title. The twist is that it
also needs to return the aid of the actors in that movie. If a movie has five actors then
the answer should list the movie five times with each time showing a different aid. In a
spreadsheet this challenge starts with the movie title and converts that to the mid, then
to multiple aid values, and then to the actors' names. The user is heavily involved in walking
this process through the spreadsheet data.
Query Q27 seeks the five actors that have been in the most movies. This
requires that the number of movies for each actor be known. It is possible to sort the isin
table on the aid values and then to count the number of rows for each actor aid. This is
similar to the computation of the average grade for each year, in that it is a doable but
very tedious task. Once the number of movies for each actor is known then the user can
sort on those counts.
Query Q28 seeks the average grade which means that the average grade for each
actor must be computed. Furthermore, the user needs to exclude actors with too few

movies. Again this is a very tedious task that would be untenable for larger data sets.
Another approach is shown in Figure 28.15 that compares values in multiple columns.
The formula in cell D2 is =COUNTIF(C$2:C$2338,A2). This counts the number of entries
in column C that have the same value as cell A2. The purpose is to count the number of
movies for each actor and since the values in that column are coincidentally the same as
the aid values, this computation also counts the number of movies for the actor with aid
= 1. This formula is copied down and the next step would be to find the maximum value
in this column.

Figure 28.15: Counting the number of movies for each actor.
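The tally that takes a copied-down COUNTIF column in the spreadsheet is a single pass with collections.Counter. The aid values below are illustrative:

```python
from collections import Counter

# Illustrative aid column from the isin table.
isin_aids = [4, 4, 9, 4, 7, 9]

# Movies per actor, then the actors with the most movies (Q27 wants five).
per_actor = Counter(isin_aids)
top = per_actor.most_common(2)
```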

Query Q29 seeks the average grade for each decade which, in a spreadsheet, is easier
than the average grade for each year as there are fewer divisions. So the process of Q16 is
repeated with different divisions.
Query Q30 deals with the Kevin Bacon Effect which follows the links between two
actors through common movies. The idea is that one actor has been in a set of movies
which has other actors. Those actors have a set of movies which have different actors.
This process continues until one of the actors is Kevin Bacon. To get the path from a
single actor to Kevin Bacon is tedious but tenable with a spreadsheet. The final query,
however, searches for the shortest such path between any two actors and this is a job for
a computer program.
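One way to write such a program is a breadth-first search over a graph whose nodes are actors and whose edges join actors that share a movie. The graph below is a toy example; building the real graph from the isin table is a separate step:

```python
from collections import deque

# Toy actor graph: an edge joins actors who share a movie.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

def distance(graph, start, goal):
    """Smallest number of shared-movie links from start to goal, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        actor, d = queue.popleft()
        if actor == goal:
            return d
        for nxt in graph[actor] - seen:
            seen.add(nxt)
            queue.append((nxt, d + 1))
    return None
```

Running distance over every pair of actors and keeping the largest finite result answers the final query.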
Most of the queries in the list can be accomplished with a spreadsheet. Some of the
queries, however, are only workable if the data set is small. Some queries require user
intervention. Intermediate results are returned and then the user must perform an action
such as a filter, a search or a sort. From that process the final answer becomes available.
Thus, the query is not fully automated.
A database management program such as MySQL offers several advantages over a
spreadsheet. These include support for several simultaneous users and for security. It also
offers the advantage of fully automating complex queries. As will be seen in the next
chapters, nearly all of the above queries can be converted to a single MySQL command
that returns the final result. Once the command is written, user intervention is not required.

Problems

1. In the movie spreadsheet get the actors with an aid between 95 and 100 (inclusive).

2. Using the spreadsheet return a list of actors that have only one of the two name
fields with an entry. Some actors go by a single name and so only one field is used.

3. Return an alphabetized list of the last names of actors that have George as a first
name.

4. Return an alphabetized list of the first names of actors that have Mason as a last
name.

5. Using the spreadsheet determine if there are any movies that have actors from both
of the lists in the previous two questions. Basically, is there a movie with one actor
having a first name of George and another actor having the last name of Mason?

6. Using the spreadsheet find the list of languages from movies that are made in Mexico.
This list should not have duplicates.

7. What is the year of the earliest movie made in the UK?

8. What is the year of the earliest movie not made in the USA?

9. Return a list of actors that are in movies that have German as a language. This list
should include first and last names, be alphabetized on the last names, and have no
duplicates.

10. What are the languages of movies starring Mads Mikkelsen?

11. Which actor has the most languages associated with their movies?

Chapter 29

Common Database Interfaces

There are several options for storing data in a database. The website http://db-engines.com/en/ranking
lists almost 300 different database engines according to their popularity.
This chapter will review just three of these as they are viable products for the following
chapters. For each there will be examples of how to load the data, perform queries
and transfer the data to another program such as a word processor. Creating queries to
perform specified tasks is reserved for the following chapters.

29.1 Differences to a Spreadsheet

The previous chapter explored the use of a spreadsheet for storing data and performing
queries. For small data sets and mild queries a spreadsheet offers a good platform. However,
spreadsheets falter as the requirements increase. While spreadsheets now allow
multiple-user access through cloud services, they still lack access control. A
DBMS (database management system) can control what each user can read and what
each user can write. This includes controlling access in different manners on the same
table. A DBMS is also capable of accessing data that is distributed among many servers.
For large or critical databases, distribution is essential.
For the chapters in this book, however, the most important advantage of a DBMS
over a spreadsheet is the ability to automate complicated queries. Excepting the last two
queries, all of the queries in the list in the previous chapter can be performed in a single
command.

29.2 Tables Required

In a spreadsheet all of the data is stored on pages with a two-dimensional array of cells.
Databases hold to this philosophy by placing all data into tables. Each table has fields

which contain a single data type. These are similar to the columns used in the movie
database. A field has an associated data type so, for example, the movie grade can be
stored as an integer rather than as a string. Each row is called a record or tuple.
Each table must also have a primary key. This is a field in which there are no
duplicated entries. In the case of the movie database the names of the movies could not
be used as a primary key because there are movies with exactly the same title. Likewise,
the years and grades of the movies could not be used as a primary key. As is common
practice, the primary key is an additional column with incrementing integers. This is the
first column in the table. All of the tables in this database use this same philosophy and
have a field of incrementing integers as the primary key.
Designing a table for a database is important as an improperly designed table will
make queries difficult to construct and could slow the response time. Consider again the
movie database in which a single movie has a year, a grade, several actors, languages, and
countries in which it was filmed. Since a movie has a single name, year and grade these
items could be placed in the same table. The number of actors varies and the same actor
can appear in multiple movies. This is sufficient to require the actors to be contained in
a separate table. The same logic applies to the languages and countries.
Previous chapters explored the use of Python for manipulating information. If the
movie information were stored in Python then one might consider keeping all of the infor-
mation in a list such as [name, year, grade, [actors], [languages], [countries]
]. In this case lists are used to store the information about actors since the number of
actors varies. The rule of thumb is that if it is convenient to store the information in a
list in Python then a new table is needed when storing the information in a database.
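This rule of thumb can be illustrated with Python's built-in sqlite3 module. The sketch below is illustrative only: sqlite3 stands in for MySQL (the SQL dialects differ slightly), and the record contents are chosen for the example.

```python
import sqlite3

# In-memory database; sqlite3 is a stand-in for MySQL here.
con = sqlite3.connect(':memory:')
cur = con.cursor()

# The single-valued items (name, year, grade) share one table.
cur.execute('CREATE TABLE movie (mid INTEGER PRIMARY KEY, '
            'name TEXT, year INTEGER, grade INTEGER)')
# The list-valued item (actors) becomes two tables: one for the
# actors themselves and one linking actors to movies.
cur.execute('CREATE TABLE actor (aid INTEGER PRIMARY KEY, name TEXT)')
cur.execute('CREATE TABLE isin (aid INTEGER, mid INTEGER)')

# A nested Python record of the form [name, year, grade, [actors]].
record = ['Once Upon a Time in the West', 1968, 10,
          ['Henry Fonda', 'Charles Bronson']]
cur.execute('INSERT INTO movie (name, year, grade) VALUES (?,?,?)',
            record[:3])
mid = cur.lastrowid
# The variable-length actor list goes into the separate tables.
for actorname in record[3]:
    cur.execute('INSERT INTO actor (name) VALUES (?)', (actorname,))
    cur.execute('INSERT INTO isin (aid, mid) VALUES (?,?)',
                (cur.lastrowid, mid))

nactors = cur.execute('SELECT COUNT(*) FROM isin WHERE mid=?',
                      (mid,)).fetchone()[0]
```

The list-valued portion of the record never enters the movie table; it is recovered by querying the isin table for the movie's key.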
The schema of a database is the set of tables and their connections. The schema for
the movie database is shown in Figure 29.1. Each table has a name and in the white boxes
are the names of the fields. The primary keys are denoted as well. The lines connecting
them show the fields that represent the same type of data. For example, both the actor
and isin tables have the actor’s ID. Both of these are labeled aid, but that is merely a
convenience. It is possible that the two fields could have different names but still
represent the actor’s ID. The line connecting them shows that these two fields represent
the same data. The schema does show that it is possible to travel from any table to any
other table although passing through intermediate tables may be required. In this manner,
the user can see that it is possible to create a query with data from one (or several) tables
and retrieve data from any other table.

29.3 Common Interfaces

There are many DBMS products available, some at no cost. The most common
are Oracle, MySQL, Microsoft SQL Server, and PostgreSQL. Some of these systems are
designed for industrial data sets while others are designed for personal uses. The three
systems that are reviewed here are sufficient for the rest of the chapters. These are

Figure 29.1: The movie database schema.

Microsoft Access, LibreOffice Base and MySQL. All of these products can host a database
or act as a client and connect to a server that contains the data. Furthermore, these three
products all use the MySQL language, so the following examples will work in any of these
environments.
An example query is shown in Code 29.1 which retrieves the information about the
movie that has the movie ID mid = 200. This query is used here to show how to access
data through the different products. Explanations concerning the components of queries
follow in the next two chapters.

Code 29.1 An example query.


1 SELECT name, year, grade FROM movie WHERE mid=200;

The ensuing subsections show how to establish a table, upload data, submit a query
and copy the data for Microsoft Access, LibreOffice Base and MySQL.

29.3.1 Microsoft Access

Microsoft Access is a part of the Microsoft Office suite that manages a database. It has
the capability to manage a local database or connect with a database on a server. It
is a personal database manager with some limitations. There are versions of the Office
suite that work on Windows and OS X but not directly on UNIX platforms. There is
a 2 GB limit on the amount of data and a 32K limit on the number of objects in the
database.[Corp., 2016]
Access does have a graphical interface which is useful for non-expert database users.
While it does have many features only the basic steps are shown here which are sufficient
to load data and present a query. Users intending to use this product are encouraged to
read more detailed manuals to gain insight into the full capabilities of Microsoft Access.

When Access is started the user is presented with several choices as shown in Figure
29.2. In this case, a new blank database is started and so the first selection is used. One
major convenience of Access is that it can easily create a database by importing data from
Excel. In the following example, the movie data spreadsheet movies.xlsx is used. Figure
29.3 shows the selections to import this data.

Figure 29.2: The opening selection.

A new dialog appears that offers the user choices on how to import the data as seen
in Figure 29.4. The first choice is to create a new table in the database which is the desired
path for this example. The second choice is useful later when data is to be added to a
database table.
The Excel spreadsheet has many pages and each one will be imported individually.
The next dialog that appears allows the user to select the page from the spreadsheet that
is to be imported. The following dialog allows the user to select if the first row contains
the column headings. In this case, the first row of the spreadsheet is the name of the
columns and so the box in the dialog should be checked. If the first row in the spreadsheet
page was the first row of data then this box would be left blank.
Data in a spreadsheet is usually considered to be a string or a double precision float.
The database, on the other hand, has many more data types that can be used, so the user
will need to intervene to select the correct data types for the importing data. The ensuing
dialog is shown in Figure 29.5 which is only the top portion of the dialog. In the movie
page of the spreadsheet there are four columns of data: mid, name, year and grade. The
data type for each of these needs to be established. The figure shows the selection for the
mid field. The user changes the Data Type to Integer as shown. The name column should
be a VARCHAR of length 100, and the other two columns are selected to be integers.
Every table in a database needs to have a primary key. The next dialog allows the
user to select if Access will create a primary key table or if the imported data has the
primary key. In this case, the mid field is the primary key and so the second option of
“Choose my own primary key” is selected and the user selects which field is to be used as
a primary key. The final selection is the name of the table in the database. The default
is that it will be the same name as the page in the spreadsheet. However, the user can

Figure 29.3: Importing from Excel.

Figure 29.4: Importing choices.

Figure 29.5: Selecting the data type.

alter that choice. In this example, the names of the pages in the spreadsheet are also the
names of the tables in the database and so the default values are used.
This concludes the intervention required to import the data from the movie page to
the database. The process needs to be repeated for every page in the spreadsheet. Once
all of the data is uploaded the user should save the file in Access. This is a single file that
can be copied to other computers and a double click on the file icon will start Access and
load the data.
After all of the data is loaded, it is possible to create queries. Figure 29.6 shows the
query selection window that is a graphical interface for creating a query in which the user
makes selection and Access converts the selections into a MySQL query. In this case, only
four of the tables have been loaded and the user can select the tables to use. The fields
behind can be filled in to create a query. However, this process is slow and it is much
more efficient to just write the MySQL query command.

Figure 29.6: Starting the query process.

At the top of the main window there are several tabs of which one of them is the
Query tab. A right-click on this tab brings forth a small menu as shown in Figure 29.7.
The last selection is SQL View which converts the screen to a window where the user
can type in the command directly. The user can then enter the MySQL command in the
window as shown in Figure 29.8.
When the query is executed it returns a table with the response. This is a simple
table format and the data can be painted with the mouse, copied and pasted into a Word
document or an Excel spreadsheet.
While Access has many functions, the ones shown here are the basics on how to
load data from a spreadsheet and perform a query. Users interested in using this product
should invest in reading other resources to learn the capabilities of Microsoft Access.

Figure 29.7: Converting to the MySQL command view.

Figure 29.8: Entering the MySQL command.

29.3.2 LibreOffice Base

Another choice for a personal database manager is LibreOffice Base. It is similar to Access
in that it provides the ability to host a database on the local computer or access one on
another machine. Some of the advantages of Base are that it is freely available with the
LibreOffice suite and that it runs on UNIX as well as Windows and OS X.
Some installations of LibreOffice Base return an error indicating that the user needs
to install JRE (Java runtime environment). This is an unfortunate error as the solution
is slightly different than the error indicates. There are two parts to the solution. The first
part is that the user needs to have JDK (Java Development Kit) installed. Furthermore,
this needs to be the 32-bit version of JDK as LibreOffice is a 32-bit program. The second
part of the solution is that LibreOffice needs to be connected to the JDK. A computer may
have more than one installation of JDK and so it is necessary to select the correct version.
In the Base program the user selects Tools:Options. A new dialog appears and on the left
the user selects LibreOffice:Advanced. In the Vendor panel the user can connect to any
of the JDK systems that are installed. Once connected to the 32-bit version LibreOffice
needs to be restarted.
The initial dialog that appears after starting the program asks the user if they are
starting a new database or connecting to an existing one. Once again the “Create a new
database” selection is chosen. This leads to the next dialog which asks the user to register
the database. Following this is a dialog where the user decides on the name and location
of the file that will be saved. This file will be the database with the extension odb and can
be copied and used on other machines that have LibreOffice installed.
The next window that the user sees is the main interface as shown in Figure 29.9.
Initially, the Tables frame is empty. To load a table the user opens the spreadsheet that
contains the data. There the data to be loaded is painted and copied to the clipboard.
Then the user goes to the database dialog and right clicks on an empty space in the Tables
frame. There are several options and the one to select is Paste.
After Paste is selected the Copy Table dialog appears as shown in Figure 29.10.
Here the user selects the name of the table to be created in the database and if the first
line of the data is to be the field names in the database. In this case, the movie table is
being imported. The user selects the use of the first line as column names if they were in
the copied data.
The next dialog allows the user to select which columns in the spreadsheet are to be
copied into the database. In this case, all of them are and so the >> is selected and all of
the entries in the left pane are moved to the right pane. This is shown in Figure 29.11.
Figure 29.12 shows the next dialog in which the user defines each of the fields. In
this image the mid field is changed to the Integer data type. Before the user moves on to
the next dialog, the data type for all four of the fields needs to be set.
This is sufficient to upload the data. The user will be asked to automatically set a
primary key column which in this case is rejected since the mid data is being uploaded. To

Figure 29.9: The initial dialog.

Figure 29.10: The Copy Table dialog.

Figure 29.11: Selecting data fields.

Figure 29.12: Setting the data types.

set the primary key the user right clicks on the movie icon in the database Table window.
Then the user selects the row to be set by a right click on the gray box to the left of the
field name as shown in Figure 29.13. Now, the primary key is set and the user can then
repeat the process for the other pages in the spreadsheet.

Figure 29.13: Setting the primary key.

Once the process has been applied to all of the tables in the spreadsheet the main
dialog appears like the image in Figure 29.14. Now the user is ready to generate a query.
This starts with the selection of the Queries button in the Database panel on the left.
The choices in the Tasks panel change and the last one creates a window for the
user to enter in the MySQL command directly. Once the command is entered then the
query returns results in a table. Unfortunately, moving the results to a word processor
is not as easy as copy and paste. Figure 29.15 shows that the user selects the first and
last gray boxes on the left column to paint the rows of data to be copied. Then the user
right clicks on one of those gray boxes to get a popup menu that has a copy option. Now
the data can be copied into a spreadsheet, though not directly into a word processor; it is,
however, possible to copy from the spreadsheet to the word processor.

Figure 29.14: The main dialog.

Figure 29.15: Copying the data to a spreadsheet.

29.3.3 MySQL

The final product to be reviewed is MySQL which is available at no cost. This is a
professional grade DBMS which will allow for large databases and many users. However,
in its native form, MySQL has only a command line interface.
It is possible to install just the client side version of MySQL which allows the user
to access a MySQL database residing on a different computer. To host a database on the
local computer, the user needs to install MySQL server. Code 29.2 shows the command
line instruction to connect to the database. The command is mysql followed by some
options. The -D option is the name of the database and the user would replace the word
database with the name of their database. If this option is not used then the user will
need to connect to a database once they are logged in. The -h option is used if the
database is hosted on a different machine. The argument to this option is the address of
the computer hosting the database. If the database is on a local machine then this option
is not used. The -u option is the user’s MySQL user name which may be different than
the name that is used to log on to the computer. If the user has installed the database
on their local machine then the MySQL user name may be root. Finally, the -p option
indicates that the user will need to enter in their MySQL password after they have hit the
Return key.

Code 29.2 Connecting to MySQL.


1 mysql -D database -h hostname.school.edu -u me -p

Successful access to the MySQL system will be rewarded with the prompt mysql>.
Now the system is ready to receive a query command.
The next step is to create the tables and upload the data. Every user has a set of
privileges, and it is possible that user may not have privileges to upload data to the
database. The MySQL administrator can change these privileges or find other avenues to
upload the data.
Assuming that the user has sufficient privileges to create tables and to upload data
then the following steps will load data from a spreadsheet to the MySQL database. There
are many other methods to upload data. The first step is to convert each page in the
spreadsheet to a tab delimited CSV file. The second step is to open a command line shell
and change the directory to the one where these CSV files reside. Lines 1 and
2 in Code 29.3 create a new table named movie. Inside the parentheses are the details
of the four fields. The first is the mid which is an integer and also the primary key. It is
also set for automatic increments. This means that each time a new entry is added to the
table the value of mid is one more than the previous value. Thus, it is not necessary to
enter the values of mid when the data is entered.
It is also possible that an error was made in the creation of the table. There is no
undo (control-Z) in MySQL. One option for correcting a disastrous error is to start over. This
requires the destruction of the table which is performed in line 3. Then the correct

Code 29.3 Creating a table in MySQL.
1 mysql> CREATE TABLE movie (mid INTEGER AUTO_INCREMENT PRIMARY KEY,
2 name VARCHAR(100), year INTEGER, grade INTEGER);
3 mysql> DROP TABLE movie;

command for the creation of the table can be entered. Each command in MySQL is
followed by a semicolon. Failure to include this is not disastrous as MySQL will simply
provide a prompt waiting for the user to complete the command with a semicolon.
Code 29.4 shows the command that will upload a CSV file into an existing table.
The two variables that the user needs to adjust are the name of the CSV file (which in this
example is movies.csv) and the name of the table where the data will be inserted (which
in this case is movies).

Code 29.4 Uploading a CSV file.


1 mysql> LOAD DATA LOCAL INFILE ’movies.csv’ INTO TABLE movies FIELDS
2 TERMINATED BY ’\t’ ENCLOSED BY ’’ ESCAPED BY ’\\’
3 LINES TERMINATED BY ’\n’ STARTING BY ’’;

The process needs to be performed for all pages in the spreadsheet. The user needs
to create the table and then upload the data. This process uses several commands and the
command line interface is not very friendly. A good option is to copy successful MySQL
commands to a text editor. This will allow the user to employ the text editor tools to
create new commands. These, then, can be copied to the command line for execution.
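As an alternative to LOAD DATA, a tab delimited file can also be parsed and inserted with a short Python script. The sketch below uses the built-in csv and sqlite3 modules; sqlite3 stands in for MySQL here, and the two-row file contents are invented for the example (io.StringIO plays the role of an open file).

```python
import csv
import io
import sqlite3

# A small tab-delimited file, as produced by saving a spreadsheet page.
csvtext = ('200\tOnce Upon a Time in the West\t1968\t10\n'
           '201\tSleepy Hollow\t1999\t7\n')
reader = csv.reader(io.StringIO(csvtext), delimiter='\t')
# Convert the text fields to the proper types before inserting.
rows = [(int(m), n, int(y), int(g)) for m, n, y, g in reader]

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE movie (mid INTEGER PRIMARY KEY, '
            'name TEXT, year INTEGER, grade INTEGER)')
# executemany inserts every parsed row in a single call.
cur.executemany('INSERT INTO movie VALUES (?,?,?,?)', rows)
con.commit()

count = cur.execute('SELECT COUNT(*) FROM movie').fetchone()[0]
```

For a real file the io.StringIO object would be replaced by open('movies.csv'), and the connection would point at the actual database.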
There are many ways to insert data into a table and some of these will be reviewed
in later chapters. However, there is a global alternative that uses the UNIX command
mysqldump. This program is run from the UNIX command line instead of the MySQL
command line. This command can dump an entire database into a text file as shown in
line 1 of Code 29.5. This command will dump the database named databasename into a text
file named dumpfile.sql. Line 2 is used to load the database stored in this file back into
MySQL. The file dumpfile.sql is a text file and so it can be transferred from one machine
to another. If the file is already available, then the user can use line 2 to create the tables
and load the database.

Code 29.5 Using mysqldump.


1 mysqldump -u username -p databasename > dumpfile.sql
2 mysql -u username -p databasename < dumpfile.sql

A query is executed through the MySQL command as shown in Code 29.6. Line 1
is the same MySQL command used in the previous examples. The rest is the response
returned by MySQL which can be copied from the command line and pasted to a word

processor.

Code 29.6 An example query.


1 mysql> SELECT * FROM movie WHERE mid BETWEEN 200 AND 204;
2 +-----+------------------------------+------+-------+
3 | mid | name | year | grade |
4 +-----+------------------------------+------+-------+
5 | 200 | Once Upon a Time in the West | 1968 | 10 |
6 | 201 | Sleepy Hollow | 1999 | 7 |
7 | 202 | Blow Dry | 2001 | 7 |
8 | 203 | A Foreign Field | 1993 | 7 |
9 | 204 | Capote | 2005 | 9 |
10 +-----+------------------------------+------+-------+
11 5 rows in set (0.43 sec)

The command line interface is very basic and users may prefer a graphical interface.
The MySQL Workbench is an excellent tool that is freely available that will provide a
graphic front end to the MySQL database.

29.4 Summary

There are many different DBMS available. Tools that are suitable for the rest of the
chapters are Microsoft Access, LibreOffice Base and MySQL. The latter two are available
without cost. Any of these products are suitable for personal databases and each uses the
MySQL command language.

Chapter 30

Fundamental Commands

This chapter will review some of the fundamental MySQL commands that manipulate and
retrieve data from a single table. This will include commands to upload and to receive
answers from queries. Commands that use multiple tables are discussed in Chapter 31.
As the commands are reviewed the appropriate queries from the list in Chapter 28 will be
revealed.

30.1 Loading Data

Code 29.4 showed a method of uploading an entire tab delimited file into a table. This
section will review methods of appending to a table and altering features of a table. The
first few commands are used to set up a database. These are followed by commands to
set up tables and to populate the tables.

30.1.1 Establishing a Database

A user may have several databases within a DBMS. The movies and actors examples use
the movie database, but it is quite possible to generate other databases. The creation of
a database is performed by the CREATE DATABASE command shown in line 1 of Code
30.1. Line 2 selects which database will be used in the subsequent queries.

Code 30.1 Creating a database.


1 mysql> CREATE DATABASE my_new_database;
2 mysql> USE my_new_database;

30.1.2 Creating a Table

The creation of tables is performed with the CREATE TABLE command as shown in Code
30.2. In this example the name of the table is movies. Following that is text inside of
parenthesis that defines the attributes (or columns in the tables). Each column gets a
name and a data type. One of the attributes must be defined as the primary key. The
AUTO_INCREMENT keyword indicates that this particular attribute will increment with
each entry. In the first tuple this entry is 1, in the second tuple this entry is 2, and so on.
This is automatic which means that the user will not have to insert data for this attribute.

Code 30.2 Creating a table.


1 mysql> CREATE TABLE movies (mid INTEGER PRIMARY KEY AUTO_INCREMENT,
2 name VARCHAR(100), year INTEGER, grade INTEGER );

The VARCHAR(100) datatype indicates that name is a string that can have up to 100
characters.
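The behavior of an auto-incrementing primary key can be demonstrated with Python's built-in sqlite3 module, where an INTEGER PRIMARY KEY column increments automatically and so plays the role of MySQL's AUTO_INCREMENT. This is a sketch with sqlite3 standing in for the MySQL engine:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
# In sqlite3 an INTEGER PRIMARY KEY column auto-increments by default.
cur.execute('CREATE TABLE movies (mid INTEGER PRIMARY KEY, '
            'name TEXT, year INTEGER, grade INTEGER)')

# No value is supplied for mid; the engine assigns 1, 2, ... itself.
cur.execute("INSERT INTO movies (name,year,grade) "
            "VALUES ('A Perfect Couple',1979,6)")
cur.execute("INSERT INTO movies (name,year,grade) "
            "VALUES ('A Touch of Evil',1958,9)")

mids = [row[0] for row in
        cur.execute('SELECT mid FROM movies ORDER BY mid')]
```

The mids list confirms that the key values were generated without any intervention from the user.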
The SHOW TABLES command displays the individual tables within a database. The
example in Code 30.3 is performed after the tables are created.

Code 30.3 Showing a table.


1 mysql> SHOW TABLES;
2 +------------------+
3 | Tables_in_cds230 |
4 +------------------+
5 | actor |
6 | country |
7 | incountry |
8 | inlang |
9 | isin |
10 | lang |
11 | movie |
12 +------------------+

Information about an individual table is obtained through the DESCRIBE command.


Code 30.4 shows the command for describing the movies table. As seen, the results provide
the names of the attributes, their datatypes, and information about the key, default values
(if any) and other details.
Code 30.5 shows the DROP TABLE command which destroys a table and all of the
data within it. It should be noted that there is no CTL-Z command in MySQL. Once a
table is dropped it is completely gone.

Code 30.4 Describing a table.
1 mysql> DESCRIBE movies;
2 +-------+--------------+------+-----+---------+----------------+
3 | Field | Type | Null | Key | Default | Extra |
4 +-------+--------------+------+-----+---------+----------------+
5 | mid | int(11) | NO | PRI | NULL | auto_increment |
6 | name | varchar(100) | YES | | NULL | |
7 | year | int(11) | YES | | NULL | |
8 | grade | int(11) | YES | | NULL | |
9 +-------+--------------+------+-----+---------+----------------+
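Other engines expose the same information under different names. For instance, Python's built-in sqlite3 module (used here only as a stand-in for MySQL) provides a rough analog of DESCRIBE through the PRAGMA table_info statement:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE movies (mid INTEGER PRIMARY KEY, '
            'name TEXT, year INTEGER, grade INTEGER)')

# PRAGMA table_info returns one row per field: the column id, name,
# declared type, NOT NULL flag, default value, and primary-key flag.
info = cur.execute('PRAGMA table_info(movies)').fetchall()
fields = [row[1] for row in info]
types = [row[2] for row in info]
```

As with DESCRIBE, this lists every field together with its declared data type.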

Code 30.5 Dropping a table.


1 mysql> DROP TABLE movies;

30.1.3 Loading Data into a Table

A single row of data is inserted into the database using the INSERT command. The user
can select which columns are being used. The first entry in the movies table is “A
Face in the Crowd” and its grade is 9. Even though the movie was released in 1957 this
information is not included in this command. Again, since the column mid is an automated
column the user does not supply information for it.
The command to insert this data is shown in Code 30.6. The INSERT INTO command
will add a row to the table. The first set of parentheses indicate which columns are being
supplied with data. This is followed by the keyword VALUES. The second set of parentheses
supply the data. In this case the name of the movie is a string and is thus enclosed in
quotes. Furthermore, since this is data, capitalization is maintained unlike keywords.

Code 30.6 Inserting data.


1 mysql> INSERT INTO movies (name,grade)
2 VALUES (’A Face in the Crowd’, 9);

This command inserts data at the end of the table. It is not advisable to insert data
in the middle of the table because it will alter the correlation between tuples and keys.
For example, with the auto incrementing key for the movies table each new movie gets a
unique key. If in this example “Star Wars” was the next movie added then its mid would
be 2. However, if later “Key Largo” were to be inserted above “Star Wars” then the mid
for “Star Wars” would be changed to 3. This would cause serious problems in the isin table
as all of the entries for mid that are 2 and greater would need to be altered. So, the
rule of thumb is that data is appended to the end of the table.
It is also possible to insert more than one row at a time. Code 30.7 uploads two

rows of data. In this case, three columns of data will be used. Following VALUES there
are two sets of parentheses which each supply a single row of data. The number of rows
that can be inserted is not strictly limited. There are two immediate caveats. The first is
that all entries must have the same number of columns and the second is that the total
length of the INSERT INTO command is limited. The latter comes into play when many
rows are uploaded in a single command.

Code 30.7 Multiple inserts.


1 mysql> INSERT INTO movies (name,year,grade)
2 VALUES (’A Perfect Couple’,1979, 6),
3 (’A Touch of Evil’,1958,9);
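The multi-row form of INSERT can be tried outside of MySQL as well. The sketch below uses Python's built-in sqlite3 module as a stand-in; the two movies are taken from Code 30.7:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE movies (mid INTEGER PRIMARY KEY, '
            'name TEXT, year INTEGER, grade INTEGER)')

# A single INSERT statement carrying two rows of data.
cur.execute("INSERT INTO movies (name,year,grade) VALUES "
            "('A Perfect Couple',1979,6), "
            "('A Touch of Evil',1958,9)")

nrows = cur.execute('SELECT COUNT(*) FROM movies').fetchone()[0]
```

Both rows arrive in one command, and each receives its own auto-generated mid.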

30.2 Updating

Once a table has been created it can be modified. Columns can be added and removed.
The data type of the columns can be modified, but this may also be incompatible with
previously stored data. The ALTER TABLE command is used for all table modifications. Code 30.8
shows just two of the many possible uses. Line 1 creates a new column newcol for the
table movies. Line 2 changes this column to a BIGINT data type.

Code 30.8 Altering data.


1 mysql> ALTER TABLE movies ADD COLUMN newcol INT;
2 mysql> ALTER TABLE movies MODIFY newcol BIGINT;

Many other uses include renaming the table or columns, altering the key columns,
managing memory, etc.
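A minimal sketch of adding a column follows, using Python's built-in sqlite3 module as a stand-in for MySQL. Note that sqlite3 supports ADD COLUMN but not MySQL's column-type changes, so only the first operation is shown:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE movies (mid INTEGER PRIMARY KEY, name TEXT)')

# Add a new column to the existing table.
cur.execute('ALTER TABLE movies ADD COLUMN grade INTEGER')

# PRAGMA table_info confirms that the table now has three fields.
fields = [row[1] for row in cur.execute('PRAGMA table_info(movies)')]
```

Existing rows receive a NULL (or declared default) value in the new column.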
For example, in the first version of the database the title of the movie “Nurse Betty”
was misspelled. The correction was achieved with the UPDATE command as shown in Code
30.9.

Code 30.9 Updating data.


1 mysql> UPDATE movies SET name="Nurse Betty" WHERE mid=84;
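The effect of UPDATE with a WHERE clause can be verified with a short sketch in Python's built-in sqlite3 module (a stand-in for MySQL; the misspelling is invented for the example):

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE movies (mid INTEGER PRIMARY KEY, name TEXT)')
cur.execute("INSERT INTO movies (mid, name) VALUES (84, 'Nurse Bety')")

# Correct the misspelled title; WHERE restricts the change to one row.
cur.execute("UPDATE movies SET name='Nurse Betty' WHERE mid=84")

name = cur.execute('SELECT name FROM movies WHERE mid=84').fetchone()[0]
```

Without the WHERE clause the UPDATE would change every row in the table, so the condition is essential.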

30.3 Privileges

The creator of the database has the option of limiting access to the data. Limitations
include blocking access to certain tables or even specific columns. Access can be controlled

differently so that some users can add data and others can only read data. These privileges
are controlled through the GRANT command.
Like most commands in MySQL there are myriad options, too numerous to list. Code 30.10
shows just a few of these commands. In line 1, all privileges on all databases are granted
to the user someuser connecting from somehost.

Code 30.10 Granting privileges.


1 mysql> GRANT ALL ON *.* TO ’someuser’@’somehost’;
2 mysql> GRANT SELECT, INSERT ON *.* TO ’someuser’@’somehost’;
3 mysql> GRANT SELECT, INSERT ON Movies.* TO ’someuser’@’somehost’;
4 mysql> GRANT SELECT (FirstName), INSERT (LastName) ON Movies.Actors
5 TO ’someuser’@’somehost’;
6 mysql> GRANT ALL ON *.* TO ’someuser’@’localhost’;
7 mysql> DROP USER ’badboy’@’localhost’;
8 mysql> GRANT ALL ON *.* TO ’Bill’@’localhost’ IDENTIFIED BY ’mypass’;

Line 2 grants only the SELECT and INSERT commands. Line 3 grants those commands
for just the Movies database. Lines 4 and 5 assign the privilege of SELECT to one column
and INSERT to a second column of the Actors table. Line 6 grants privileges only when
the user is logged into the local computer. Line 7 eliminates the user named “badboy”.
Line 8 grants privileges to Bill but requires Bill to use the password “mypass”.

30.4 The Simple Query

In the MySQL language every command must end with a semicolon. This allows the user
to write a command that extends multiple lines with each line ending with a typed newline
character. In that fashion, long commands can be typed in an organized manner that is
easier for the user to read. Convention is that MySQL keywords are typed as capital
letters and the user defined fields and variables are typed in lowercase. This is merely
a convenience for the human reader as MySQL does not distinguish between upper and
lowercase commands. Some of the example queries that follow will return long answers
and only the first few rows are printed here.
Finally, the query language shown here is for MySQL. Users of LibreOffice Base
or Microsoft Access may find that they need to make some minor changes to appease the
dialect of their engine. A couple of notable changes are that some of the field names are
also MySQL keywords. For example, the word year is a field name in the movies table
and also a keyword. If the user is referring to the field name then it may be necessary
to enclose the word in quotes, as in SELECT "year" FROM movies. Another item is that
division of two integers in MySQL returns a float. In LO Base it returns an integer. So, it
is necessary to convert an integer to a float using the CONVERT command, as in SELECT
AVG(CONVERT(grade,float)) FROM movies.

The basic query is of the form SELECT field FROM table WHERE condition. The
SELECT field defines the data fields that will be returned. The FROM table defines which
table is being used. In this chapter, the queries will use only a single table. Queries
with multiple tables are reviewed in Chapter 31. The WHERE condition defines which
records will be returned. Without this part of the command the query would return all of
the data from the table.
Consider again Query Q1 from Section 28.2. This seeks the name of the movie with
mid = 200. Code 30.11 shows the basic command that selects the name of the movie from
the table named movies for only the film that has mid = 200.

Code 30.11 The basic query.


1 mysql> SELECT name FROM movies WHERE mid=200;
2 +------------------------------+
3 | name |
4 +------------------------------+
5 | Once Upon a Time in the West |
6 +------------------------------+
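A query of this SELECT field FROM table WHERE condition form can also be exercised with Python's built-in sqlite3 module. The sketch below (sqlite3 standing in for MySQL) loads two rows from the movie database and repeats the query:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE movies (mid INTEGER PRIMARY KEY, '
            'name TEXT, year INTEGER, grade INTEGER)')
cur.executemany('INSERT INTO movies VALUES (?,?,?,?)',
                [(200, 'Once Upon a Time in the West', 1968, 10),
                 (201, 'Sleepy Hollow', 1999, 7)])

# SELECT field FROM table WHERE condition
answer = cur.execute('SELECT name FROM movies WHERE mid=200').fetchall()
```

Only the name field of the single matching record is returned.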

Query Q2 seeks movies released in the year 1955. The query is shown in Code 30.12.
This command is similar to Code 30.11 except that the condition is changed. As seen the
query returned four movies that fit this condition.

Code 30.12 Selecting movies in a specified year.


1 mysql> SELECT name FROM movies WHERE year=1955;
2 +------------------------+
3 | name |
4 +------------------------+
5 | The Trouble with Harry |
6 | To Catch a Thief |
7 | Ordet |
8 | A Bullet for Joey |
9 +------------------------+

The three basic data types are numbers, dates, and strings. The first, of course,
represents numerical data and the last represents textual data. Databases, particularly in
commerce, also rely heavily on dates and times. Thus, there are data types specifically
dedicated to the representation of time and dates.

30.4.1 Numbers

The most common types of numerical data are integers, decimals and floating point values.
Integers are whole numbers such as ... -2, -1, 0, 1, 2 ... The decimals are non-integers with

a dedicated precision. The number of digits before and/or after the decimal place are set
by the user. These types of numbers are useful for currency which has a finite precision
of 1 cent. Floating point numbers are real numbers.

30.4.1.1 Integers

Even within the class of integers there are several different types. These are listed in Table
30.1 and differ in their precision and thus their range of values. Integers with a small range
consume fewer bytes. For small applications this is not really a concern, but in extremely
large databases the users must also manage their consumption of disk space.

Table 30.1: Integer Types

Type Bytes Signed Lo Signed Hi Unsigned Lo Unsigned Hi


TINYINT 1 -128 127 0 255
SMALLINT 2 -32768 32767 0 65535
MEDIUMINT 3 -8388608 8388607 0 16777215
INT 4 -2147483648 2147483647 0 4294967295
BIGINT 8 -9223372036854775808 9223372036854775807 0 18446744073709551615

Each integer has a signed and an unsigned version. The signed versions use one
bit to represent the sign and thus have one less bit to represent the value. Thus, the
signed maximum value is just under one half of the unsigned maximum.
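The ranges in Table 30.1 follow directly from the byte counts: an n-byte signed integer spans -2^(8n-1) to 2^(8n-1)-1, while the unsigned version spans 0 to 2^(8n)-1. A quick Python check of this arithmetic:

```python
# Derive the signed and unsigned ranges of Table 30.1 from byte counts.
def int_range(nbytes, signed=True):
    bits = 8 * nbytes
    if signed:
        return (-2**(bits - 1), 2**(bits - 1) - 1)
    return (0, 2**bits - 1)

print(int_range(1))         # (-128, 127)               signed TINYINT
print(int_range(4))         # (-2147483648, 2147483647) signed INT
print(int_range(4, False))  # (0, 4294967295)           unsigned INT
```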
It would seem that the INT type would suffice for most applications, as its maximum
value is over 4 billion. However, Paris japonica is a plant that sports the largest known
genome,[wikipedia, 2016] with 150 billion bases. So even the number of bases in a single
plant is beyond the range of the INT data type; a BIGINT would be required.

30.4.1.2 Decimals

The NUMERIC or DECIMAL data types define a decimal number with a defined number of
digits before and after the decimal point. These are used in cases in which exact precision
is required, such as currency. The floating point type (next topic) can induce bit errors,
presenting $0.01 as $0.0099999. This is not acceptable, and in such cases a DECIMAL
or NUMERIC type should be used.
The syntax is myvar DECIMAL(m, d) where m is the total number of digits and d is
the number of digits after the decimal point.
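The rounding hazard that motivates DECIMAL can be seen directly in Python, whose float is the same binary floating point behind FLOAT and DOUBLE, while the standard decimal module mirrors the exact fixed-precision behavior that currency requires:

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
print(0.1 + 0.2)                          # 0.30000000000000004
# Fixed-precision decimals add exactly, as currency arithmetic requires.
print(Decimal("0.10") + Decimal("0.20"))  # 0.30
```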

30.4.1.3 Floating Point

A FLOAT or REAL is a generic floating point variable stored in 4 bytes. The DOUBLE or
DOUBLE PRECISION data type uses 8 bytes. It is possible to declare the number of digits
as in myvar FLOAT(m,d) for cases when a nonstandard precision is required.

Table 30.2: Date and time.

Type Format
DATE ’YYYY-MM-DD’
DATETIME ’YYYY-MM-DD HH:MM:SS’
TIME ’HH:MM:SS’
YEAR(2) or YEAR(4) Year with specified digits

30.4.1.4 Bit

The BIT data type stores a specified number of bits, as in myvar BIT(m). This type is
useful for cases in which only a small number of values are used. For example, if a variable
can only assume the values 0, 1, 2, or 3, then a BIT(2) type is far more efficient. Even if
the database is small in size this data type can be useful, as it prevents the variable
from assuming a value outside of the range.

30.4.2 Default Values

A default value is assigned to an attribute when data is loaded into the database without
a value for that attribute. The user, of course, can override the default value. This is an
optional argument. An example is shown in Code 30.13.

Code 30.13 Creating a table with a default value.


1 CREATE TABLE student
2 (pid INT, name TEXT, school VARCHAR(100)
3 DEFAULT ’George Mason’);
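The default behavior of Code 30.13 can be sketched with Python's sqlite3 module, which honors the same DEFAULT clause; the student row here is invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE student
               (pid INTEGER, name TEXT,
                school VARCHAR(100) DEFAULT 'George Mason')""")
# Insert a row without a school; the default fills the gap.
con.execute("INSERT INTO student (pid, name) VALUES (1, 'Ada')")
row = con.execute("SELECT school FROM student WHERE pid=1").fetchone()
print(row[0])  # George Mason
```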

30.4.3 Dates

Date and time information can be stored in different formats which are shown in Table
30.2.

30.4.4 Strings

A string is a collection of characters and MySQL offers many different types of strings
since their uses are so varied.
The CHAR(m) data type allocates memory for m characters even if the input data
does not actually need all m characters.

The VARCHAR(m) type allows for up to m characters but does not consume all of the
m bytes if the data is shorter.
The BINARY(m) and VARBINARY(m) data types are similar to CHAR and VARCHAR
except that the data is considered to be binary instead of text characters.
A BLOB is similar to the BINARY in that it stores a byte string without regard to the
ASCII representation of the data. There are four types: TINYBLOB, BLOB, MEDIUMBLOB,
and LONGBLOB, which can store lengths of 2^8-1, 2^16-1, 2^24-1, and 2^32-1 bytes respectively.
The TEXT data type stores long textual strings and comes in four similar types:
TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT, which can store lengths of 2^8-1, 2^16-1,
2^24-1, and 2^32-1 bytes respectively. Thus, a TEXT can store up to 64 kilobytes, a MEDIUMTEXT
can store up to 16 megabytes, and a LONGTEXT can store up to 4 gigabytes.

30.4.5 Enumeration and Sets

An enumeration is a collection of string objects from which the attribute can assume a
value. Basically, the attribute can only be one of the members of the enumeration. The
creation of a table with an enumeration has the form shown in Code 30.14.
Here the attribute name can only be one of three strings.

Code 30.14 Creating an enumeration.


1 CREATE TABLE sizes (
2 name ENUM (’small’, ’medium’, ’large’));

A SET can have zero or more members, and the maximum number of unique strings
is 64.

30.4.6 Spatial Data

MySQL has data types that correspond to OpenGIS classes. Some of these types hold
single geometry values:

• GEOMETRY

• POINT

• LINESTRING

• POLYGON

GEOMETRY can store geometry values of any type. The other single-value types
(POINT, LINESTRING, and POLYGON) restrict their values to a particular geometry type.

Table 30.3: Converting data.

Type Format
BINARY Convert to binary
CAST Converts to specified type
CONVERT Converts to specified type

30.5 Conversions

Data can be stored as one type but during retrieval can be converted to another type.
The three major functions are shown in Table 30.3.
The example shown in Code 30.15 retrieves the mid from the table movies and
converts this integer into a decimal for display. This does not change the stored data, only
the format of the retrieved data.

Code 30.15 Example of CAST.


1 mysql> SELECT CAST(mid AS DECIMAL) FROM movies WHERE year=1980;
2 +----------------------+
3 | CAST(mid AS DECIMAL) |
4 +----------------------+
5 | 314.00 |
6 +----------------------+
7 1 row in set (0.07 sec)
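A rough equivalent can be sketched in sqlite3, which lacks a DECIMAL type, so this casts between the types sqlite does know:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# CAST changes only the type of the retrieved value, not the stored data.
as_int = con.execute("SELECT CAST('314' AS INTEGER)").fetchone()[0]
as_real = con.execute("SELECT CAST(314 AS REAL)").fetchone()[0]
print(as_int, as_real)  # 314 314.0
```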

30.6 Mathematics in MySQL

MySQL contains many functions that facilitate the construction of queries. Queries can
include mathematical processing of both the query conditions and the query response.

30.6.1 Math Operators

Basic math functions are available in MySQL. Standard mathematical operators are shown
in Table 30.4
An example is shown in Code 30.16 where the returned grade of the movie is mul-
tiplied by 2. In this case the original grade was 6 and the returned answer is 12.

Table 30.4: Math operators.

Type Description
DIV Integer division
/ Division
- Subtraction
% or MOD Modulus
+ Addition
* Multiplication
- (unary) Change sign

Code 30.16 Example of a math operator.


1 mysql> SELECT 2*grade FROM movies WHERE mid=444;
2 +---------+
3 | 2*grade |
4 +---------+
5 | 12 |
6 +---------+
7 1 row in set (0.04 sec)

30.6.2 Math Functions

Table 30.5 shows the math functions which operate on the returned value or the arguments
used in WHERE statements.
Code 30.17 shows a case in which the math function is applied to the argument rather than
the returned value. Here the input value of 4.5 is rounded and converted to
an integer. This is used as the mid, and the mid and title of that movie are returned.

Code 30.17 Example of a math function.


1 mysql> SELECT mid,title FROM movies
2 WHERE mid=ROUND(4.5);
3 +-----+---------+
4 | mid | title |
5 +-----+---------+
6 | 5 | Amadeus |
7 +-----+---------+
8 1 row in set (0.14 sec)
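The same idea can be sketched in sqlite3; note that sqlite's ROUND(4.5) yields the real value 5.0 rather than MySQL's integer-style result, but the comparison against mid still succeeds. The single row is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (mid INTEGER, title TEXT)")
con.execute("INSERT INTO movies VALUES (5, 'Amadeus')")
# The math function is applied to the argument of the condition.
row = con.execute(
    "SELECT mid, title FROM movies WHERE mid=ROUND(4.5)").fetchone()
print(row)  # (5, 'Amadeus')
```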

30.6.3 Operators

Other operators are shown in Table 30.6.

Table 30.5: Math functions.

Type Description Type Description


ABS Absolute value LOG10 log
ACOS Arccosine LOG2 Base 2 log
ASIN Arcsine LOG Natural log
ATAN Arctangent MOD Remainder
ATAN2 Arctangent PI Value of pi
CEIL Ceiling POW Raise to the power
CEILING Ceiling POWER Raise to the power
COS Cosine RADIANS Degrees to radians
COT Cotangent RAND Random float
CRC32 Redundancy check ROUND Rounds
DEGREES Radians to degrees SIGN Sign
EXP Exponent SIN Sine
FLOOR Floor SQRT Square root
LN Natural log TAN Tangent
TRUNCATE Truncate

Table 30.6: Other operators.

Type Description Type Description


AND, && Logical AND !=, <> Not equals
& Bitwise AND >, >=, <, <= Comparisons
OR, || Logical OR IS NOT NULL Value test
| Bitwise OR IS NOT Boolean test
ˆ Bitwise XOR IS NULL Value test
= Assign IS Boolean test
:= Assign <<, >> Bit shifts
BETWEEN...AND... Set range LIKE Pattern match
CASE Case operator NOT, ! Negation
<=> Null safe equals REGEXP Regular expression
= Equals to SOUNDS LIKE Compare sounds

30.6.4 Hierarchy

It is possible for a MySQL command to have multiple operators. The following list
depicts the operators in the order in which they are executed within a command.

1. INTERVAL
2. BINARY, COLLATE
3. !
4. - (unary minus), ~ (unary bit inversion)
5. ˆ
6. *, /, DIV, %, MOD
7. -, +
8. <<, >>
9. &
10. |
11. = (comparison), <=>, >=, >, <=, <, <>, ! =, IS, LIKE, REGEXP, IN
12. BETWEEN, CASE, WHEN, THEN, ELSE
13. NOT
14. AND, &&
15. XOR
16. OR, ||
17. = (assignment), :=

For example, given the expression 5 + 6 ∗ 3, the hierarchy shown above executes the
multiplication before the addition, so the answer is 23. If the user wishes to change
the order then parentheses are employed. If the desire is for the addition to be performed
first then the user should write (5 + 6) ∗ 3, which produces the answer 33.
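The precedence example can be evaluated through SQL itself; sqlite3 applies the same multiplication-before-addition rule as MySQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
no_parens = con.execute("SELECT 5 + 6 * 3").fetchone()[0]
with_parens = con.execute("SELECT (5 + 6) * 3").fetchone()[0]
print(no_parens, with_parens)  # 23 33
```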

30.6.5 Aggregate Functions

The aggregate functions are shown in Table 30.7. There are commands for simple
mathematical information such as the average or standard deviation.

Table 30.7: Aggregate functions.

Type Description
AVG() Return the average value of the argument
BIT AND() Return bitwise and
BIT OR() Return bitwise or
BIT XOR() Return bitwise xor
COUNT(DISTINCT) Return the count of a number of different values
COUNT() Return a count of the number of rows returned
GROUP CONCAT() Return a concatenated string
MAX() Return the maximum value
MIN() Return the minimum value
STD() Return the population standard deviation
STDDEV POP() Return the population standard deviation
STDDEV SAMP() Return the sample standard deviation
STDDEV() Return the population standard deviation
SUM() Return the sum
VAR POP() Return the population variance
VAR SAMP() Return the sample variance
VARIANCE() Return the population variance

30.6.6 Sample Queries

The tools in the previous tables are useful for several of the queries from the list in Section
28.2. Query Q3 seeks the list of movies with a grade above a certain value. The query
and the first five returned movies are shown in Code 30.18. The condition is that the
grade is greater than or equal to a given value.

Code 30.18 Selecting movies from a grade range.


1 mysql> SELECT title, grade, mid FROM movies WHERE grade>=9;
2 +------------------------------------+-------+-----+
3 | title | grade | mid |
4 +------------------------------------+-------+-----+
5 | A Face in the Crowd | 9 | 1 |
6 | A Touch of Evil | 9 | 3 |
7 | Amadeus | 9 | 5 |
8 | An American Tail: Fievel Goes West | 10 | 6 |

Query Q4 seeks the movies in a decade, which means that there are upper and lower
limits on the year. Code 30.19 shows two possible commands that return the same
results. The first four results are also shown. In either case, the movie must have a year
between 1950 and 1959 (inclusive) to be returned by this query.

Code 30.19 Selecting movies from a year range.


1 mysql> SELECT mid, title, year FROM movies WHERE year>=1950 AND year<1960;
2 mysql> SELECT mid, title, year FROM movies WHERE year BETWEEN 1950 AND 1959;
3 +-----+-----------------------------+------+
4 | mid | title | year |
5 +-----+-----------------------------+------+
6 | 3 | A Touch of Evil | 1958 |
7 | 248 | Ben-Hur | 1959 |
8 | 250 | Around the World in 80 Days | 1950 |
9 | 266 | Rear Window | 1954 |

The DISTINCT keyword is used to display each returned answer only once. Query Q5 seeks
the years that contained movies with the worst grade of 1. However, there may be some
years that have more than one movie with that grade, and the goal is to return each year
only once. The query is shown in Code 30.20. There is a movie that does not have a year
assigned to it, and the requirement that year>1900 excludes this movie. The years are
returned and there are no duplicates. There is also no apparent order to the returned
years: the first movie in the database with a grade of 1 is from the year 2007, and thus it
is the first year shown.

Code 30.20 Selecting years with movie with a grade of 1.


1 mysql> SELECT DISTINCT year FROM movies WHERE grade=1 AND year>1900;
2 +------+
3 | year |
4 +------+
5 | 2007 |
6 | 2005 |
7 | 2004 |
8 | 1993 |
9 | 2003 |
10 | 2008 |
11 | 1995 |
12 | 2009 |
13 | 1997 |
14 | 1999 |
15 +------+
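A sketch of DISTINCT in sqlite3 over invented rows: 2007 holds two grade-1 movies but appears only once in the result (sorted here for a stable display):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (mid INTEGER, year INTEGER, grade INTEGER)")
con.executemany("INSERT INTO movies VALUES (?,?,?)",
                [(1, 2007, 1), (2, 2007, 1), (3, 2005, 1), (4, 2005, 8)])
rows = con.execute(
    "SELECT DISTINCT year FROM movies WHERE grade=1").fetchall()
years = sorted(r[0] for r in rows)
print(years)  # [2005, 2007]
```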

Query Q6 seeks the number of actors in the movie with mid = 200. This query
is not seeking the list of actors, just their number. The COUNT command returns
the number of items retrieved. The number of entries in the isin table with mid =
200 is the number of actors. Thus, the command in Code 30.21 returns the count of the
number of rows from this table that meet the condition.
Code 30.21 Returning the number of actors from a specified movie.
1 mysql> SELECT COUNT(aid) FROM isin WHERE mid=200;
2 +------------+
3 | COUNT(aid) |
4 +------------+
5 | 5 |
6 +------------+

Query Q7 is to return the average grade of the movies from the 1950s. The
appropriate command to employ is AVG. A solution is shown in Code 30.22. It should be
noted that in some dialects of MySQL the average over integer values is returned as
an integer. The solution is to convert the data to floats before the average is computed, as
in AVG(CONVERT(grade,float)).

Code 30.22 The average grade of the movies in the 1950’s.


1 mysql> SELECT AVG(grade) FROM movies WHERE year BETWEEN 1950 AND 1959;
2 +------------+
3 | AVG(grade) |
4 +------------+
5 | 6.9000 |
6 +------------+

As seen in Code 30.22, the function is listed in the heading over the results. In this
case that heading is not too long, but in other cases that employ multiple functions the
heading can intrude on the presentation of the results. The solution is to rename
the function with AS, as shown in Code 30.23. This renaming actually has a much bigger
purpose: in more complicated queries it is possible that a phrase (such as AVG(grade))
is repeated in the query. Relabeling the function allows the user to use the new
name throughout the query.

Code 30.23 A demonstration of AS.


1 mysql> SELECT AVG(grade) AS ag FROM movies WHERE year BETWEEN 1950 AND 1959;
2 +--------+
3 | ag |
4 +--------+
5 | 6.9000 |
6 +--------+
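A sketch of AVG with AS over a few invented grades; sqlite3 shares this aggregate syntax, though it reports a plain float rather than MySQL's fixed-point 6.9000 display:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (mid INTEGER, year INTEGER, grade INTEGER)")
con.executemany("INSERT INTO movies VALUES (?,?,?)",
                [(1, 1954, 7), (2, 1958, 9), (3, 1955, 5), (4, 1980, 6)])
row = con.execute("""SELECT AVG(grade) AS ag FROM movies
                     WHERE year BETWEEN 1950 AND 1959""").fetchone()
print(row[0])  # 7.0, the average of 7, 9, and 5
```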

Multiple functions are used in Query Q8, which seeks the average and standard
deviation of the length of the movie titles. A solution is shown in Code 30.24.

Code 30.24 Statistics on the length of the movie titles.
1 mysql> SELECT AVG(LENGTH(title)), STD(LENGTH(title)) FROM movies;
2 +--------------------+--------------------+
3 | AVG(LENGTH(title)) | STD(LENGTH(title)) |
4 +--------------------+--------------------+
5 | 15.1025 | 8.070749268190655 |
6 +--------------------+--------------------+

30.7 String Functions

There are numerous functions that apply to strings, and Table 30.8 through
Table 30.14 display them grouped by subcategory.

Table 30.8: String length and representation operators.

Type Description
ASCII Return numeric value of left-most character
BIT LENGTH Return length of argument in bits
CHAR LENGTH Return number of characters in argument
FORMAT Return a number formatted to specified number of decimal places
HEX Return a hexadecimal representation of a decimal or string value
LENGTH Return the length of a string in bytes
ORD Return character code for leftmost character of the argument

Table 30.9: Pattern matching string operators.

Type Description
FIELD() Return the index of the first argument in the subsequent arguments
LIKE Simple pattern matching
LOCATE() Return the position of the first occurrence of substring
MATCH Perform full-text search
NOT LIKE Negation of simple pattern matching
POSITION() Synonym for LOCATE()
SOUNDEX() Return a soundex string
SOUNDS LIKE Compare sounds
STRCMP() Compare two strings
SUBSTRING INDEX() Return a substring of a specified number of occurrences

Query Q9 seeks the first names of the actors with the last name Keaton. The condition
for equating a string is similar to that for equating a numerical value. Code 30.25 shows
this example.
Query Q10 seeks actors who have “John” in their first name. In this case the first

Table 30.10: Conversion and indexing string operators.

Type Description
BIN() Return a string containing binary representation of a number
CHAR() Return the character for each integer passed
ELT() Return string at index number
FIND IN SET() Return the index position of the first argument within the second argument
INSTR() Return the index of the first occurrence of substring
OCT() Return a string containing octal representation of a number
UNHEX() Return a string containing hex representation of a number

Table 30.11: Substring operators.

Type Description
LEFT() Return the leftmost number of characters as specified
MID() Return a substring starting from the specified position
LTRIM() Remove leading spaces
RTRIM() Remove trailing spaces
RIGHT() Return the specified rightmost number of characters
SUBSTR() Return the substring as specified
SUBSTRING() Return the substring as specified
TRIM() Remove leading and trailing spaces

Table 30.12: Capitalization operators.

Type Description
LCASE() Synonym for LOWER()
LOWER() Return the argument in lowercase
UCASE() Synonym for UPPER()
UPPER() Convert to uppercase

Table 30.13: Alteration operators.

Type Description
CONCAT WS() Return concatenate with separator
CONCAT() Return concatenated string
EXPORT SET() Return a string such that for every bit set in the value bits
INSERT() Insert a substring at the specified position up to the specified number of characters
LPAD() Return the string argument, left-padded with the specified string
MAKE SET() Return a set of comma-separated strings that have the corresponding bit in bits set
REPEAT() Repeat a string the specified number of times
REPLACE() Replace occurrences of a specified string
REVERSE() Reverse the characters in a string
RPAD() Append string the specified number of times
SPACE() Return a string of the specified number of spaces

Table 30.14: Miscellaneous operators.

Type Description
LOAD FILE() Load the named file
NOT REGEXP Negation of REGEXP
QUOTE() Escape the argument for use in an SQL statement
REGEXP Pattern matching using regular expressions
RLIKE Synonym for REGEXP

Code 30.25 Finding the Keatons.


1 mysql> SELECT firstname FROM actors WHERE lastname=’Keaton’;
2 +-----------+
3 | firstname |
4 +-----------+
5 | Michael |
6 | Diane |
7 | Buster |
8 +-----------+

name itself is not necessary, just those four letters. The LIKE function uses wild cards to
search for a sequence of letters embedded in a text entry. The percent sign is used for an
undetermined number of letters and an underscore is used for a single letter. To find a
first name with any number of letters before and after “John”, the percent signs are used
as shown in Code 30.26.

Code 30.26 Finding the Johns.


1 mysql> SELECT firstname,lastname FROM actors WHERE firstname LIKE ’%John%’;
2 +-----------+--------------+
3 | firstname | lastname |
4 +-----------+--------------+
5 | John | Belushi |
6 | Johnny | Depp |
7 | John | Turturro |
8 | John | Candy |

Query Q11 seeks the actors who have two parts in their first name. These two parts
are separated by a single space, and so the equivalent search is to find the first names with
a blank space. It is possible to search on a blank space between two percent signs as
in “% %”. However, this would also include entries that begin or end with a blank space.
Code 30.27 shows a better search which uses underscores to ensure that there is at
least one character before and one character after the blank space. Combined with the
percent signs, this search finds names that have one or more letters before and after the
blank space.

Code 30.27 Finding the actors with two parts to their first name.
1 mysql> SELECT firstname,lastname FROM actors WHERE firstname LIKE ’%_ _%’;
2 +----------------+----------+
3 | firstname | lastname |
4 +----------------+----------+
5 | F. Murray | Abraham |
6 | James (Jimmy) | Stewart |
7 | Michael J. | Fox |
8 | M. Emmet | Walsh |
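Both wildcard patterns can be sketched in sqlite3, whose LIKE uses the same % and _ wildcards; the four names are an invented sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actors (firstname TEXT, lastname TEXT)")
con.executemany("INSERT INTO actors VALUES (?,?)",
                [("John", "Belushi"), ("Johnny", "Depp"),
                 ("F. Murray", "Abraham"), ("Diane", "Keaton")])

# % matches any run of characters, so this catches John and Johnny.
johns = [r[0] for r in con.execute(
    "SELECT firstname FROM actors WHERE firstname LIKE '%John%'")]
print(johns)  # ['John', 'Johnny']

# _ forces a character on each side of the space: two-part first names.
twopart = [r[0] for r in con.execute(
    "SELECT firstname FROM actors WHERE firstname LIKE '%_ _%'")]
print(twopart)  # ['F. Murray']
```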

Query Q12 returns the actors that have matching initials in their names. The
SUBSTR function extracts a substring from a string. The function has three arguments:
the string, the starting location of the extraction, and the length of the extraction.
The first initial, then, is the substring that starts at location 1 and has a length of 1. Code
30.28 shows Query Q12, which finds the actors that have matching initials. Basically,
the first letter of the first name must be the same as the first letter of the last name.

Code 30.28 Finding the actors with identical initials.


1 mysql> SELECT firstname, lastname FROM actors
2 WHERE SUBSTR(firstname,1,1)=SUBSTR(lastname,1,1);
3 +------------+---------------+
4 | firstname | lastname |
5 +------------+---------------+
6 | Dom | DeLuise |
7 | Alan | Alda |
8 | Nick | Nolte |
9 | Chevy | Chase |
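A sketch of the matching-initials comparison in sqlite3, where SUBSTR has the same three-argument form; the names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actors (firstname TEXT, lastname TEXT)")
con.executemany("INSERT INTO actors VALUES (?,?)",
                [("Alan", "Alda"), ("Nick", "Nolte"), ("Diane", "Keaton")])
# Keep rows whose first-name initial equals the last-name initial.
rows = con.execute("""SELECT firstname, lastname FROM actors
                      WHERE SUBSTR(firstname,1,1)=SUBSTR(lastname,1,1)"""
                   ).fetchall()
print(rows)  # [('Alan', 'Alda'), ('Nick', 'Nolte')]
```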

30.8 Limits and Sorts

Queries can return a large number of rows when the user needs to see only a few. One
example would be finding the best records according to a criterion. The query could sort all
of the data, but the user needs to see only the top few returns. Control of the number of
records returned is managed by the LIMIT function. Code 30.29 shows a simple example
that returns just three of the actors with the first name John. These are the first three
that are stored in the database.
Sorting is controlled by the ORDER BY command. This identifies which field is to
be used in sorting. If the data is text then the sort is alphabetical. If the data is numeric
Code 30.29 Example of the LIMIT function.
1 mysql> SELECT lastname FROM actors
2 WHERE firstname=’John’ LIMIT 3;
3 +----------+
4 | lastname |
5 +----------+
6 | Belushi |
7 | Turturro |
8 | Candy |
9 +----------+

then the data is sorted by value. The keywords DESC and ASC indicate whether
the sort is from high to low or from low to high, with the latter being the default.
Query Q13 is to list the actors’ last names in alphabetical order for those actors
whose first name is John. Code 30.30 shows the result, in which line 2 defines the search
conditions and line 3 sorts the data. Without line 3 the data is returned in the order in
which it was entered into the database. To reverse the order of the returned answer the
command would be changed to ORDER BY lastname DESC.

Code 30.30 Sorting a simple search.


1 mysql> SELECT lastname FROM actors
2 WHERE firstname=’John’
3 ORDER BY lastname;
4 +--------------+
5 | lastname |
6 +--------------+
7 | Belushi |
8 | Candy |
9 | Carradine |
10 | Cleese |
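A sketch of ORDER BY combined with LIMIT in sqlite3 over invented rows; the alphabetical sort happens before the cutoff:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actors (firstname TEXT, lastname TEXT)")
con.executemany("INSERT INTO actors VALUES (?,?)",
                [("John", "Turturro"), ("John", "Belushi"),
                 ("John", "Candy"), ("John", "Cleese")])
rows = [r[0] for r in con.execute("""SELECT lastname FROM actors
                                     WHERE firstname='John'
                                     ORDER BY lastname LIMIT 3""")]
print(rows)  # ['Belushi', 'Candy', 'Cleese']
```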

Query Q14 is to list the movies according to the length of their titles. The LENGTH
function returns the length of the string, and it is to this function that the sort is
applied. The result is shown in Code 30.31. By default the sort would be from the smallest
to the largest length, but the DESC keyword reverses that order, and only the first 5 are
printed.
Query Q15 is to sort the actors by the location of the substring ‘as’ in their first
name. This uses the LOCATE function, which returns the location of a substring within a
string. The function is used twice in this query: first to find the locations, and
second to use that information as the sorting criterion. When a function is used twice
with the same arguments it is both convenient and efficient to rename that application of
Code 30.31 The movies with the longest titles.
1 mysql> SELECT title FROM movies
2 ORDER BY LENGTH(title) DESC LIMIT 5;
3 +-------------------------------------------------------------------------+
4 | title |
5 +-------------------------------------------------------------------------+
6 | Everything You Always Wanted to Know About Sex * But Were Afraid to Ask |
7 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb |
8 | Pirates of the Caribbean: The Curse of the Black Pearl |
9 | Marilyn Hotchkiss’ Ballroom Dancing & Charm School |
10 | The Russians are Coming, the Russians are Coming |
11 +-------------------------------------------------------------------------+

the function with the AS command. The query is shown in Code 30.32. The location of
the target substring is shown in line 1 and renamed as L. Then in line 2 the sorting is over
this same L.

Code 30.32 Sorting actors by the location of ‘as’.


1 mysql> SELECT firstname, lastname, LOCATE(’as’,firstname) AS L
2 FROM actors WHERE firstname LIKE ’%as%’ ORDER BY L;
3 +----------------+----------+------+
4 | firstname | lastname | L |
5 +----------------+----------+------+
6 | Jason | Robards | 2 |
7 | Jasmine | Guy | 2 |
8 | Ceasar | Romero | 3 |
9 | Sebastian | Cabot | 4 |
10 | Sebastian | Koch | 4 |
11 | Lucas | Haas | 4 |
12 | Thomas Haden | Church | 5 |
13 | Nicholas | Cage | 7 |
14 | Jennifer Jason | Leigh | 11 |
15 +----------------+----------+------+

30.9 Grouping

Grouping data in MySQL is the act of collecting data according to a certain criterion.
Consider the case of Query Q16, which is to compute the average movie grade for each
year. Each year can have several movies, and so the data needs to be collected by year.
This action is quite similar to a nested for loop. If this function were to be written
in Python then the user would create a for loop over each year and then collect the data
for that year in a nested for loop.
In MySQL the GROUP BY command is used to collect data. It is applied to the
same variable that drives the outer for loop in the Python analogy. Query Q16 is shown in
Code 30.33. The values to be returned are the year and the average grade for that year.
Line 3 uses the GROUP BY command to collect the data by year. For each year, the average
grade is computed.

Code 30.33 Determining the average grade for each year.


1 mysql> SELECT year, AVG(grade) AS g
2 FROM movies WHERE year>1900
3 GROUP BY year;
4 +------+---------+
5 | year | g |
6 +------+---------+
7 | 1928 | 9.0000 |
8 | 1929 | 6.0000 |
9 | 1931 | 10.0000 |
10 | 1932 | 10.0000 |

Query Q17 is to also sort the data from the best year to the worst according to this
average grade. The GROUP BY command collects the data by year and the ORDER
BY command changes the order of the answer. The query is shown in Code 30.34.

Code 30.34 Sorting the years by average grade.


1 mysql> SELECT year, AVG(grade) AS g FROM movies
2 GROUP BY year
3 ORDER BY g DESC;
4 +------+---------+
5 | year | g |
6 +------+---------+
7 | 1948 | 10.0000 |
8 | 1931 | 10.0000 |
9 | 1932 | 10.0000 |
10 | 1957 | 9.5000 |

This command works well, but it includes years that have just a few movies. It is not
really fair to compare the movies of 1948 to other years if 1948 has only one movie. So
Query Q18 adds the restriction that a year must have more than five movies or it is
not considered.
This is the same as putting an if statement inside of the for loop in Python. The
MySQL command is GROUP BY ... HAVING, where the HAVING clause acts as the if

statement. The example is shown in Code 30.35 where the condition is that there must be
more than 5 movies. The COUNT function is applied to the mid since it is the primary
key.

Code 30.35 Restricting the search to years with more than 5 movies.
1 mysql> SELECT year, AVG(grade) AS g FROM movies
2 GROUP BY year HAVING COUNT(mid)>5
3 ORDER BY g DESC;
4 +------+--------+
5 | year | g |
6 +------+--------+
7 | 1944 | 7.0000 |
8 | 1968 | 6.8750 |
9 | 1975 | 6.8333 |
10 | 2000 | 6.7500 |
11 | 2006 | 6.6579 |
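A sketch of GROUP BY ... HAVING in sqlite3 with invented data: 1948 has only one movie, so the HAVING clause drops it from the per-year averages:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (mid INTEGER, year INTEGER, grade INTEGER)")
con.executemany("INSERT INTO movies VALUES (?,?,?)",
                [(1, 1948, 10),
                 (2, 1957, 9), (3, 1957, 10), (4, 1957, 8)])
rows = con.execute("""SELECT year, AVG(grade) AS g FROM movies
                      GROUP BY year HAVING COUNT(mid) > 1
                      ORDER BY g DESC""").fetchall()
print(rows)  # [(1957, 9.0)] -- the lone 1948 movie is filtered out
```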

30.10 Time and Date

The functions for dates and times are numerous and are simply listed here.

• ADDDATE() Add time values (intervals) to a date value
• ADDTIME() Add time
• CONVERT TZ() Convert from one timezone to another
• CURDATE() Return the current date
• CURRENT DATE(), CURRENT DATE Synonyms for CURDATE()
• CURRENT TIME(), CURRENT TIME Synonyms for CURTIME()
• CURRENT TIMESTAMP(), CURRENT TIMESTAMP Synonyms for NOW()
• CURTIME() Return the current time
• DATE ADD() Add time values (intervals) to a date value
• DATE FORMAT() Format date as specified
• DATE SUB() Subtract a time value (interval) from a date
• DATE() Extract the date part of a date or datetime expression
• DATEDIFF() Subtract two dates
• DAY() Synonym for DAYOFMONTH()
• DAYNAME() Return the name of the weekday
• DAYOFMONTH() Return the day of the month (0-31)
• DAYOFWEEK() Return the weekday index of the argument
• DAYOFYEAR() Return the day of the year (1-366)
• EXTRACT() Extract part of a date
• FROM DAYS() Convert a day number to a date
• FROM UNIXTIME() Format UNIX timestamp as a date
• GET FORMAT() Return a date format string
• HOUR() Extract the hour
• LAST DAY Return the last day of the month for the argument
• LOCALTIME(), LOCALTIME Synonym for NOW()
• LOCALTIMESTAMP, LOCALTIMESTAMP() Synonym for NOW()
• MAKEDATE() Create a date from the year and day of year
• MAKETIME() Create time from hour, minute, second
• MICROSECOND() Return the microseconds from argument
• MINUTE() Return the minute from the argument
• MONTH() Return the month from the date passed
• MONTHNAME() Return the name of the month
• NOW() Return the current date and time
• PERIOD ADD() Add a period to a year-month
• PERIOD DIFF() Return the number of months between periods
• QUARTER() Return the quarter from a date argument
• SEC TO TIME() Converts seconds to ’HH:MM:SS’ format
• SECOND() Return the second (0-59)
• STR TO DATE() Convert a string to a date
• SUBDATE() Synonym for DATE SUB() when invoked with three arguments
• SUBTIME() Subtract times
• SYSDATE() Return the time at which the function executes
• TIME FORMAT() Format as time
• TIME TO SEC() Return the argument converted to seconds
• TIME() Extract the time portion of the expression passed
• TIMEDIFF() Subtract time
• TIMESTAMP() With a single argument, return the date or datetime expression; with two arguments, the sum of the arguments
• TIMESTAMPADD() Add an interval to a datetime expression
• TIMESTAMPDIFF() Return the difference of two datetime expressions, in a given unit
• TO DAYS() Return the date argument converted to days
• UNIX TIMESTAMP() Return a UNIX timestamp
• UTC DATE() Return the current UTC date
• UTC TIME() Return the current UTC time
• UTC TIMESTAMP() Return the current UTC date and time
• WEEK() Return the week number
• WEEKDAY() Return the weekday index
• WEEKOFYEAR() Return the calendar week of the date (0-53)
• YEAR() Return the year
• YEARWEEK() Return the year and week

Code 30.36 shows the simple example of retrieving the current date using the
CURDATE() command. There is an equivalent command, CURTIME(), for retrieving the
current time, and NOW() retrieves both, as shown in Code 30.37.

Code 30.36 Using CURDATE.
1 mysql> SELECT CURDATE();
2 +------------+
3 | CURDATE() |
4 +------------+
5 | 2015-06-22 |
6 +------------+
7 1 row in set (0.09 sec)

Code 30.37 Right now.


1 mysql> SELECT NOW();
2 +---------------------+
3 | NOW() |
4 +---------------------+
5 | 2015-06-22 17:14:31 |
6 +---------------------+
7 1 row in set (1.11 sec)
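MySQL's date functions do not carry over verbatim to other engines; sqlite3, for instance, provides similar behavior through its own date() and strftime() functions, so the following is an analogy rather than MySQL syntax (fixed dates are used so the output is reproducible):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Date arithmetic: one day after a fixed date.
next_day = con.execute("SELECT date('2015-06-22', '+1 day')").fetchone()[0]
print(next_day)  # 2015-06-23
# Extract a field, roughly what YEAR() does in MySQL.
year = con.execute("SELECT strftime('%Y', '2015-06-22')").fetchone()[0]
print(year)  # 2015
```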

30.11 Casting

Table 30.15 displays the casting operators that can change the type of data.

Table 30.15: Casting Operators.

Type Description
BINARY Cast a string to a binary string
CAST() Cast a value as a certain type
CONVERT() Cast a value as a certain type

An example is shown in Code 30.38 where an integer is cast into a decimal.

30.12 Decisions

Every language needs the ability to make decisions and MySQL is no different. There are
two types of decisions, the CASE and IF statements, with variants as shown in
Table 30.16.

Code 30.38 Casting data types.
1 mysql> SELECT CAST(4 AS decimal);
2 +--------------------+
3 | CAST(4 AS decimal) |
4 +--------------------+
5 | 4.00 |
6 +--------------------+
7 1 row in set (0.05 sec)
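The same cast can be sketched from Python with the sqlite3 module. SQLite has no DECIMAL storage class, so REAL stands in for MySQL's decimal here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# CAST an integer to a floating-point value (SQLite has no DECIMAL,
# so REAL is used in place of MySQL's decimal type).
value, = cur.execute("SELECT CAST(4 AS REAL)").fetchone()
print(value)  # 4.0

# CAST a string to an integer.
n, = cur.execute("SELECT CAST('42' AS INTEGER)").fetchone()
print(n)  # 42
con.close()
```
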

Table 30.16: Decision operators.

Type Description
CASE Case operator
IF - ELSE If/else construct
IFNULL Null if/else construct
NULLIF Return NULL if expr1 = expr2

30.12.1 CASE-WHEN

The CASE-WHEN construct has the format of


CASE value WHEN [compare value]
THEN result [WHEN [compare value] THEN result ...] [ELSE result] END
The value is the attribute that is being tested. This is followed by WHEN-
THEN statements that indicate the action (result) if the condition is true (value =
compare value).
The task is to retrieve all of the actors whose first name is ‘David’. If the last name is ‘Niven’ then print ‘English’, if the last name is ‘Kelly’ then print ‘Irish’, and for all others print ‘Dunno’. The answer is shown in Code 30.39, which also uses the ELSE clause to indicate the action if none of the conditions are met. The AS ‘Fun’ component sets the column heading shown in line 9.

30.12.2 The IF Statement

The format of the IF statement is

IF(expr1,expr2,expr3)

where expr2 is the output if expr1 is true and expr3 is the output if expr1 is false. This
is very similar to the IF command format in a spreadsheet.

Code 30.39 Using CASE.
1 mysql> SELECT aid, lastname,
2 CASE lastname
3 WHEN ’Kelly’ THEN ’Irish’
4 WHEN ’Niven’ THEN ’English’
5 ELSE ’Dunno’
6 END AS ’Fun’
7 FROM actors WHERE firstname=’David’;
8 +-----+------------+---------+
9 | aid | lastname | Fun |
10 +-----+------------+---------+
11 | 225 | Niven | English |
12 | 244 | Bowie | Dunno |
13 | 339 | Suchet | Dunno |
14 | 486 | Carradine | Dunno |
15 | 519 | Keith | Dunno |
16 | 552 | Straithern | Dunno |
17 | 602 | Kelly | Irish |
18 +-----+------------+---------+
19 7 rows in set (0.03 sec)
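The CASE-WHEN construct is standard SQL and runs unchanged in SQLite, so it can be tried from Python. The three rows below are a made-up miniature of the actors table, using aids taken from the output above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# A tiny stand-in for the actors table used in the text.
cur.execute("CREATE TABLE actors (aid INTEGER, firstname TEXT, lastname TEXT)")
cur.executemany("INSERT INTO actors VALUES (?,?,?)",
                [(225, 'David', 'Niven'),
                 (244, 'David', 'Bowie'),
                 (602, 'David', 'Kelly')])

# The same CASE-WHEN construct as Code 30.39.
rows = cur.execute("""
    SELECT lastname,
           CASE lastname
               WHEN 'Kelly' THEN 'Irish'
               WHEN 'Niven' THEN 'English'
               ELSE 'Dunno'
           END AS Fun
    FROM actors WHERE firstname='David' ORDER BY aid""").fetchall()
print(rows)  # [('Niven', 'English'), ('Bowie', 'Dunno'), ('Kelly', 'Irish')]
con.close()
```
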

This example lists the aid and last names of the actors whose first name is ‘David’. If their aid is greater than 500 then print ‘Late’, otherwise print ‘Early’. The result is shown in Code 30.40, and once again the AS command is used to alter the heading in the printout in line 7.

30.12.3 The IFNULL Statement

The IFNULL statement has the format

IFNULL(expr1,expr2)

which returns expr1 if expr1 is not NULL. If expr1 is NULL then expr2 is returned.
Two examples are shown in Code 30.41.

30.12.4 Natural Language Comparisons

The example of a full-text comparison is shown through a few pieces of script, all of which are from [MyS, ]. The first, in Code 30.42, creates a table; attention should be drawn to line 5, which declares a FULLTEXT index over two of the user-defined fields.

Code 30.40 Using IF.
1 mysql> SELECT aid, lastname,
2 IF (aid>500,’Late’,’Early’)
3 AS period
4 FROM actors
5 WHERE firstname = ’David’;
6 +-----+------------+--------+
7 | aid | lastname | period |
8 +-----+------------+--------+
9 | 225 | Niven | Early |
10 | 244 | Bowie | Early |
11 | 339 | Suchet | Early |
12 | 486 | Carradine | Early |
13 | 519 | Keith | Late |
14 | 552 | Straithern | Late |
15 | 602 | Kelly | Late |
16 +-----+------------+--------+
17 7 rows in set (0.02 sec)
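For readers more comfortable in Python, IF(expr1,expr2,expr3) behaves like Python's conditional expression. A small sketch (the sql_if helper is invented here for illustration):

```python
# MySQL's IF(expr1, expr2, expr3) mirrors Python's conditional
# expression: expr2 if expr1 else expr3.
def sql_if(expr1, expr2, expr3):
    return expr2 if expr1 else expr3

# The same test as Code 30.40, applied to two aids.
print(sql_if(519 > 500, 'Late', 'Early'))  # Late
print(sql_if(225 > 500, 'Late', 'Early'))  # Early
```
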

Code 30.41 Using IFNULL.


1 mysql> SELECT IFNULL(1, ’hi’ );
2 +------------------+
3 | IFNULL(1, ’hi’ ) |
4 +------------------+
5 | 1 |
6 +------------------+
7 1 row in set (0.02 sec)
8

9 mysql> SELECT IFNULL(NULL, ’hi’ );


10 +---------------------+
11 | IFNULL(NULL, ’hi’ ) |
12 +---------------------+
13 | hi |
14 +---------------------+
15 1 row in set (0.02 sec)
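IFNULL is also available in SQLite, so the two examples above can be reproduced from Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# IFNULL returns the first argument unless it is NULL,
# in which case the second argument is returned.
a, b = cur.execute("SELECT IFNULL(1, 'hi'), IFNULL(NULL, 'hi')").fetchone()
print(a, b)  # 1 hi
con.close()
```
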

Code 30.42 The FULLTEXT operator.
1 mysql> CREATE TABLE articles (
2 id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
3 title VARCHAR(200),
4 body TEXT,
5 FULLTEXT (title,body)
6 ) ENGINE=MyISAM;

Code 30.43 Load data.


1 mysql> INSERT INTO articles (title,body) VALUES
2 (’MySQL Tutorial’,’DBMS stands for DataBase ...’),
3 (’How To Use MySQL Well’,’After you went through a ...’),
4 (’Optimizing MySQL’,’In this tutorial we will show ...’),
5 (’1001 MySQL Tricks’,’1. Never run mysqld as root. 2. ...’),
6 (’MySQL vs. YourSQL’,
7 ’In the following database comparison ...’),
8 (’MySQL Security’,’When configured properly, MySQL ...’);

The second step in Code 30.43 loads in some data.


The third step is shown in Code 30.44 where the MATCH-AGAINST construct is em-
ployed. The match is performed on the index that was defined in Code 30.42. This
command performs a natural language search and the results are returned in order of
relevance.

Code 30.44 Using MATCH-AGAINST.


1 mysql> SELECT * FROM articles
2 WHERE MATCH (title,body) AGAINST (’database’);
3 +----+-------------------+------------------------------------------+
4 | id | title | body |
5 +----+-------------------+------------------------------------------+
6 | 5 | MySQL vs. YourSQL | In the following database comparison ... |
7 | 1 | MySQL Tutorial | DBMS stands for DataBase ... |
8 +----+-------------------+------------------------------------------+
9 2 rows in set (0.00 sec)

A natural language search with query expansion uses the WITH QUERY EXPANSION modifier, as shown in Code 30.45. The search is again on the word ‘database’, but a second search is performed that uses words returned from the first search as the query. In this case the first search returned rows containing MySQL, which is responsible for the third item returned.

Code 30.45 Using QUERY-EXPANSION.
1 mysql> SELECT * FROM articles
2 -> WHERE MATCH (title,body)
3 -> AGAINST (’database’ WITH QUERY EXPANSION);
4 +----+-------------------+------------------------------------------+
5 | id | title | body |
6 +----+-------------------+------------------------------------------+
7 | 1 | MySQL Tutorial | DBMS stands for DataBase ... |
8 | 5 | MySQL vs. YourSQL | In the following database comparison ... |
9 | 3 | Optimizing MySQL | In this tutorial we will show ... |
10 +----+-------------------+------------------------------------------+

30.13 Problems

1. Write a single MySQL command that will return the name of the movie that has
mid = 300.

2. Write a single MySQL command that returns the highest grade of the two movies
with mid = 300 or mid = 301.

3. Write a single MySQL command that returns the lowest grade of movies from the
1960’s.

4. Write a single MySQL command that returns the names and years of the movie with
the lowest grade from the 1960’s. (Use the result from the previous problem.)

5. Write a single MySQL command that returns the number of Harry Potter movies
that are in the database.

6. Write a single MySQL command that returns the number of movies that have the
language with lid = 6.

7. Write a single MySQL command that returns the first and last names of the actors
who have a last name that begins with ‘As’.

8. Write a single MySQL command that returns the first and last name of the actors
that have the same last three letters in their first and last names. (Example, Slim
Jim has the same last three letters in the first and last names.)

Chapter 31

Queries with Multiple Tables

The previous queries captured information from a single table. This chapter will consider
queries that require multiple tables.

31.1 Schema and Linking Tables

The query Q19 starts with the actor’s aid and requests the titles of the movies that this
actor has been in. The aid information is contained in the isin and actors tables while the
title information is stored in the movies table. Thus, the query must involve more than
one table.
The database schema, or the design of the tables, contains equivalent fields in mul-
tiple tables. For this query, it is important to note that the isin table and the movies table
both contain the movie mid value. In this schema, both fields are also named mid but this
is a convenience rather than a requirement.

31.1.1 Schema

A properly designed schema will allow the user to create queries to answer all needed
questions. Often the design of the schema begins with a collection of the questions that
are expected to be asked of the database.
The schema for the movies database is shown in Figure 31.1. The fields of each table
are listed. The primary key is the first entry and denoted by a symbol. The lines between
the tables show the fields that are common with the tables. In this view it is possible to
see that all tables are connected and so it is possible to start with any type of information
and pursue the answer that is in another table. If the query were to find the actors that
were in movies from a certain country, this schema figure shows that the query would need
to route through the country, incountry, isin and actors tables. This information would
then lead to the construction of the query.

Figure 31.1: The movies schema.

31.1.2 Linking Tables

The easy method of linking tables is to simply include the tables in the query and have
a condition that equates their common fields. This may not be the most efficient means,
but it is a good place to start.
Query Q19 seeks the names of the movies starting with an actor’s aid. This requires
the use of the isin and movies table, where the common field is the mid value. The query
is shown in Code 31.1. Line 1 shows the values to be returned and the mid field now has
the table declaration. In this query there are two fields named mid and thus it is necessary
to declare which table is to be used for the return. The values are the same and so using
either movies.mid or isin.mid produces the same answer. The second field returned is the movie title, and there is no ambiguity as to which table this field resides in.
Line 2 lists the two tables involved in this query separated by a comma. Multiple
tables can be listed, but in this case only two are needed. Line 3 links the two tables
together. This line indicates that the values in the movies.mid field are the same values
as in the isin.mid. Line 4 finishes the conditions of the query by indicating that the aid
= 281.
Query Q20 takes this concept one step further by starting with the actor’s name
instead of an aid value. According to the schema shown in Figure 31.1 it is necessary
to start with the actors table, progress through the isin table and finish with the movies
table. Thus, three tables are involved as shown in line 2 of Code 31.2. Line 3 connects
the movies table to the isin table and the isin table to the actors table. Line 4 finishes

Code 31.1 A query using two tables.
1 mysql> SELECT movies.mid, title
2 FROM movies, isin
3 WHERE movies.mid=isin.mid
4 AND isin.aid=281;
5 +-----+--------------------------+
6 | mid | name |
7 +-----+--------------------------+
8 | 44 | For the Love of the Game |
9 | 229 | 9 |
10 | 347 | Shadows and Fog |
11 | 554 | A Prairie Home Companion |
12 +-----+--------------------------+

the conditions.
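The two-table linking of Code 31.1 can be rehearsed from Python with the sqlite3 module on a made-up miniature of the movies and isin tables. The aid 281 and the first two titles are taken from the output above; the third movie and second actor are invented filler:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Minimal stand-ins for the movies and isin tables.
cur.execute("CREATE TABLE movies (mid INTEGER, title TEXT)")
cur.execute("CREATE TABLE isin (aid INTEGER, mid INTEGER)")
cur.executemany("INSERT INTO movies VALUES (?,?)",
                [(44, 'For the Love of the Game'), (229, '9'), (1, 'Other')])
cur.executemany("INSERT INTO isin VALUES (?,?)",
                [(281, 44), (281, 229), (99, 1)])

# Link the two tables by equating their common mid fields.
rows = cur.execute("""
    SELECT movies.mid, title FROM movies, isin
    WHERE movies.mid = isin.mid AND isin.aid = 281
    ORDER BY movies.mid""").fetchall()
print(rows)  # [(44, 'For the Love of the Game'), (229, '9')]
con.close()
```
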

31.1.3 Combined with Functions

The functions shown in previous queries are available in queries that use multiple tables.
Query Q21 requests the average grade for John Goodman movies. This is similar to the
previous query in that it is necessary to convert the actor’s name to an aid, convert that to multiple mid values, and finally convert those to movie titles. The only real difference is line 1, as shown in Code 31.3.
Query Q22 requests the movies that are in French. This requires the langs, inlang
and movies tables. Structurally, the query is similar to the previous and the query is
shown in Code 31.4.
Query Q23 seeks the languages of the Peter Falk movies. This query requires four
tables to travel from the actors table to the langs table. The query also requires that each
language be listed only once. The query is shown in Code 31.5 in which line 1 uses the
DISTINCT function to prevent multiple listings of any language. Line 2 lists the four
tables and lines 3 and 4 tie them together. Line 5 finishes the conditions of the query.

31.1.4 Using a Table Multiple Times

Query Q24 seeks the movies that have both Maggie Smith and Daniel Radcliffe. This is
an extension of a previous query that requested movies from a single actor, which itself
was an extension of Query Q19 that started with the aid and progressed to the movie
title using just two tables. The query path is diagrammed in Figure 31.2 which shows the
tables as ovals.
The issue with Query Q24 is that the same tables are used multiple times. The

Code 31.2 A query using three tables.
1 mysql> SELECT movies.mid, title
2 FROM movies, isin, actors
3 WHERE movies.mid=isin.mid AND isin.aid=actors.aid
4 AND actors.firstname=’John’ AND actors.lastname=’Goodman’;
5 +-----+-----------------------------------------------------+
6 | mid | name |
7 +-----+-----------------------------------------------------+
8 | 78 | Monsters Inc. |
9 | 95 | Raising Arizona |
10 | 88 | O Brother, Where Are Thou |
11 | 119 | The Big Lebowski |
12 | 278 | Revenge of the Nerds |
13 | 291 | The Flintstones |
14 | 435 | Marilyn Hotchkiss’ Ballroom Dancing & Charm School |
15 | 661 | Matinee |
16 | 682 | True Stories |
17 | 779 | The Artist |
18 +-----+-----------------------------------------------------+

Code 31.3 The average grade for John Goodman.


1 mysql> SELECT AVG(grade)
2 FROM movies,isin,actors
3 WHERE movies.mid=isin.mid AND isin.aid=actors.aid
4 AND actors.firstname=’John’ AND actors.lastname=’Goodman’;
5 +------------+
6 | AVG(grade) |
7 +------------+
8 | 6.9000 |
9 +------------+

Figure 31.2: A query involving two tables.

Code 31.4 Movies in French.
1 mysql> SELECT movies.mid,title
2 FROM movies, inlang, langs
3 WHERE movies.mid=inlang.mid AND inlang.lid=langs.lid
4 AND langs.language=’French’;
5 +-----+------------------------------------------+
6 | mid | name |
7 +-----+------------------------------------------+
8 | 14 | Blame it on Fidel |
9 | 54 | Hotel Rwanda |
10 | 60 | Jesus of Montreal |
11 | 80 | Munich |

Code 31.5 Languages of Peter Falk movies.


1 mysql> SELECT DISTINCT(language)
2 FROM langs,inlang,actors,isin
3 WHERE langs.lid=inlang.lid AND inlang.mid=isin.mid
4 AND isin.aid=actors.aid
5 AND firstname=’Peter’ AND lastname=’Falk’;
6 +----------+
7 | language |
8 +----------+
9 | English |
10 | German |
11 +----------+

actors and isin tables are used for the Maggie Smith component of the query and then
again for the Daniel Radcliffe component. Determining the mid values for Maggie Smith
is independent of the search for the mid values for Daniel Radcliffe. It is only after the
mid values for both actors have been collected that they are combined. Thus, the use of
the actors and isin tables for Maggie Smith are used independently of those used in the
Daniel Radcliffe. Basically, the query requires two distinct uses of the same tables.
Query Q25 extends this one step further as it searches for the other actors that are
in the same movies as Radcliffe and Smith. The mid values in common with these two
actors use isin and actors a third time to get names of other actors. This query uses these
two tables three independent times in the query. The flow of this query is shown in Figure
31.3.

Figure 31.3: Actors in movies with two named actors.

Multiple uses of the same tables are handled by renaming the instances of the tables with different labels. In Query Q24, in the Daniel Radcliffe portion of the query, the instances of the isin and actors tables are renamed i1 and a1 respectively. Likewise, the Maggie Smith portion of the query uses tables named i2 and a2.
This query is shown in Code 31.6. Line 2 creates the two instances of these tables
along with the movies table which is needed to retrieve the movie titles. Line 3 connects
the movies table to the two instances of the isin table. Line 4 connects the isin tables
to their respective actors tables. The last two lines create the condition for the actor’s
names.
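The renaming device can be tried on a toy schema from Python. This sketch collapses the actor's name into a single column and uses invented aid and mid values; only mid 184 is shared by both actors:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Simplified stand-ins: one name column, made-up ids.
cur.execute("CREATE TABLE actors (aid INTEGER, name TEXT)")
cur.execute("CREATE TABLE isin (aid INTEGER, mid INTEGER)")
cur.executemany("INSERT INTO actors VALUES (?,?)",
                [(1, 'Radcliffe'), (2, 'Smith')])
cur.executemany("INSERT INTO isin VALUES (?,?)",
                [(1, 184), (1, 185), (2, 184), (2, 186)])

# Two aliases (i1, i2) give two independent passes over the same isin
# table; equating their mid fields keeps only the shared movies.
rows = cur.execute("""
    SELECT i1.mid
    FROM isin AS i1, actors AS a1, isin AS i2, actors AS a2
    WHERE i1.aid = a1.aid AND i2.aid = a2.aid AND i1.mid = i2.mid
      AND a1.name = 'Radcliffe' AND a2.name = 'Smith'""").fetchall()
print(rows)  # [(184,)]
con.close()
```
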
In Q25, three instances of the actors table are used as shown in Figure 31.3. Small
numbers are placed next to the table names to indicate which instance is being used.
Numbers above the attribute names are used just for referencing here in the text. On
the left in circles 1 and 4 are the names of the two target actors. These are converted to
their aid numbers, which are converted to their lists of movies in circles 3 and 6. These are combined with an intersection so that circle 7 is the list of mids of the movies that had both actors.
Circle 8 contains the aids of all actors in those movies and their names are revealed in
circle 9.
Consider the transition from circle 1 to circle 2. In this step the name Daniel
Radcliffe is converted into an aid using the actors table. The query is shown in Code 31.7
and as seen his aid is 238. A similar query is performed from Maggie Smith to reveal that

Code 31.6 Movies common to Daniel Radcliffe and Maggie Smith.
1 mysql> SELECT movies.mid, title
2 FROM movies,isin AS i1, isin AS i2, actors AS a1, actors AS a2
3 WHERE movies.mid=i1.mid AND movies.mid=i2.mid
4 AND i1.aid=a1.aid AND i2.aid=a2.aid
5 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
6 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
7 +-----+------------------------------------------+
8 | mid | name |
9 +-----+------------------------------------------+
10 | 184 | Harry Potter and the Sorcerer’s Stone |
11 | 186 | Harry Potter and the Prisoner of Azkaban |
12 | 187 | Harry Potter and the Goblet of Fire |
13 +-----+------------------------------------------+

her aid is 237.

Code 31.7 Radcliffe’s aid.


1 mysql> SELECT aid FROM actors
2 WHERE firstname=’Daniel’ AND lastname=’Radcliffe’;
3 +-----+
4 | aid |
5 +-----+
6 | 238 |
7 +-----+
8 1 row in set (0.14 sec)

The next step is to use the isin table to convert the aid into a list of mids for the
movies that Radcliffe has been in. This requires the use of both the actors and the isin
tables and the query is shown in Code 31.8.
A similar query can be performed for Maggie Smith and the mids for both will be
combined in circle 7. In order for this to occur it will be necessary to perform two searches
on the actors and isin tables. These two individual searches are performed by giving each table two different aliases. First, consider the renaming of the tables for just the Radcliffe portion of the query, which is shown in Code 31.9.
The small numbers in the rectangles in Figure 31.3 coincide with the renaming of
the tables. The rectangle that has ‘actors 1’ is a1 in the query.
The next step is to duplicate this query for Maggie Smith and using i2 and a2
instead of i1 and a1. These two queries must then be combined such that only those mids
that are in common survive. The query is shown in Code 31.10 with line 4 isolating the
common mids.

Code 31.8 Radcliffe’s mid.
1 mysql> SELECT mid FROM isin, actors
2 WHERE isin.aid=actors.aid
3 AND actors.firstname=’Daniel’ AND actors.lastname=’Radcliffe’;
4 +------+
5 | mid |
6 +------+
7 | 184 |
8 | 185 |
9 | 186 |
10 | 187 |
11 | 400 |
12 +------+
13 5 rows in set (0.57 sec)

Code 31.9 Radcliffe’s mid with renaming.


1 mysql> SELECT i1.mid FROM isin AS i1, actors AS a1
2 WHERE i1.aid=a1.aid
3 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’;
4 +------+
5 | mid |
6 +------+
7 | 184 |
8 | 185 |
9 | 186 |
10 | 187 |
11 | 400 |
12 +------+
13 5 rows in set (0.06 sec)

Code 31.10 The mids with both Smith and Radcliffe.
1 mysql> SELECT i1.mid, i2.mid
2 FROM isin AS i1, actors AS a1, isin AS i2, actors as a2
3 WHERE i1.aid=a1.aid AND i2.aid=a2.aid
4 AND i1.mid=i2.mid
5 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
6 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
7 +------+------+
8 | mid | mid |
9 +------+------+
10 | 184 | 184 |
11 | 186 | 186 |
12 | 187 | 187 |
13 +------+------+
14 3 rows in set (1.47 sec)

Line 1 selects the mids from both actors and as shown in the output only one was
really necessary. Line 2 creates two names for each table with a1 and i1 being used for
Radcliffe and a2 and i2 being used for Smith. Line 3 connects the aid attribute for each
pair (a1,i1) and (a2,i2). Line 4 connects the two isin tables which will perform the
intersection necessary to get to circle 7. Lines 5 and 6 create the targets and the results
are shown starting in line 7. As seen there are 3 such movies.
The next step is to convert those mids to aids of all of the actors that are in those
movies. This will require a third query through the isin table. The query is shown in
Code 31.11 which will show the actor’s aid and the mid of the movies.
Line 2 adds the isin AS i3 component which will be used to convert mids to aids.
The linkage is made in line 3 which connects the mid of the third isin table with the mid
of the first isin table. In this case the second isin table could have been used instead of
the first. The rest of the query is the same. The results show the aid of each actor in each of the three movies.
The only result that is needed is the aids of the actors and duplicates are not desired.
So, the query is modified slightly in Code 31.12 to extract just the aids and to use the
DISTINCT keyword to remove the duplicates. The results are single instances of the aids
of the actors that were in movies with Smith and Radcliffe. This is circle 8 in Figure 31.3.
The final step is easy and that is to convert the aids to names. However, this
requires another query through the actors table to convert their aids back to their names.
Query Q25 is completed in Code 31.13. Line 1 requests information from the third instance
of the actors table. Lines 2 and 3 define the instances of the tables, lines 4 through 6 tie
the tables together. Lines 7 and 8 set the search conditions and the results are shown
below.

Code 31.11 The aid of other actors.
1 mysql> SELECT i3.aid, i1.mid
2 FROM isin AS i1, actors AS a1, isin AS i2, actors as a2, isin AS i3
3 WHERE i3.mid=i1.mid
4 AND i1.aid=a1.aid AND i2.aid=a2.aid AND i1.mid=i2.mid
5 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
6 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
7 +------+------+
8 | aid | mid |
9 +------+------+
10 | 236 | 184 |
11 | 237 | 184 |
12 | 222 | 184 |
13 | 238 | 184 |
14 | 128 | 184 |
15 | 228 | 184 |
16 | 680 | 184 |
17 | 240 | 186 |
18 | 222 | 186 |
19 | 237 | 186 |
20 | 228 | 186 |
21 | 238 | 186 |
22 | 680 | 186 |
23 | 222 | 187 |
24 | 237 | 187 |
25 | 228 | 187 |
26 | 238 | 187 |
27 | 680 | 187 |
28 +------+------+
29 18 rows in set (1.40 sec)

Code 31.12 Unique actors.
1 mysql> SELECT DISTINCT(i3.aid)
2 FROM isin AS i1, actors AS a1, isin AS i2, actors as a2, isin AS i3
3 WHERE i1.aid=a1.aid AND i2.aid=a2.aid AND i1.mid=i2.mid AND i1.mid=i3.mid
4 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
5 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
6 +------+
7 | aid |
8 +------+
9 | 236 |
10 | 237 |
11 | 222 |
12 | 238 |
13 | 128 |
14 | 228 |
15 | 680 |
16 | 240 |
17 +------+
18 8 rows in set (1.59 sec)

Code 31.13 Actors common to movies with Daniel Radcliffe and Maggie Smith.
1 mysql> SELECT DISTINCT a3.firstname, a3.lastname
2 FROM movies, isin AS i1, isin AS i2, isin AS i3,
3 actors AS a1, actors AS a2, actors AS a3
4 WHERE i1.mid=movies.mid AND i1.aid=a1.aid
5 AND i2.mid=movies.mid AND i2.aid=a2.aid
6 AND i3.mid=movies.mid AND i3.aid=a3.aid
7 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
8 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
9 +-----------+-----------+
10 | firstname | lastname |
11 +-----------+-----------+
12 | Richard | Harris |
13 | Maggie | Smith |
14 | Robbie | Coltrane |
15 | Daniel | Radcliffe |
16 | Ed | Harris |
17 | Alan | Rickman |
18 | Emma | Thompson |
19 | Warwick | Davis |
20 +-----------+-----------+

31.2 Joining Tables

Linking tables can certainly be performed as shown in the previous section, but as the
queries become more complicated it is important to consider the efficiency of the query.
Just as in programming, if the query statement is poorly constructed then the search can
be very inefficient. The solution is to link the tables through the JOIN command. There
are four main types of joins:

1. INNER JOIN or JOIN


2. LEFT JOIN
3. RIGHT JOIN
4. OUTER JOIN

The INNER JOIN is the same as JOIN and works similarly to the commands in the
previous section. Code 31.14 shows the commands that use the JOIN-ON construct. The
first table is listed and then the second table is listed after ON with the attributes that are
linked.
Code 31.14 The mids for Cary Grant.
1 mysql> SELECT isin.mid
2 FROM isin JOIN actors ON isin.aid=actors.aid
3 WHERE firstname=’Cary’
4 AND lastname=’Grant’;
5 +------+
6 | mid |
7 +------+
8 | 83 |
9 | 267 |
10 | 297 |
11 | 298 |
12 | 343 |
13 | 387 |
14 | 267 |
15 | 432 |
16 +------+
17 8 rows in set (0.16 sec)

The inner join is shown pictorially in Figure 31.4 which shows data from two tables
A and B. The data that is returned is the shaded area. In the ongoing example these are
the actors and isin tables and the shaded area includes those entries that have the same
aid.
The database currently has three movies that have the substring ‘under’ in the title
as shown in Code 31.15. Consider another query shown in Code 31.16 which uses JOIN

Figure 31.4: Inner join.

and links the mids from the movies and isin tables.

Code 31.15 The titles with ‘under’.


1 mysql> SELECT title FROM movies
2 WHERE title LIKE ’%under%’;
3 +----------------------+
4 | title |
5 +----------------------+
6 | Tropic Thunder |
7 | Under the Tuscan Sun |
8 | Under the Bombs |
9 +----------------------+
10 3 rows in set (0.05 sec)

There are two major differences in the output. First, the movie Under the Bombs is
not listed, and second the other two movies are listed multiple times. The movie Tropic
Thunder is listed six times because there are six actors associated with this movie in the
isin table. Likewise, Under the Tuscan Sun has two entries because two actors are listed
in isin. Each returned tuple is unique because the isin.isid attribute is unique.

31.2.1 Left Join

A LEFT JOIN is shown pictorially in Figure 31.5 which shows that this query will include
items from table A and items that are in both A and B.

Figure 31.5: Left join.

The query is shown in Code 31.17 which replaces JOIN with LEFT JOIN. As seen
there is now a new entry for the movie Under the Bombs. It was excluded from Code

Code 31.16 Inner join with multiple returns.
1 mysql> SELECT m.mid, m.title, i.mid, isid FROM movies AS m
2 JOIN isin AS i ON m.mid=i.mid
3 WHERE m.title LIKE ’%Under%’;
4 +-----+----------------------+------+------+
5 | mid | title | mid | isid |
6 +-----+----------------------+------+------+
7 | 160 | Tropic Thunder | 160 | 285 |
8 | 160 | Tropic Thunder | 160 | 286 |
9 | 160 | Tropic Thunder | 160 | 287 |
10 | 160 | Tropic Thunder | 160 | 288 |
11 | 160 | Tropic Thunder | 160 | 289 |
12 | 160 | Tropic Thunder | 160 | 290 |
13 | 324 | Under the Tuscan Sun | 324 | 963 |
14 | 324 | Under the Tuscan Sun | 324 | 964 |
15 +-----+----------------------+------+------+
16 8 rows in set (0.03 sec)

31.16 because there are no actors for this movie listed in the table isin. However, in the
LEFT JOIN query the movie is included because it is in table A.
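The effect of LEFT JOIN can be reproduced from Python with sqlite3 on a reduced version of the tables (two movies and one isin entry, drawn from the output above). The movie with no entry in isin comes back padded with NULL, which appears as None in Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE movies (mid INTEGER, title TEXT)")
cur.execute("CREATE TABLE isin (aid INTEGER, mid INTEGER)")
cur.executemany("INSERT INTO movies VALUES (?,?)",
                [(160, 'Tropic Thunder'), (491, 'Under the Bombs')])
cur.executemany("INSERT INTO isin VALUES (?,?)", [(88, 160)])

# LEFT JOIN keeps every movie; a movie with no actors in isin
# is padded with NULL on the right side.
rows = cur.execute("""
    SELECT m.title, i.aid FROM movies AS m
    LEFT JOIN isin AS i ON m.mid = i.mid
    ORDER BY m.mid""").fetchall()
print(rows)  # [('Tropic Thunder', 88), ('Under the Bombs', None)]
con.close()
```
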

31.2.2 Right Join

The RIGHT JOIN has a similar concept as does the LEFT JOIN excepting which table is
fully included. The pictorial representation is shown in Figure 31.6. In this example the
RIGHT JOIN does not have a different output from the JOIN because every entry in the
isin table has an associated movie.

Figure 31.6: Right join.

31.2.3 Other Joins

An OUTER JOIN would include all entries from all tables even if there are entries in one
table that have no associations in the other table. The logic is shown in Figure 31.7.
Other types of joins are shown in the following images and codes.[Moffatt, 2009] The

Code 31.17 Left join with multiple returns.
1 mysql> SELECT m.mid, m.title, i.mid, isid FROM movies AS m
2 LEFT JOIN isin AS i ON m.mid=i.mid
3 WHERE m.title LIKE ’%Under%’;
4 +-----+----------------------+------+------+
5 | mid | title | mid | isid |
6 +-----+----------------------+------+------+
7 | 160 | Tropic Thunder | 160 | 285 |
8 | 160 | Tropic Thunder | 160 | 286 |
9 | 160 | Tropic Thunder | 160 | 287 |
10 | 160 | Tropic Thunder | 160 | 288 |
11 | 160 | Tropic Thunder | 160 | 289 |
12 | 160 | Tropic Thunder | 160 | 290 |
13 | 324 | Under the Tuscan Sun | 324 | 963 |
14 | 324 | Under the Tuscan Sun | 324 | 964 |
15 | 491 | Under the Bombs | NULL | NULL |
16 +-----+----------------------+------+------+
17 9 rows in set (0.06 sec)

Figure 31.7: Outer join.

left excluding join includes those items in A but not in B and the code is shown in Code
31.18.

(a) Left excluding join (b) Right excluding join (c) Outer excluding join

Figure 31.8: Other joins.

Code 31.18 Left excluding joins.[Moffatt, 2009]


1 SELECT <select_list>
2 FROM Table_A A
3 LEFT JOIN Table_B B
4 ON A.Key = B.Key
5 WHERE B.Key IS NULL
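The left excluding join pattern of Code 31.18 is directly runnable. Here is a small sqlite3 instance of it with made-up data, returning only the rows of table A (movies) that have no match in table B (isin):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE movies (mid INTEGER, title TEXT)")
cur.execute("CREATE TABLE isin (aid INTEGER, mid INTEGER)")
cur.executemany("INSERT INTO movies VALUES (?,?)",
                [(160, 'Tropic Thunder'), (491, 'Under the Bombs')])
cur.executemany("INSERT INTO isin VALUES (?,?)", [(88, 160)])

# Left excluding join: movies with no entry at all in isin.
rows = cur.execute("""
    SELECT m.title FROM movies AS m
    LEFT JOIN isin AS i ON m.mid = i.mid
    WHERE i.mid IS NULL""").fetchall()
print(rows)  # [('Under the Bombs',)]
con.close()
```
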

Query Q26 is to return a list of the movies with each actor’s aid. The query is
shown in Code 31.19 using the RIGHT JOIN.

31.2.4 Functional Dependencies

A functional dependency occurs when tuples contain elements that agree with other tuples.
Consider a case in which a table has several columns C1 to CN , and in this case some of
the elements agree across multiple tuples. For example, in some rows of the table there
are cases in which the first three columns agree. That means if R1 contains c1, c2, and c3
as values for the first three columns and if R2 has the same values then they agree.
A functional dependency occurs if for the same values of c1 and c2 there is only one
possible c3 . Basically, the value of the third column can be predicted by the values in the
first two columns. The determining columns act as a key of the relation; in the example these are the first two columns. A functional dependency for this case is written as C1, C2 → C3.
It then follows that if A → B and B → C then A → C.
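The definition can be checked mechanically: C1, C2 → C3 holds exactly when no two rows agree on the first two columns but differ in the third. A sketch in Python with invented data (the holds_fd helper is hypothetical, not part of any library):

```python
# Check whether the columns at the lhs indices functionally determine
# the column at the rhs index: each lhs value combination must map to
# exactly one rhs value.
def holds_fd(rows, lhs, rhs):
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        # setdefault records the first rhs value seen for this key;
        # any later disagreement breaks the dependency.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

table = [('a', 1, 'x'), ('a', 1, 'x'), ('b', 2, 'y')]
print(holds_fd(table, (0, 1), 2))  # True
table.append(('a', 1, 'z'))        # same (c1, c2), different c3
print(holds_fd(table, (0, 1), 2))  # False
```
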

31.3 Subqueries

A subquery is a query nested within another query. Much like nested for loops in Python,
the improper use of subqueries can lead to processes that consume far too much time and
resources. Subqueries should be used with care and only when necessary.

Code 31.19 The movie listed with each actor.
1 mysql> SELECT m.mid, m.title, i.aid
2 FROM movies AS m
3 RIGHT JOIN isin AS i ON m.mid=i.mid
4 WHERE m.title LIKE ’%Under%’;
5 +------+----------------------+------+
6 | mid | name | aid |
7 +------+----------------------+------+
8 | 160 | Tropic Thunder | 88 |
9 | 160 | Tropic Thunder | 94 |
10 | 160 | Tropic Thunder | 196 |
11 | 160 | Tropic Thunder | 197 |
12 | 160 | Tropic Thunder | 27 |
13 | 160 | Tropic Thunder | 164 |
14 | 324 | Under the Tuscan Sun | 466 |
15 | 324 | Under the Tuscan Sun | 479 |
16 | 160 | Tropic Thunder | 734 |
17 | 776 | Undertaking Betty | 270 |
18 | 776 | Undertaking Betty | 748 |
19 +------+----------------------+------+

Query Q19 sought the name of a movie for an actor with a given aid. Code 31.20
shows the same query but with the use of a subquery. Line 2 contains the subquery within
parenthesis which returns the mid values for a given actor. This returns multiple values
which are then used in the primary query.
When the results from a subquery are being used in a condition it is necessary to
assign an alias to the subquery. This subquery is in line 2 and renamed t, which is then
used in line 3 in a condition.
Efficient use of subqueries is a bit tricky to accomplish in complicated queries. The
user is highly encouraged to test each subquery to ensure that the response that they
expect is indeed the response that is returned.
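A subquery of the same shape can be tested in isolation from Python, as recommended above. The data here is a made-up fragment of the movies and isin tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE movies (mid INTEGER, title TEXT)")
cur.execute("CREATE TABLE isin (aid INTEGER, mid INTEGER)")
cur.executemany("INSERT INTO movies VALUES (?,?)",
                [(1, 'Back to the Future'), (2, 'Twenty Bucks'), (3, 'Other')])
cur.executemany("INSERT INTO isin VALUES (?,?)", [(12, 1), (12, 2), (7, 3)])

# The inner SELECT produces the list of mids for aid=12; the outer
# SELECT keeps the movies whose mid is IN that list.
rows = cur.execute("""
    SELECT title FROM movies WHERE mid IN
        (SELECT mid FROM isin WHERE aid = 12)
    ORDER BY mid""").fetchall()
print(rows)  # [('Back to the Future',), ('Twenty Bucks',)]
con.close()
```
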

31.4 Combinations

Queries Q27 and Q28 use multiple devices to retrieve the correct results. Q27 seeks the 5
actors that have been in the most movies. This is a simple enough query to understand
but complicated to achieve. It is necessary to count the number of movies that all actors
have been in before it is possible to find the top 5.
The query is shown in Code 31.22 where line 1 shows the items to be retrieved which
come from multiple tables. The last item is the count of movies which is assigned an alias

Code 31.20 The use of a subquery.
1 mysql> SELECT title FROM movies WHERE mid IN
2 (SELECT mid FROM isin WHERE aid=12);
3 +-------------------------------------+
4 | name |
5 +-------------------------------------+
6 | Back to the Future |
7 | Interstate 60: Episodes of the Road |
8 | Twenty Bucks |
9 | Who Framed Roger Rabbit |
10 | Addam’s Family Values |
11 | The Addams Family |
12 | The Dream Team |
13 | To Be or Not to Be |
14 | My Favorite Martian |
15 +-------------------------------------+

Code 31.21 Assigning an alias to a subquery.


1 mysql> SELECT * FROM
2 (SELECT year, AVG(grade) FROM movies GROUP BY year) AS t
3 WHERE t.year BETWEEN 1950 AND 1959;
4 +------+------------+
5 | year | AVG(grade) |
6 +------+------------+
7 | 1950 | 5.0000 |
8 | 1951 | 7.0000 |
9 | 1953 | 3.0000 |
10 | 1954 | 8.5000 |
11 | 1955 | 6.0000 |
12 | 1956 | 6.6667 |
13 | 1957 | 9.5000 |
14 | 1958 | 7.0000 |
15 | 1959 | 7.5000 |
16 +------+------------+

because this count is used later in the query. Line 2 lists the two tables and connects
them. Line 3 uses GROUP BY to collect the counts for each actor. Line 4
then orders the returned data and uses LIMIT to print out just the top five.

Code 31.22 The top 5 actors in terms of number of appearances.


1 mysql> SELECT actors.aid, firstname,lastname, COUNT(mid) AS c
2 FROM actors, isin WHERE actors.aid=isin.aid
3 GROUP BY aid,firstname,lastname
4 ORDER BY c DESC LIMIT 5;
5 +-----+-----------+-----------+----+
6 | aid | firstname | lastname | c |
7 +-----+-----------+-----------+----+
8 | 530 | Alfred | Hitchcock | 23 |
9 | 56 | Woody | Allen | 17 |
10 | 26 | Dan | Aykroyd | 15 |
11 | 122 | Steve | Martin | 15 |
12 | 9 | Tom | Hanks | 13 |
13 +-----+-----------+-----------+----+

Query Q28 seeks the actors with the best average score, with the condition that the
actors have been in more than five movies. Code 31.23 shows the query, where once again
the returned columns come from multiple tables and include the average function. Lines
2 and 3 declare which tables are used and how they are connected. Line 4 groups the
data but adds the condition that the number of movies must exceed five. Line 5 orders
the return and limits the results.

Code 31.23 The actors with the best average scores.


1 mysql> SELECT actors.aid, firstname,lastname, AVG(grade) AS g
2 FROM actors, isin, movies
3 WHERE actors.aid=isin.aid AND movies.mid=isin.mid
4 GROUP BY actors.aid,firstname,lastname HAVING COUNT(movies.mid)>5
5 ORDER BY g DESC LIMIT 5;
6 +-----+---------------+----------+--------+
7 | aid | firstname | lastname | g |
8 +-----+---------------+----------+--------+
9 | 643 | Edward G. | Robinson | 8.0000 |
10 | 6 | James (Jimmy) | Stewart | 7.8571 |
11 | 219 | Judi | Dench | 7.6667 |
12 | 135 | Danny | DeVito | 7.4286 |
13 | 338 | Jack | Warden | 7.1667 |
14 +-----+---------------+----------+--------+
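The GROUP BY ... HAVING ... ORDER BY ... LIMIT pattern in these two queries can be exercised on a toy table. The sketch below uses Python's built-in sqlite3 module (an assumption; the book's server is MySQL, but the clauses behave the same) with made-up isin rows: actor 1 appears in three movies, actor 2 in two, and actor 3 in one.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE isin (mid INTEGER, aid INTEGER)')
cur.executemany('INSERT INTO isin VALUES (?,?)',
                [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 3)])

# Count movies per actor, keep only actors with more than one
# appearance, then order by the count and keep the top entry.
cur.execute('''SELECT aid, COUNT(mid) AS c FROM isin
               GROUP BY aid HAVING c > 1
               ORDER BY c DESC LIMIT 1''')
top = cur.fetchone()
print(top)   # (1, 3)
conn.close()
```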

31.5 Summary

The real power of database searches is the ability to combine information from multiple
tables. This may be a simple trace through a schema, or a query that involves multiple
instances of tables or subqueries. This chapter demonstrated single queries that retrieved
data that would be difficult to retrieve with a spreadsheet.

Problems

1. Write a single MySQL command that returns an alphabetical list of all of the movies
from 1985.
2. Retrieve the years of the movies starring Cary Grant.
3. Write a single MySQL command that returns the number of movies that are in
English.
4. Write a single MySQL command to return the average grade of movies in Spanish.
5. What is the average grade for movies with Elijah Wood?
6. In a single command, return the averages for movies with either Elijah Wood or
John Goodman.
7. How many movies was Dan Aykroyd in?
8. Write a single MySQL command that returns the first and last names of the actors
that are in the Harry Potter movies. This list should be alphabetically ordered by
last name and have no duplicates.
9. Write a single MySQL command to determine the name of the actor that was in the
most movies.
10. Write a single command to determine if Peter Falk was in a movie with a language
other than English.
11. In a single MySQL command return the names of the countries of the movies that
starred Pete Postlethwaite. The answer should have no duplicates.
12. Write a single MySQL command that displays the year and average grade for the
year with the highest average grade and at least 7 movies.
13. Write a single MySQL command that returns the names of the movies that have
both Steve Martin and Humphrey Bogart.
14. List the names of the actors that were in movies with a grade of 9 or better.
This list should be alphabetical according to the last name of the actor, and no actor
should be listed more than once.

Chapter 32

Connecting Python with MySQL

MySQL has the ability to search data and even has functions that iteratively consider
the data. However, languages such as Python are far more powerful in data analysis than
MySQL. Thus, it is prudent to connect the two systems. In this manner, Python
scripts can use MySQL to sift through the data stored in a database and then perform
complicated analysis on that data. It also allows a program to send several queries to the
database in an effort to obtain the desired information.

32.1 Connecting Python with MySQL

There are three basic steps in the process. The first is to connect to the database, the
second is to deliver a query statement to the database, and the third is to receive any data
that the database is returning. This section will review all three processes.

32.1.1 Making the Connection

There are several third-party tools that can be used to connect Python to MySQL. This is
the case with any language; programmers in Java and C++ also need to import
a tool that makes this connection.
The popular tool for Python 2.7 users has been mysqldb, which (at the time of writing
this chapter) is not available for Python 3.x. The popular tool for users of Python 3.x is
pymysql, which is included in packages such as Anaconda. The import statement is shown
in line 1 of Code 32.1.
There are four pieces of information that may be needed to connect to the
database: the name of the host machine (if different from the user's machine),
the name of the database, the name of the MySQL user, and the user's MySQL password.
These are established as strings, of which the first is shown in line 2; the creation of
the others is assumed in line 3. Line 4 makes the connection to the database using these
four variables. Note that in these scripts the password parameter is named passwd.
The next step is to define the cursor, which is the avenue by which Python
communicates with MySQL. This is performed in line 5. Finally, line 6 can be used to close
the connection.

Code 32.1 Creating the connection.

1 >>> import pymysql
2 >>> server = 'host.gmu.edu'
3 ...
4 >>> conn = pymysql.connect(host=server, user=user, passwd=passwd, db=db)
5 >>> cursor = conn.cursor()
6 >>> conn.close()

Now, the connection is made and the two systems are ready to communicate. The
next step is to send a MySQL command and receive the data.
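For scripts (as opposed to interactive sessions), it is good practice to guarantee that the connection is closed even when a query fails. The helper below is a minimal sketch of this idea; the function name and the placeholder credentials are invented for illustration, and pymysql is imported inside the function so the definition loads even on a machine without the package installed.

```python
def fetch_rows(host, user, passwd, db, query):
    # Lazy import: the module that defines this helper can still be
    # loaded on machines without pymysql (an assumption for portability).
    import pymysql
    conn = pymysql.connect(host=host, user=user, passwd=passwd, db=db)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()  # runs even if execute() raises

# Hypothetical call; the credentials are placeholders.
# rows = fetch_rows('host.gmu.edu', 'user', 'secret', 'moviedb',
#                   'SELECT * FROM movies')
```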

32.1.2 Queries from Python

Sending a query to the database and receiving the responses is quite easy. The pymysql
module has functions other than the ones shown in this section, but the ones
shown here are sufficient for many applications.
The process of sending a query is to create a string in Python that is the desired
MySQL command without the semicolon, and then to send that string via the cursor that
was created in line 5 of Code 32.1. Line 1 of Code 32.2 creates the string and line 2 sends
it to the database using the execute command. The value of n is the number of lines
returned by the query. There is a similar command named executemany, which is shown in
Code 32.3.

Code 32.2 Sending a query and retrieving a response.

1 >>> act = 'SELECT * FROM movies'
2 >>> n = cursor.execute(act)
3
4 >>> answ = cursor.fetchone()
5 >>> answ = cursor.fetchall()
6 >>> answ = cursor.fetchmany(n)

There are three common methods by which the data can be retrieved in Python. In
all cases each line of data is stored as a tuple, even if the returned data contains only a
single value: it will be a single value inside of a tuple. Line 4 uses the fetchone command
to retrieve one line of the MySQL return. Repeated uses of fetchone will retrieve consecutive
lines of the return. In a sense this command is all that is required; however, the other
two commands can provide convenience. Line 5 shows fetchall, which retrieves
all of the lines from the MySQL query into a tuple. Each line is also a tuple, so the
return from this command is a tuple that contains tuples. The fetchmany function is
similar except that the user specifies that only n lines are returned.
The variable answ from line 4 is a tuple. The number of items in the tuple is the
number of items returned by the query. From this point forward, the user
employs Python scripts to extract the data from the tuple and to further process the
information.
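Since each row arrives as a tuple, ordinary tuple unpacking is all that is needed afterwards. In the sketch below a hand-made tuple of tuples stands in for the return of cursor.fetchall() on the actors table (the names are made up):

```python
# Stand-in for answ = cursor.fetchall() after 'SELECT * FROM actors'.
answ = ((1, 'Leonardo', 'DiCaprio'),
        (2, 'Kate', 'Winslet'),
        (3, 'Billy', 'Zane'))

# Each row is a tuple; unpack it into a dictionary keyed by aid.
actors = {}
for aid, fname, lname in answ:
    actors[aid] = (fname, lname)

print(actors[1])   # ('Leonardo', 'DiCaprio')
```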

32.1.3 Altering the Database

There are MySQL commands that alter the content or tables of a database. These, too,
can be managed through the Python interface. However, there is a small commitment
that the user must perform in order for the changes to become permanent. Consider Code
32.3, which uses the executemany command to upload three changes to the database in
lines 1 through 5. If the user were to query the database they would see these changes.
However, if the user were to log out, the changes would be discarded. Line 6 shows
the commit function, which uses the connection created in line 4 of Code 32.1. This
makes the changes permanent.

Code 32.3 Committing changes.

1 >>> cursor.executemany(
2         "INSERT INTO persons VALUES (%s, %s, %s)",
3         [(1, 'John Smith', 'John Doe'),
4          (2, 'Jane Doe', 'Joe Dog'),
5          (3, 'Mike T.', 'Sarah H.')])
6 >>> conn.commit()

32.1.4 Multiple Queries

Once the cursor is created, several queries can be sent to the database as shown in Code
32.4. The cursor does not have to be reestablished after every query. It is important to
note that care should be exercised with multiple queries. It is possible for users to send
a large number of small queries to the database. If the database is on a server that is
far from the user then there is a time cost to receiving the data. Thus, a large
number of small queries can be a recipe for a slow-running program. Likewise, a full table
dump can also be expensive, as a large amount of data must travel across the network.
The rule of thumb is to minimize both the number of queries and the
amount of data to be retrieved. So, the user should attempt to perform as much pruning as
possible with the MySQL command. If the DBMS is on the same computer as the Python

Code 32.4 Sending multiple queries.

1 >>> conn = pymysql.connect(host=server, user=user, passwd=passwd, db=db)
2 >>> cursor = conn.cursor()
3 >>> act = 'SELECT * FROM movies WHERE mid=200'
4 >>> n = cursor.execute(act)
5 >>> answ = cursor.fetchone()
6 >>> act = 'SELECT * FROM movies WHERE mid=201'
7 >>> n = cursor.execute(act)
8 >>> answ = cursor.fetchone()

scripts, this issue lessens and the time required to retrieve the data is significantly
reduced.
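One common way to follow this rule is to replace a loop of single-row queries with one query that uses an IN clause. The sketch below only builds the command string; the %s placeholders are the ones pymysql substitutes when the list of values is passed as the second argument to execute (the table and column names are the movie database's):

```python
def batched_query(mids):
    # One query for many mid values instead of one query per value.
    # The resulting string can be passed to cursor.execute(act, mids).
    placeholders = ', '.join(['%s'] * len(mids))
    return 'SELECT * FROM movies WHERE mid IN (' + placeholders + ')'

act = batched_query([200, 201, 202])
print(act)   # SELECT * FROM movies WHERE mid IN (%s, %s, %s)
```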
Query Q29 is to compute the average grade for each decade. Creating a MySQL
command to compute the average grade of the movies in a single decade is not difficult.
The plan, then, is to have a Python loop build this command for each decade.
The command for one decade is shown in Code 32.5, where act is the string to be sent to
MySQL. The condition grade>0 excludes movies with an invalid grade.

Code 32.5 The average grade for one decade.

1 >>> act = 'SELECT AVG(grade) FROM movies WHERE year BETWEEN 1920 AND 1929 AND grade>0'
2 >>> cursor.execute(act)
3 1
4 >>> float(cursor.fetchone()[0])
5 7.5

Code 32.6 shows a solution to Q29. The for loop iterates through each decade. The
string act is similar to the previous one except that the years change. Each query is then sent
to the database and the answer is received and printed.

32.2 The Kevin Bacon Effect

The Kevin Bacon effect was discussed in Section 28.3. Basically, actors are in movies with
other actors, and the purpose is to find the connection from a given actor to Kevin Bacon.
For example, Johnny Depp and Dianne Wiest were in Edward Scissorhands, Wiest and
Steve Martin were in Parenthood, and Martin and Kevin Bacon were in Planes, Trains &
Automobiles. In this database, this is the shortest path from Depp to Bacon.
Computing the shortest path is performed by the Floyd-Warshall algorithm presented
in Section 26.3.2.

Code 32.6 Sending multiple queries.

1 >>> for i in range(1920, 2010, 10):
2         act = 'SELECT AVG(grade) FROM movies WHERE year BETWEEN '
3         act += str(i) + ' AND ' + str(i+9)
4         act += ' AND grade > 0'
5         n = cursor.execute(act)
6         f = float(cursor.fetchone()[0])
7         print(i, f)
8
9 1920 7.5
10 1930 7.8
11 1940 6.64
12 1950 6.6667
13 1960 6.0333
14 1970 5.9138
15 1980 5.5575
16 1990 5.9506
17 2000 6.0314

In this case all of the actor data will be needed, so the entire
table is downloaded and parsed. The process begins in Code 32.7 with two functions.
The first is Connect, which receives the MySQL host computer URL, the name of the
database, and the MySQL user name and password. It returns the connection to the database,
db, and the cursor, c. The second function is DumpActors, which returns all of the actor
names in a dictionary whose keys are the actors' aid values. This dictionary is returned
in line 15.
The second step is to create the connected graph via the function MakeG shown in
Code 32.8. The result is a binary-valued matrix G. The i-th row and the
i-th column correspond to the i-th actor. It should be noted that the first index in the matrix
is 0 and the first aid is 1; thus the row index and the aid currently differ by a value of 1. This
will change. The item G[i, j] is set to 1 if the actor corresponding to row i and the actor
corresponding to column j were in the same movie.
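The heart of MakeG is converting co-appearance information into a symmetric binary matrix. A stripped-down sketch with made-up pairs (indices already shifted so that the first actor is row 0):

```python
import numpy as np

# Made-up co-appearance pairs: (row actor, column actor).
pairs = [(0, 1), (1, 2), (0, 3)]
NA = 4                      # number of actors

G = np.zeros((NA, NA))
for i, j in pairs:
    G[i, j] = 1
    G[j, i] = 1             # co-appearance is symmetric

print(G.sum())              # 6.0: each pair sets two cells
```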
The third step is to apply the Floyd-Warshall algorithm as seen in Code 32.9. The
function RunFloyd calls the FastFloydP function, which returns two matrices that are
used to define the shortest path between any two entities and the distance of that path.
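FastFloydP itself is defined in an earlier chapter; for reference, a minimal (unvectorized) version of the distance half of Floyd-Warshall is sketched below, using the same 9999999 marker for missing edges as RunFloyd does:

```python
import numpy as np

def floyd_distances(GG):
    # Classic triple loop: relax every path i -> k -> j.
    f = GG.copy()
    N = len(f)
    for k in range(N):
        for i in range(N):
            for j in range(N):
                if f[i, k] + f[k, j] < f[i, j]:
                    f[i, j] = f[i, k] + f[k, j]
    return f

# Three nodes in a line: 0-1 and 1-2 are edges, 0-2 is not.
BIG = 9999999
GG = np.array([[0, 1, BIG],
               [1, 0, 1],
               [BIG, 1, 0]])
print(floyd_distances(GG)[0, 2])   # 2
```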
There are actors in this database that cannot be connected to Kevin Bacon. Basically,
there are several disconnected graphs. There is one large graph that contains most
of the actors, and then a few small graphs that tend to contain actors in movies outside of the
USA that just have not been connected to the big graph. These spurious actors need to
be removed from the matrices in order to proceed.

Code 32.7 The Connect and DumpActors functions.

1 # bacon.py
2 def Connect(h, d, u, p):
3     db = pymysql.connect(host=h, user=u, db=d, passwd=p)
4     c = db.cursor()
5     return db, c
6
7 def DumpActors(c):
8     act = 'SELECT * FROM actors'
9     c.execute(act)
10    dump = c.fetchall()
11    actors = {}
12    for i in dump:
13        aid, fname, lname = i
14        actors[aid] = fname, lname
15    return actors
16
17 >>> db, c = bacon.Connect(h, d, u, p)
18 >>> actors = bacon.DumpActors(c)

Identifying these actors is quite easy. The first actor in the database is Leonardo
DiCaprio, an actor who belongs to the large graph. The values of the matrix f
from the RunFloyd function indicate the distance of the shortest path. The first row of
this matrix has several values of 7 or less, which indicates that DiCaprio is connected to
those actors. Since most of the values are small, it is easy to conclude that DiCaprio belongs
to the one large graph. A few cells have the superficially large value of
9999999, which corresponds to actors that are not connected to DiCaprio and therefore do not
belong to the large graph. These are the actors that need to be removed from further
consideration. This is performed by the RemoveBad function shown in Code 32.10.
The inputs are the matrices f, G and p. The akeys is a list of the keys from the actors
dictionary (see line 24) and row is the index of the row to be used as the anchor. In this
case DiCaprio was the anchor, and as he is the first actor, row = 0.
This function creates a new G matrix, named G1, which is the graph without the
unconnected actors. So, this matrix is a bit smaller than the original. The function also
returns the matrix p.
Since some of the actors have been removed it is necessary to execute the RunFloyd
function again, as shown in Code 32.11. Line 2 finds the shortest path between entities 8
and 10. These are the rows in the matrices, not the aid values of the actors. Originally,
the row index and the actor aid were offset by a value of 1. However, since actors have
been removed, even this guide is no longer valid. As seen in line 5 of Code 32.11, the row
412 corresponds to the actor with aid = 421. The values returned by FindPath show that

Code 32.8 The MakeG function.

1 # bacon.py
2 def MakeG(c, actors):
3     NA = len(actors)
4     G = np.zeros((NA, NA))
5     for i in range(NA):
6         act = 'SELECT DISTINCT a2.aid FROM isin AS i1, isin AS i2, '
7         act += 'actors AS a2, actors AS a1 WHERE i1.aid=a2.aid '
8         act += 'AND i2.mid=i1.mid AND a1.aid=i2.aid AND a1.aid='
9         act += str(i+1)
10        c.execute(act)
11        ans = c.fetchall()
12        N = len(ans)
13        for j in range(N):
14            col = int(ans[j][0])
15            G[i, col-1] = 1
16    return G
17
18 >>> G = bacon.MakeG(c, actors)

Code 32.9 The RunFloyd function.

1 # bacon.py
2 def RunFloyd(G):
3     GG = G + (1-G)*9999999
4     ndx = np.indices(GG.shape)
5     pp = G * ndx[0]
6     f, p = floyd.FastFloydP(GG, pp)
7     return f, p
8
9 >>> f, p = bacon.RunFloyd(G)
10 >>> import scipy.misc as sm
11 >>> sm.imshow(G)

Code 32.10 The RemoveBad function.

1 # bacon.py
2 def RemoveBad(f, G, p, akeys, row):
3     hits = (f[row] > 999999).nonzero()[0]  # columns of those to remove
4     hits.sort()
5     hits = hits[::-1]
6     for i in hits:
7         print(i)
8         N = len(G)
9         newG = np.zeros((N-1, N-1))
10        newp = np.zeros((N-1, N-1))
11        newG[:i, :i] = G[:i, :i] + 0
12        newG[:i, i:] = G[:i, i+1:] + 0
13        newG[i:, :i] = G[i+1:, :i] + 0
14        newG[i:, i:] = G[i+1:, i+1:] + 0
15        newp[:i, :i] = p[:i, :i] + 0
16        newp[:i, i:] = p[:i, i+1:] + 0
17        newp[i:, :i] = p[i+1:, :i] + 0
18        newp[i:, i:] = p[i+1:, i+1:] + 0
19        a = akeys.pop(i)  # remove this actor
20        G = newG + 0
21        p = newp + 0
22    return G, p
23
24 >>> akeys = list(actors.keys())
25 >>> G1, p = bacon.RemoveBad(f, G, p, akeys, 0)

the path starts at entity 8, passes through entity 412, and ends at entity 10. Using akeys
it is determined that the corresponding aid values are 9, 421, and 11. These correspond to
the actors Tom Hanks, Martin Sheen and Michael J. Fox. This is the shortest path between
Hanks and Fox.

Code 32.11 The path from Hanks to Fox.

1 >>> f1, p1 = bacon.RunFloyd(G1)
2 >>> tpath = bacon.floyd.FindPath(p1, 8, 10)
3 >>> tpath
4 [8, 412, 10]
5 >>> akeys[412]
6 421

Using this method it is possible to discover the shortest path between any two actors.
The shortest path between two actors is their geodesic distance. The goal of Query
30 is to find the longest geodesic distance: a pair of actors with a very long
shortest path. Again the information is readily available, since the geodesic distances
are in the matrix f1. The location of the maximum value indicates which two actors are at
each end of this path.
There may be several pairs of actors that have the same geodesic distance. The
function Trace in Code 32.12 finds one of those pairs and prints out the actors' names
and movies. In order to get this information it is necessary to send several commands to
the database; this is the string act inside the for loop. The result allows the
user to find the path that forms the longest geodesic distance between two actors. In this case
the actor path is: Arliss Howard, Debra Winger, Nick Nolte, Jack Black, Pierce Brosnan,
Robbie Coltrane, Shirley Henderson and Mads Mikkelsen. The longest geodesic path is
between the actors Arliss Howard and Mads Mikkelsen. There are other pairs of actors
with the same geodesic length.

32.3 Problems

1. Connect Python to MySQL and perform Query Q1.

2. Repeat problem 1 for any of the queries in the list in Section 28.2.

3. The path in Code 32.11 indicated which actors were in the trace but not their com-
mon movies. Write a Python script that accesses the database to find the common
movies for the actors in this list.

4. How many unique pairs of actors have the longest geodesic distance?

Code 32.12 The Trace function.

1 # bacon.py
2 def Trace(f1, p1, akeys, actors, c):
3     N = len(f1)
4     V, H = divmod(f1.argmax(), N)
5     print('Max', V, H, f1[V, H])
6     tpath = floyd.FindPath(p1, V, H)
7     aid = np.array(tpath).astype(int)
8     for i in aid:
9         ii = akeys[i]
10        print('Actor:', ii, actors[ii])
11        act = 'SELECT m.name FROM movies AS m, isin WHERE isin.mid=m.mid '
12        act += 'AND isin.aid=' + str(ii)
13        c.execute(act)
14        ans = c.fetchall()
15        for k in ans:
16            print('\t', k[0])

Index

k-means, 383, 386, 387, 389–391, 399, 401 alignment


LATEX, 22 local, 294
getitem , 179 AllDcts, 411
init , 177 AllDistances, 251
setitem , 179 AllFiles, 203
str , 179 AllWordDict, 406
AlphaAnn, 310
abigel alphabet, 274
Driver, 139 ALTER, 456
ReadData, 138 amino acid, 210, 313
ReadPBAS, 138 AND, 464
ReadRecord, 136 area, 9
SaveData, 139 argmax, 149, 288
ABS, 464 argmin, 149
absolute reference, 36 argsort, 149, 356, 384
Access, 439 argument, 161
ACOS, 464 array, 141
acos, 78 array, 142
acosh, 78 arrow, 288
Add2Contig, 339 arrow matrix, 283
ADDDATE, 476 AS, 468, 474
ADDTIME, 476 ASC, 473
adenine, 210 ASCII, 133, 406, 469
algebra, 5 ascii lowercase, 324
aligngreedy ASIN, 464
Add2Contig, 339 asin, 78
Assemble, 343 asinh, 78
AssembleML, 343 ASN.1, 227, 235, 236, 239
ChopSeq, 333 Assemble, 343
Finder, 339 AssembleML, 343
JoinContigs, 342 assembly, 337, 346
NewContig, 337 AssignMembership, 388, 396
ShiftedSeqs, 337 ATAN, 464
ShowContigs, 337 atan2, 394

517
atan, 78 BruteForceSlide, 281
ATAN2, 464 BruteForceSlide, 280, 336
atan2, 78, 394 byte, 131
atanh, 78
atomicity, 422 Candlesticks, 269
auto-correlation, 184, 186 CASE, 478, 479
AUTO INCREMENT, 454 CAST, 462, 478
AVERAGE, 39, 428, 429 CatSeq, 348
AVG, 466, 468 CData, 383
CDS, 231
Backtrace, 287 CEIL, 464
bacon CEILING, 464
Connect, 509 cell, 209
DumpActors, 509 CHAR, 460, 470
MakeG, 509 CHAR LENGTH, 469
RemoveBadNoP, 510 CheapClustering, 385
RunFloyd, 509 CheckForStartsStops, 216
Trace, 513 child node, 359
Base, 445 ChopSeq, 333
bell curve, 47, 48, 186, 370 chr, 133, 406
BestPairs, 345 chromosome, 209
BETWEEN...AND, 464 circle, 10
bibtex, 29 area, 10
big endian, 132 cite, 29
BIGINT, 456 CiteSeer, 405
BIN, 470 citeseer, 405
BINARY, 461, 462, 478 class, 173
binary tree, 359, 364 client, 439
BIT, 460 ClusterAverage, 389, 396
bit, 131 clustering, 383
BIT AND, 466 CData, 383
BIT LENGTH, 469 CheapClustering, 385
BIT OR, 466 CompareVecs, 384
BIT XOR, 466 ClusterVar, 387
BLAST, 283 Coding, 223
BLOB, 461 codon, 210, 244, 265
BLOSUM, 277, 283, 297, 299 codon frequency, 244
blosum codonfreq
BlosumScore, 312 Candlesticks, 269
BLOSUM50, 278 CodonFreqs, 267
BlosumScore, 280 CodonTable, 265
BlosumScore, 279, 298, 299, 312 CountCodons, 266
break, 100 GenomeCodonFreqs, 268
Brodatz, 252 CodonFreqs, 267

518
CodonTable, 265 volume, 11
coefficient, 7 CURDATE, 476, 477
CompareVecs, 384 CURRENT DATE, 476
Complement, 216, 234 CURRENT TIME, 476
complement, 89, 331 CURRENT TIMESTAMP, 476
complex, 75 CURTIME, 476
CONCAT, 470 cylinder, 11
CONCAT WS, 470 volume, 11
concurrent access, 422 cytoplasm, 209
Connect, 509 cytosine, 210, 219
consensus string, 312
ConsensusCol, 346 data isolation, 422
constituents, 383 data redundancy, 421
constructor, 177 DATE, 460, 476
contig, 332, 337, 346 DATE ADD, 476
continue, 100 DATE FORMAT, 476
CONVERT, 457, 462, 478 DATE SUB, 476
Convert, 367 DATEDIFF, 476
CONVERT TZ, 476 DATETIME, 460
copy, 324 DAY, 476
correlate, 184 DAYNAME, 476
COS, 464 DAYOFMONTH, 476
cos, 78 DAYOFWEEK, 476
cosh, 78 DAYOFYEAR, 476
cost function, 302, 310, 319, 325 DBMS, 421
CostAllGenes, 348, 350 decidetree
CostFunction, 302, 312, 319, 325 FakeDtreeData, 374
COT, 464 ScoreParam, 375
COUNT, 466, 467, 476 DECIMAL, 459
count, 88 decimal, 458
CountCodons, 266 decision tree, 369, 371, 374
COUNTIF, 41 DecoderDict, 237
cov, 243 def, 159
covariance, 243, 247 default argument, 162
covariance matrix, 192, 242 default value, 460
CRC32, 464 DEGREES, 464
CREATE DATABASE, 453 degrees, 105
CREATE TABLE, 454 degrees, 78
CreateIlist, 292 deoxyribonucleic acid, 209
cross product, 16, 145 dependent variable, 6, 7
cross references, 25 DESC, 473
CrossOver, 320, 326 DESCRIBE, 454
CSV, 123 dictionary, 81, 406
cube, 6, 11 dimredux

519
AllDistances, 251 EXP, 464
PCA, 250 EXPORT SET, 470
Project, 250 EXTRACT, 476
DISTINCT, 466, 467, 487, 493
DIV, 463 FakeDtreeData, 374
divmod, 150, 337 FASTA, 227
DNA, 209 FastFloyd, 395
DNAFromASN1, 237 FastFloydP, 509
dot, 146 FastMat, 336
dot product, 16, 145, 319 FastNW, 292
double helix, 209 FastSubValues, 288
DriveGA, 322 FastSW, 294
Driver, 139 fetchall, 507
DriveSortGA, 329 fetchmany, 507
DROP TABLE, 454 fetchone, 506
dump, 114 FIELD, 469
DumpActors, 509 fields, 437
dynamic programming, 283 file, 111
dynprog file pointer, 111
Backtrace, 287 filter, 427
CreateIlist, 292 FIND, 431
FastNW, 292 find, 88, 175
FastSubValues, 288 FIND IN SET, 470
FastSW, 294 Finder, 339
ScoringMatrix, 286 FindKeywordLocs, 232
SWBacktrace, 294 FindKeywordLocs, 232
dynprog, 286 FiveLetterDict, 408
FLOAT, 459
eig, 246 float, 75
eigenvalue, 245, 247 floating point, 458
eigenvector, 245, 247 FLOOR, 464
elif, 98 Floyd-Warshall, 395, 508, 509
ELSE, 479 for, 99
else, 96 FORMAT, 469
ELT, 470 from import, 169
ENUM, 461, 462 FROM DAYS, 476
enumerate, 102 FROM UNIXTIME, 477
Excel, 33 FULLTEXT, 480
exec, 171 function, 159
execfile, 171 functional dependency, 500
execmany, 507
execute, 506 GA, 317
executemany, 506 ga
exons, 211 CostFunction, 319

520
CrossOver, 320, 326 GnuPlotFiles, 394
DriveGA, 322 Save, 137, 201
Mutation, 321 GnuPlot, 137, 269, 385
gap, 283 GnuPlotFiles, 390, 394
gasort GoodWords, 412
CostFunction, 325 GoPolar, 393
DriveSortGA, 329 GRANT, 457
Jumble, 325 greedy algorithm, 331, 346, 385, 390
Legalize, 326 GROUP BY, 475, 503
Mutate, 328 GROUP BY ... HAVING, 475
guanine, 210, 219
Gaussian, 47, 48
Gaussian distribution, 186, 224, 370 hash table, 81
GC content, 219 helix, 209
GCcontent, 219 help, 163
gccontent HEX, 469
Coding, 223 hex, 133
GCcontent, 219 hexadecimal, 131
Noncoding, 222 hexdump, 132
Precoding, 224 hline, 28
StatsOf, 222 Hoover, 406
Genbank, 227, 229, 232 Hoover, 406
genbank HOUR, 477
Complement, 234 hypot, 78
FindKeywordLocs, 232
GeneLocs, 233 identity matrix, 148
GetCodingDNA, 234 IDLE, 213
ParsDNA, 230 IF, 39, 478, 479
ReadFile, 229 if, 95
Translation, 234 IFNULL, 479, 480
gene expression array, 53 importlib
GeneLocs, 233 reload, 169
genetic algorithm, 317, 345 includegraphics, 26
GenomeCodonFreqs, 268 indel, 274
geodesic distance, 395 indiana
GEOMETRY, 461 Convert, 367
geometry, 5 indices, 151
GET FORMAT, 477 IndicWords, 416
GetCodingDNA, 234 inheritance, 174, 179
GetNames, 203 Init1, 388
global, 160 Init2, 388
global alignment, 294 InitGA, 348
global variable, 160, 176 INNER JOIN, 496
gnu inner product, 16, 145

521
INSERT, 455, 457, 470 LENGTH, 469, 473
INSERT INTO, 455 LibreOffice, 30, 33
instance, 173 LibreOffice Base, 439, 445
INSTR, 470 LIKE, 469, 471
int, 75 LIMIT, 472
integer, 458 limit cycle, 262
introns, 211 linalg
IsoBlue, 258 eig, 246
Isolate, 204 LINESTRING, 461
iteration, 98 linked list, 357–359, 364
Linux, 70
JabRef, 30 list, 81
Java Development Kit, 445 little endian, 132
Java runtime environment, 445 LN, 464
JDK, 445 load, 114
JOIN, 496 LOAD FILE, 471
join, 89, 228, 325 LoadBounds, 214
JoinContigs, 342
LoadDNA, 213
JRE, 445
LoadExcel, 200
Juliet, 91
LoadRGBchannels, 258
Jumble, 325
local alignment, 294
Kevin Bacon Effect, 426, 435 local variable, 160, 176
Kevin Bacon effect, 508 LOCALTIME, 477
key, 81 LOCALTIMESTAMP, 477
keys of relation, 500 LOCATE, 469, 473
Kirchhoff's laws, 156 LOESS, 58, 201
kmeans LOESS, 201
AssignMembership, 388 LOG, 464
ClusterAverage, 389 LOG10, 464
ClusterVar, 387 LOG2, 464
Init1, 388 LONGBLOB, 461
Init2, 388 LONGTEXT, 461
Split, 401 LOWER, 470
lower, 88
LAST DAY, 477 LPAD, 470
law of cosines, 13 LTRIM, 470
law of sines, 13
LCASE, 470 MA, 200
Ldata2Array, 200 MacBeth, 168
LEFT, 429, 470 MAKE SET, 470
LEFT JOIN, 496, 497 MAKEDATE, 477
Legalize, 326 MakeG, 509
LEN, 428, 431 MakeRoll, 390
len, 83, 108 MAKETIME, 477

522
maketrans, 91
mapython
  AllFiles, 203
  GetNames, 203
  Isolate, 204
  Ldata2Array, 200
  LoadExcel, 200
  LOESS, 201
  MA, 200
  Plot, 201
  Select, 204
MATCH, 469
MATCH-AGAINST, 482
math
  acos, 78
  acosh, 78
  asin, 78
  asinh, 78
  atan, 78
  atan2, 78
  atanh, 78
  cos, 78
  cosh, 78
  degrees, 78
  hypot, 78
  pi, 78
  pow, 77
  radians, 78
  sin, 78
  sinh, 78
  sqrt, 77
  tan, 78
  tanh, 78
MathJax, 27
matrix, 13
MAX, 466
max, 149, 288, 356
MEDIUMBLOB, 461
MEDIUMTEXT, 461
messenger RNA, 210
MICROSECOND, 477
Microsoft Access, 439
MID, 470
MiKTeX, 26
MIN, 466
miner
  AllDcts, 411
  AllWordDict, 406
  FiveLetterDict, 408
  GoodWords, 412
  Hoover, 406
  IndicWords, 416
  WordCountMat, 412
  WordFreqMatrix, 414
  WordProb, 414
MINUTE, 477
mitochondrial DNA, 210
MOD, 463, 464
module, 168
MONTH, 477
MONTHNAME, 477
mRNA, 210
multivariate function, 191
multivariate normal, 193
Mutate, 328
Mutation, 321
mutation, 318
Mycobacterium tuberculosis, 221
MySQL, 439
MySQL Workbench, 451
mysqldb, 505

National Institutes of Health, 227
Needleman-Wunsch, 294, 299
Neighbors, 396
NewContig, 337
NIH, 227
non-greedy algorithm, 331
Noncoding, 222
nongreedy
  BestPairs, 345
  CatSeq, 348
  ConsensusCol, 346
  CostAllGenes, 348, 350
  InitGA, 348
  RunGA, 350
  SwapMutate, 350
nonzero, 143, 386
normal, 190
NOT, 464
NOT LIKE, 469
NOT REGEXP, 471
NOW, 477
nucleotide, 209
nucleus, 209
NULL, 464
NULLIF, 479
NUMERIC, 459
NumPy, 263
numpy, 72, 103, 141, 183, 246, 356
  array, 142
  atan2, 394
  cov, 243
  ones, 141
  zeros, 141

object, 173
object-oriented programming, 173
OCT, 470
OFFSET, 432
ones, 141
online Python, 72
open, 111, 114
OpenGIS, 461
openpyxl, 126
OR, 464
ORD, 469
ord, 133
ORDER BY, 472
orthonormal, 246, 263
OUTER JOIN, 496, 498
outer product, 16, 145

pack, 134
PAM, 277, 283
parent node, 359
ParseDNA, 230
pass, 165
PCA, 241, 246, 247, 263, 271
PCA, 250
pca
  LoadRGBchannels, 258
  ScrambleImage, 254
  Unscramble, 255
PDF, 405
PERIOD ADD, 477
PERIOD DIFF, 477
PI, 464
pi, 78
pickle, 113
  dump, 114
  load, 114
pip, 70
Plot, 201
POINT, 461
polar coordinates, 12
POLYGON, 461
polynomial, 6
pop, 84, 386
Porter Stemming, 408, 409
POSITION, 469
POW, 464
pow, 77
POWER, 464
power terms, 5
Precoding, 224
primary key, 423, 438, 449
principal component analysis, 241, 247
Project, 250
protein, 210, 312
pymysql, 505
pyPdf, 406
Pythagorean theorem, 9, 12, 14
Python Imaging Library, 72
pythonanywhere, 72

QUARTER, 477
QUERY EXPANSION, 482
QUOTE, 471

RADIANS, 464
radians, 105
radians, 78
RAND, 464
rand, 142, 183
random
  choice, 193
  rand, 183
  ranf, 183
  shuffle, 195, 324
random, 102, 142
random number, 183
random numbers, 190
random slicing, 144
RandomLetter, 313
RandomSwap, 309
ranf, 142, 183
range, 99, 108, 163
ReadData, 138
ReadFile, 229
ReadPBAS, 138
ReadRecord, 136
REAL, 459
rectangle
  area, 9
rectilinear coordinates, 12
reference, 34
  absolute, 36
  relative, 35
REGEXP, 464, 471
relative reference, 35
reload, 169
remove, 84
RemoveBadNoP, 510
REPEAT, 470
REPLACE, 470
replace, 89
return, 164
REVERSE, 470
rfind, 88
rgbpca
  IsoBlue, 258
ribosome, 210
RIGHT, 470
RIGHT JOIN, 496, 498
right triangle, 9
RLIKE, 471
Romeo, 91
Romeo and Juliet, 91, 166
ROUND, 464
round, 75
RPAD, 470
RTRIM, 470
RunAnn, 302, 313
RunFloyd, 509
RunGA, 350
RunKMeans, 390

Save, 137, 201
SaveData, 139
scatter plot, 43
schema, 438, 485
scipy, 72, 141, 184
ScoreParam, 375
scoring matrix, 283
ScoringMatrix, 286
ScrambleImage, 254
SEC TO TIME, 477
SECOND, 477
security, 422
seek, 113
SELECT, 457
Select, 204
sensitivity analysis, 262
server, 439
SET, 461
set, 83
set, 93, 412
set printoptions, 142
ShiftedSeqs, 337
SHOW TABLE, 454
ShowContigs, 337, 342
shuffle, 309
shuffle, 195, 324
SIGN, 464
signed integer, 459
simann1
  CostFunction, 302
  RunAnn, 302
simann2
  AlphaAnn, 310
simann3
  RandomSwap, 309
simann4
  CostFunction, 312
  RandomLetter, 313
  RunAnn, 313
simplealign
  BlosumScore, 279
  BruteForceSlide, 280
  SimpleScore, 276
SimpleScore, 276
simulated annealing, 301, 302, 313
SIN, 464
sin, 78
sinh, 78
slicing, 83, 86
Smith-Waterman, 294, 296, 297, 299
Solver, 48, 50
sort, 357
SOUNDEX, 469
SOUNDS LIKE, 464, 469
SPACE, 470
sphere, 10
  area, 10
  volume, 11
splice, 211
Split, 401
split, 123
split, 88, 116
spreadsheet, 33, 121
SQRT, 464
sqrt, 77
square, 5
square root, 6
standard deviation, 187, 242
start codon, 211
StatsOf, 222
STD, 466
STDDEV, 466
STDDEV POP, 466
STDDEV SAMP, 466
STDEV, 39, 429
STR TO DATE, 477
STRCMP, 469
string, 86, 460
  ascii lowercase, 324
  count, 88
  find, 88, 175
  join, 89, 228
  lower, 88
  maketrans, 91
  replace, 89
  rfind, 88
  split, 88, 123
  translate, 91
  upper, 88
struct
  unpack, 134
SUBDATE, 477
subquery, 500
SUBSTR, 470, 472
SUBSTRING, 470
SUBSTRING INDEX, 469
SUBTIME, 477
suffix tree, 410
SUM, 39, 466
sum, 148
SwapMutate, 350
SWBacktrace, 294
swissroll
  AssignMembership, 396
  FastFloyd, 395
  GnuPlotFiles, 390
  GoPolar, 393
  MakeRoll, 390
  Neighbors, 396
  RunKMeans, 390
SYSDATE, 477

table, 28
tabular, 28
TAN, 464
tan, 78
tanh, 78
tell, 113
terminal node, 359
TestData, 313
TEXT, 461
thymine, 210
TikZ, 26
TIME, 460, 477
TIME FORMAT, 477
TIME TO SEC, 477
TIMEDIFF, 477
TIMESTAMP, 477
TIMESTAMPADD, 477
TIMESTAMPDIFF, 477
TINYBLOB, 461
TINYTEXT, 461
TO DAYS, 477
Trace, 513
translate, 91
Translation, 234
transpose, 146
tree, 355
Trendline, 45, 48
triangle, 9, 10
trigonometry, 5
TRIM, 470
TRUNCATE, 464
tuple, 80
Tybalt, 91

Ubuntu, 71
UCASE, 470
UNHEX, 470
UNIX TIMESTAMP, 477
unpack, 134
Unscramble, 255
unsigned, 459
Unweighted Pair Group Method with Arithmetic Mean, 364
UPDATE, 456
UPGMA, 364, 366, 368
UPPER, 470
upper, 88
uracil, 210
Ureaplasma parvum serovar, 245
UTC DATE, 477
UTC TIME, 477
UTC TIMESTAMP, 477

value, 81
VALUES, 455
VAR POP, 466
VAR SAMP, 466
VARBINARY, 461
VARCHAR, 461
variable, 5
VARIANCE, 466
variance, 242
vector, 13
volume, 11

WEEK, 477
WEEKDAY, 477
WEEKOFYEAR, 477
WHERE, 463
while, 99, 359
word, 131
WordCountMat, 412
WordFreqMatrix, 414
WordProb, 414

xlrd, 125
XOR, 464

YEAR, 477
YEAR(2), 460
YEARWEEK, 477

zeros, 141, 142