Computational Methods For Bioinformatics in Python 3.4
Text and images (excepting those attributed to other sources) copyright © 2016 (1st edition), 2017 (2nd edition) by Jason M. Kinser
Front cover copyright ©2017 by Jason M. Kinser
This document is intended for educational purposes and may not be freely distributed in
written or electronic form without the express written consent of the author.
Python scripts are provided as an educational tool. They are offered without guarantee
of effectiveness or accuracy. Python scripts composed by the author may not be used
for commercial purposes without the author's explicit written permission.
Feedback
This is an active document: it will be updated as the sciences, algorithms, and
Python scripting methods change. The author appreciates kind notes that inform him
of needed alterations and detected errors. Please send comments, suggestions, and
error reports to: [email protected]
Versions
Version 1.0 September 1, 2016
Version 2.0 January 20, 2017
Dedication
This book is dedicated to
Dr. Wallace A. Hilton
and
Dr. Charles D. Geilker
both of whom encouraged young scientists to write.
Contents
Contents i
Preface 1
1 Mathematics Review 5
1.1 Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Power Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Calculator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Quadratic Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Trigonometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.2 Triangles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1 Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.2 Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.3 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.4 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Scientific Writing 19
2.1 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.2 Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.3 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.4 Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Word Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 MS Word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 LaTeX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2.1 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2.2 Title . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2.3 Headings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2.4 Cross References . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2.5 Figures and Captions . . . . . . . . . . . . . . . . . . . . . 26
2.2.2.6 Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2.7 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2.8 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2.9 Final Comments . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.3 LibreOffice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4.1 Google Docs . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4.2 AbiWord . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.4.3 Zoho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.4.4 WPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 The Sum Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Statistical Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.3 Comparison Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Creating Basic Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Function Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Trendline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.2 Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Python Installation 69
5.1 Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.2 Mac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.3 UNIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Setting up a Directory Structure . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Online Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Python Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3.1 Tuple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3.2 List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.3 Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3.4 Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.5 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4.1 String Definition and Slicing . . . . . . . . . . . . . . . . . . . . . . 86
6.4.1.1 Special Characters . . . . . . . . . . . . . . . . . . . . . . . 87
6.4.1.2 Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4.1.3 Repeating Characters . . . . . . . . . . . . . . . . . . . . . 87
6.4.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.2.1 Replacing Characters . . . . . . . . . . . . . . . . . . . . . 89
6.4.2.2 Replacing Characters with a Table . . . . . . . . . . . . . . 90
6.5 Converting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.6 Example: Romeo and Juliet . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.3.4 Example: Compute π . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.3.5 Example: Summation Equations . . . . . . . . . . . . . . . . . . . . 108
7.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11.3 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.4 Mathematics and Some Functions . . . . . . . . . . . . . . . . . . . . . . . . 144
11.5 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.6 Example: Extract Random Numbers Above a Threshold . . . . . . . . . . . 150
11.7 Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.8 Example: Simultaneous Equations . . . . . . . . . . . . . . . . . . . . . . . 155
14.3 Gaussian Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.3.1 Gaussian Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.3.2 Gaussian Distributions in Excel . . . . . . . . . . . . . . . . . . . . . 187
14.3.3 Histogram in Python . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.3.4 Random Gaussian Numbers . . . . . . . . . . . . . . . . . . . . . . . 190
14.4 Multivariate Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
14.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
14.5.1 Dice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
14.5.2 Cards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
14.5.3 Random DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
17.3.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
20.1 Codon Frequency Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
20.1.1 Codon Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
20.1.2 Codon Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
20.1.3 Codon Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
20.1.4 Frequencies of a Genome . . . . . . . . . . . . . . . . . . . . . . . . 267
20.2 Genome Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
20.2.1 Single Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
20.2.2 Two Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
20.3 Comparing Multiple Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . 271
21.7 Optimality in Dynamic Programming . . . . . . . . . . . . . . . . . . . . . 296
21.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
24 Multiple Sequence Alignment 331
24.1 Multiple Alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
24.2 The Greedy Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
24.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
24.2.2 Theory of the Assembly . . . . . . . . . . . . . . . . . . . . . . . . . 334
24.2.3 An Intricate Example . . . . . . . . . . . . . . . . . . . . . . . . . . 334
24.2.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
24.2.3.2 Pairwise Alignments . . . . . . . . . . . . . . . . . . . . . . 335
24.2.3.3 Initial Contigs . . . . . . . . . . . . . . . . . . . . . . . . . 337
24.2.3.4 Adding to a Contig . . . . . . . . . . . . . . . . . . . . . . 339
24.2.3.5 Joining Contigs . . . . . . . . . . . . . . . . . . . . . . . . 341
24.2.3.6 The Assembly . . . . . . . . . . . . . . . . . . . . . . . . . 343
24.3 The Non-Greedy Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
24.3.1 Creating Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
24.3.2 Steps in the Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . 348
24.3.3 The Test Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
24.3.4 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
24.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
25 Trees 355
25.1 Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
25.2 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
25.3 Linked Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
25.4 Binary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
25.5 UPGMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
25.6 Non-Binary Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
25.7 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
25.7.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
25.7.2 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
25.7.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
25.7.2.2 Scoring a Parameter . . . . . . . . . . . . . . . . . . . . . . 375
25.7.2.3 A Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
25.7.2.4 The Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
25.7.2.5 A Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
26 Clustering 383
26.1 Purpose of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
26.2 k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
26.3 More Difficult Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
26.3.1 New Coordinate System . . . . . . . . . . . . . . . . . . . . . . . . . 393
26.3.2 Modification of k-means . . . . . . . . . . . . . . . . . . . . . . . . . 394
26.4 Dynamic k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
26.5 Comments on k-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
26.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
IV Database 419
28.2 The Query List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
28.3 Answering the Queries in a Spreadsheet . . . . . . . . . . . . . . . . . . . . 426
30.6.1 Math Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
30.6.2 Math Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
30.6.3 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
30.6.4 Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
30.6.5 Aggregate Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
30.6.6 Sample Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
30.7 String Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
30.8 Limits and Sorts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
30.9 Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
30.10 Time and Date . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
30.11 Casting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.12 Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.12.1 CASE-WHEN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
30.12.2 The IF Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
30.12.3 The IFNULL Statement . . . . . . . . . . . . . . . . . . . . . . . . . 480
30.12.4 Natural Language Comparisons . . . . . . . . . . . . . . . . . . . . . 480
30.13 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
31.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
List of Figures
3.6 A better way of creating several similar computations. . . . . . . . . . . . . 37
3.7 All formulas in column C reference cell B1. . . . . . . . . . . . . . . . . . . 37
3.8 Changing the name of a cell. . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.9 Using the named cell in referenced computations. . . . . . . . . . . . . . . . 39
3.10 Computing the sum of a set of values. . . . . . . . . . . . . . . . . . . . . . 40
3.11 Computing the average and standard deviation. . . . . . . . . . . . . . . . . 40
3.12 Constructing an IF statement. . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.13 Copying the formula to cells in column C. . . . . . . . . . . . . . . . . . . . 42
3.14 Using the COUNTIF function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.15 Creating a line graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.16 Creating a scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.17 Altering the x axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.18 Accessing the Trendline tool. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.19 Trendline interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.20 Perfect fit trendline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.21 Trendline shown with noisy data. . . . . . . . . . . . . . . . . . . . . . . . . 47
3.22 Raw data which is a noisy bell curve. . . . . . . . . . . . . . . . . . . . . . . 48
3.23 The spreadsheet architecture for Solver. . . . . . . . . . . . . . . . . . . . . 49
3.24 The Solver interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.25 Plots of the original data and the Solver estimate. . . . . . . . . . . . . . . 50
4.11 Plot of the data with the average removed. . . . . . . . . . . . . . . . . . . 61
4.12 A partial view of data from all of the files after LOESS normalization. . . . 62
4.13 The average and standard deviation of the first three files. . . . . . . . . . . 62
4.14 The data after the average is subtracted. . . . . . . . . . . . . . . . . . . . . 62
4.15 The data after division by the standard deviation. . . . . . . . . . . . . . . 63
4.16 Data available to answer the male-only question. . . . . . . . . . . . . . . . 63
4.17 Comparing the values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.18 Accessing conditional formatting. . . . . . . . . . . . . . . . . . . . . . . . . 64
4.19 Changing the format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.20 Partial results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
12.3 Directory structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
12.4 Creating a new file in IDLE. . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
12.5 Contents of a module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
12.6 Changing the contents of a module. . . . . . . . . . . . . . . . . . . . . . . . 170
16.1 A simple depiction of a cell with a nucleus, cytoplasm, nuclear DNA and
mitochondrial DNA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
16.2 A caricature of the double helix nature of DNA. . . . . . . . . . . . . . . . 210
16.3 The ribosome travels along the DNA using codon information to create a
chain of amino acids. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
16.4 Codon to Amino Acid Conversion. . . . . . . . . . . . . . . . . . . . . . 212
16.5 Spliced segments of the DNA are used to create a single protein. . . . . . . 213
18.1 FASTA file example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
18.2 Genbank file example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
18.3 Genbank file example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
18.4 Information on an individual gene. . . . . . . . . . . . . . . . . . . . . . . . 231
18.5 Indications of complements and joins. . . . . . . . . . . . . . . . . . . . . . 233
20.1 The statistics for an entire genome. . . . . . . . . . . . . . . . . . . . . . . . 269
20.2 The statistics for the first 20 codons for two genomes. . . . . . . . . . . . . 270
20.3 PCA mapping for several bacterial genomes. . . . . . . . . . . . . . . . . . . 271
21.1 The first column and row are filled in. . . . . . . . . . . . . . . . . . . . . . 284
21.2 The S1,1 cell is filled in. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
21.3 The lines indicate which elements are computed in a single Python command. 291
21.4 A pictorial view of the scoring matrix. Darker pixels relate to higher values
in the matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
25.15 The third iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . 366
25.16 The third iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . 367
25.17 The tree for the results in Code 25.9. . . . . . . . . . . . . . . . . . 367
25.18 A nonbinary tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
25.19 Data distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
25.20 A decision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
25.21 Closer to reality. . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
25.22 Distribution of people for three variables. . . . . . . . . . . . . . . . 372
25.23 A decision node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
25.24 A decision tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
28.11 A portion of the window that shows the average for each year. . . . . . . 432
28.12 The movies of aid = 281. . . . . . . . . . . . . . . . . . . . . . . . . 433
28.13 The logic flow for obtaining the name of a movie from two actors. . . . . 434
28.14 Finding the common elements in two lists. . . . . . . . . . . . . . . . . 434
28.15 Counting the number of movies for each actor. . . . . . . . . . . . . . . 435
List of Tables
30.7 Aggregate functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
30.8 Pattern matching string operators. . . . . . . . . . . . . . . . . . . . . . . . 469
30.9 Informative string operators. . . . . . . . . . . . . . . . . . . . . . . . . . . 469
30.10 Informative string operators. . . . . . . . . . . . . . . . . . . . . . . 470
30.11 Substring operators. . . . . . . . . . . . . . . . . . . . . . . . . . . 470
30.12 Capitalization operators. . . . . . . . . . . . . . . . . . . . . . . . . 470
30.13 Alteration operators. . . . . . . . . . . . . . . . . . . . . . . . . . . 470
30.14 Miscellaneous operators. . . . . . . . . . . . . . . . . . . . . . . . . 471
30.15 Casting operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.16 Decision operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
Python Codes
6.20 A dictionary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.21 Accessing data in a dictionary. . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.22 Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.23 Length of a collection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.24 Slicing examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.25 More slicing examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.26 Accessing a collection inside of a collection. . . . . . . . . . . . . . . . . . . 85
6.27 Insertion into a list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.28 The pop function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.29 The remove function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.30 Creating a string. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.31 Simple slicing in strings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.32 Special characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.33 Concatenation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.34 Repeating characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.35 Using the find function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.36 Using the count function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.37 Conversion to upper or lower case. . . . . . . . . . . . . . . . . . . . . . . . 89
6.38 Using the split and join functions. . . . . . . . . . . . . . . . . . . . . . . . 89
6.39 Using the replace function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.40 Creating a complement string. . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.41 Using the maketrans and translate functions. . . . . . . . . . . . . . . . . 91
6.42 Converting data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.43 Counting names in the play. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.44 The first Romeo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.45 Counting Romeo and Juliet at the end of sentences. . . . . . . . . . . . . . 92
6.46 Collecting individual words. . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.1 The skeleton for a for loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.2 The if statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3 Two commands inside of an if statement. . . . . . . . . . . . . . . . . . . . 96
7.4 Using the else statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.5 A compound statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.6 A compound statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.7 Using parenthesis in a compound statement. . . . . . . . . . . . . . . . . . . 98
7.8 Using the elif statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.9 Using a while loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.10 Using a for loop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.11 The range function in Python 2.7. . . . . . . . . . . . . . . . . . . . . . . . 100
7.12 Using the break statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.13 Using the continue statement. . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.14 Using the enumerate function. . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.15 Generating random numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.16 Collecting random numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.17 Computing the average. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.18 A more efficient method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.19 Loading Romeo and Juliet. . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.20 Capturing all of the words that follow ‘the ’. . . . . . . . . . . . . . . . . . . 104
7.21 Isolating unique words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.22 Computations for the sliding block. . . . . . . . . . . . . . . . . . . . . . . . 106
7.23 Computing π with random numbers. . . . . . . . . . . . . . . . . . . . . . . 107
7.24 The initial data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.25 Summing the values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.26 More efficient code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.27 Code for the average function. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.1 Reading a file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2 Accessing files in another directory. . . . . . . . . . . . . . . . . . . . . . . . 112
8.3 Opening a file for writing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.4 Opening a file for writing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.5 Using the seek command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.6 Saving data with the pickle module. . . . . . . . . . . . . . . . . . . . . . . 114
8.7 Loading data from the pickle module. . . . . . . . . . . . . . . . . . . . . . 114
8.8 Reading the DNA file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.9 Counting the occurrences of the letter ‘t’. . . . . . . . . . . . . . . . . . . . 115
8.10 A sliding window count. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
8.11 The sliding window for the entire string. . . . . . . . . . . . . . . . . . . . . 116
8.12 Reading the sales data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.13 Splitting the data on newline and tab. . . . . . . . . . . . . . . . . . . . . . 118
8.14 Splitting the first data line. . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.15 Converting data to floats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.16 Converting all of the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.1 Loading the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.2 Separating the rows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.3 Determining the columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.4 Gathering the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.5 Using the csv module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.6 Using the xlrd module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.7 Converting the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
9.8 Using openpyxl. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
9.9 Alternative usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
10.1 Using Python for character conversions. . . . . . . . . . . . . . . . . . . . . 133
10.2 ABI version number. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
10.3 Reading the first record. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
10.4 Interpreting the bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.5 The ReadRecord function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.6 The base calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.7 The first data record. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.8 Retrieving the first channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
10.9 The ReadPBAS function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
10.10 The ReadData function. . . . . . . . . . . . . . . . . . . . . . . . . . 138
10.11 The SaveData function. . . . . . . . . . . . . . . . . . . . . . . . . . 139
10.12 The Driver function. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
11.1 Creating a vector of zeros. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.2 Creating other types of vectors. . . . . . . . . . . . . . . . . . . . . . . . . . 142
11.3 Setting the printing precision. . . . . . . . . . . . . . . . . . . . . . 142
11.4 Creating a matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.5 Creating a matrix of random values. . . . . . . . . . . . . . . . . . . . . . . 143
11.6 Extracting elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.7 Extracting a sub-matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.8 Extracting qualifying indexes. . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.9 Extracting qualifying elements. . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.10 Modifying the matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.11 Adding two matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . 145
11.12 Addition of arrays. . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.13 Elemental subtraction and multiplication. . . . . . . . . . . . . . . . . 146
11.14 Dot product. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.15 Matrix dot product. . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.16 Transpose and inverse. . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.17 Matrix inversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.18 Some functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.19 Retrieving information. . . . . . . . . . . . . . . . . . . . . . . . . . 148
11.20 Varieties of summation. . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.21 Finding the maximum value. . . . . . . . . . . . . . . . . . . . . . . . 149
11.22 Using argsort. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
11.23 Using divmod. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
11.24 Extracting qualifying values. . . . . . . . . . . . . . . . . . . . . . . 151
11.25 Using the indices function. . . . . . . . . . . . . . . . . . . . . . . . 152
11.26 Shifting the arrays. . . . . . . . . . . . . . . . . . . . . . . . . . . 153
11.27 The distances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
11.28 The average of an area. . . . . . . . . . . . . . . . . . . . . . . . . . 154
11.29 Solving simultaneous equations. . . . . . . . . . . . . . . . . . . . . . 155
12.1 A basic function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
12.2 Attempting to access a local variable outside of the function. . . . . . . . . 160
12.3 Executing a function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.4 Using the global command. . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.5 Using an argument. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12.6 Using two arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.7 Incorrect use of an argument. . . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.8 A default argument. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.9 Multiple default arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.10 The help function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.11 Adding comments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.12 Using help on a user-defined function. . . . . . . . . . . . . . . . . . 165
12.13 Using the return command. . . . . . . . . . . . . . . . . . . . . . . . . 165
12.14 Returning multiple values. . . . . . . . . . . . . . . . . . . . . . . . 166
12.15 Function outlining. . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
12.16 Adding a command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
12.17 Adding the rest of the commands. . . . . . . . . . . . . . . . . . . . . 167
12.18 Example calls of a function. . . . . . . . . . . . . . . . . . . . . . . 168
12.19 The os and sys modules. . . . . . . . . . . . . . . . . . . . . . . . . . 169
12.20 Importing a module. . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
12.21 Reloading a module. . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
12.22 Using the from ... import construct. . . . . . . . . . . . . . . . . . . 171
12.23 Executing a file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
13.1 A very basic class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.2 A string example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.3 Demonstrating the importance of self. . . . . . . . . . . . . . . . . . . . . 176
13.4 Distinguishing local and global variables. . . . . . . . . . . . . . . . . . . . . 176
13.5 Theoretical code showing implementation of a new definition for the
addition operator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.6 Overloading the addition operator. . . . . . . . . . . . . . . . . . . . . . . . 178
13.7 Examples for overloading slicing and string conversion. . . . . . . . . . . . . 179
13.8 An example of inheritance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.9 Creating new variables after the creation of an object. . . . . . . . . . . . . 181
14.1 A random number. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
14.2 Many random numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
14.3 A correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.4 A histogram in Python. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
14.5 Help on a normal distribution. . . . . . . . . . . . . . . . . . . . . . . . . . 191
14.6 A normal distribution in Python. . . . . . . . . . . . . . . . . . . . . . . . . 191
14.7 A larger distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
14.8 A multivariate distribution in Python. . . . . . . . . . . . . . . . . . . . . . 193
14.9 Computing the statistics of a large multivariate distribution. . . . . . . . . 193
14.10 Random dice rolls. . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
14.11 Random dice rolls. . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
14.12 Distribution of a large number of rolls. . . . . . . . . . . . . . . . . 194
14.13 Random cards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
14.14 Shuffled cards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
14.15 Random DNA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
15.1 The LoadExcel function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
15.2 The Ldata2Array function. . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.3 The MA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.4 The Plot function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
15.5 The LOESS function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
15.6 Processing a single file. . . . . . . . . . . . . . . . . . . . . . . . . 203
15.7 The GetNames function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
15.8 The AllFiles function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
15.9 The Select function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
15.10 The Isolate function. . . . . . . . . . . . . . . . . . . . . . . . . . . 206
16.1 The LoadDNA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
16.2 Using the LoadDNA function. . . . . . . . . . . . . . . . . . . . . . . . . . 214
16.3 The LoadBounds function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
16.4 Length of a gene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
16.5 Considering a complement string. . . . . . . . . . . . . . . . . . . . . . . . . 216
16.6 The CheckForStartsStops function. . . . . . . . . . . . . . . . . . . . . . 217
16.7 The final test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
17.1 The GCcontent function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
17.2 Using the GCcontent function. . . . . . . . . . . . . . . . . . . . . . . . . 220
17.3 Loading data for Mycobacterium tuberculosis. . . . . . . . . . . . . . . 221
17.4 The Noncoding function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
17.5 The StatsOf function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
17.6 The statistics from the non-coding regions. . . . . . . . . . . . . . . . . . . 223
17.7 The Coding function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
17.8 The statistics from the coding regions. . . . . . . . . . . . . . . . . . . . . . 223
17.9 The Precoding function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
17.10 The statistics from the pre-coding regions. . . . . . . . . . . . . . . . 224
18.1 Reading a file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.2 Displaying the contents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.3 Creating a long string. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
18.4 Performing all in a single command. . . . . . . . . . . . . . . . . . . . . . . 229
18.5 The ReadFile function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
18.6 Calling the ParseDNA function. . . . . . . . . . . . . . . . . . . . . . . . . 231
18.7 Using the FindKeyWords function. . . . . . . . . . . . . . . . . . . . . . . 232
18.8 Results from GeneLocs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
18.9 The Complement function. . . . . . . . . . . . . . . . . . . . . . . . . . . 234
18.10 Calling the GetCodingDNA function. . . . . . . . . . . . . . . . . . . . 234
18.11 Using the Translation function. . . . . . . . . . . . . . . . . . . . . . 235
18.12 The opening lines of an ASN.1 file. . . . . . . . . . . . . . . . . . . . 236
18.13 The DNA section in an ASN.1 file. . . . . . . . . . . . . . . . . . . . . 236
18.14 The DecoderDict function. . . . . . . . . . . . . . . . . . . . . . . . . 237
18.15 The DNAFromASN1 function. . . . . . . . . . . . . . . . . . . . . . . . . 238
18.16 DNA locations within an ASN.1 file. . . . . . . . . . . . . . . . . . . . 239
19.1 The covariance matrix of random data. . . . . . . . . . . . . . . . . . . . . . 244
19.2 The covariance matrix of modified data. . . . . . . . . . . . . . . . . . . . . 244
19.3 Testing the eigenvector engine. . . . . . . . . . . . . . . . . . . . . . . . . . 246
19.4 Proving that the eigenvectors are orthonormal. . . . . . . . . . . . . . . . . 247
19.5 The PCA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
19.6 The Project function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
19.7 The AllDistances function. . . . . . . . . . . . . . . . . . . . . . . . . . . 251
19.8 The distance test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
19.9 The first two dimensions in PCA space. . . . . . . . . . . . . . . . . . . . . 253
19.10 The ScrambleImage function. . . . . . . . . . . . . . . . . . . . . . . . 254
19.11 The process of unscrambling the rows. . . . . . . . . . . . . . . . . . . 255
19.12 The Unscramble function. . . . . . . . . . . . . . . . . . . . . . . . . 255
19.13 Various calls to the Unscramble function. . . . . . . . . . . . . . . . . 256
19.14 The LoadImage and IsoBlue functions. . . . . . . . . . . . . . . . . . . 258
19.15 Running a system for 20 iterations. . . . . . . . . . . . . . . . . . . . 260
19.16 Computing data for a limit cycle. . . . . . . . . . . . . . . . . . . . . 262
20.1 The CodonTable function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
20.2 The CountCodons function. . . . . . . . . . . . . . . . . . . . . . . . . . . 267
20.3 Computing the codon frequencies. . . . . . . . . . . . . . . . . . . . . . . . 267
20.4 The CodonFreqs function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
20.5 The GenomeCodonFreqs function. . . . . . . . . . . . . . . . . . . . . . . 268
20.6 Calling the Candlesticks function. . . . . . . . . . . . . . . . . . . . . . . . 269
20.7 Creating plots for two genomes. . . . . . . . . . . . . . . . . . . . . . . . . . 270
21.1 The SimpleScore function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
21.2 Accessing the BLOSUM50 matrix and its associated alphabet. . . . . . . . 278
21.3 Accessing an element in the matrix. . . . . . . . . . . . . . . . . . . . . . . 278
21.4 Accessing an element in the matrix. . . . . . . . . . . . . . . . . . . . . . . 279
21.5 The BlosumScore function. . . . . . . . . . . . . . . . . . . . . . . . . . . 280
21.6 The BruteForceSlide function. . . . . . . . . . . . . . . . . . . . . . . . . 281
21.7 Aligning the sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
21.8 Creating two similar sequences. . . . . . . . . . . . . . . . . . . . . . . . . . 284
21.9 The ScoringMatrix function. . . . . . . . . . . . . . . . . . . . . . . . . . 286
21.10 Using the ScoringMatrix function. . . . . . . . . . . . . . . . . . . . . 287
21.11 The arrow matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
21.12 The Backtrace function. . . . . . . . . . . . . . . . . . . . . . . . . . 289
21.13 The FastSubValues function. . . . . . . . . . . . . . . . . . . . . . . . 290
21.14 The CreateIlist function. . . . . . . . . . . . . . . . . . . . . . . . . 292
21.15 Using the CreateIlist function. . . . . . . . . . . . . . . . . . . . . . 292
21.16 The FastNW function. . . . . . . . . . . . . . . . . . . . . . . . . . . 293
21.17 Using the FastNW function. . . . . . . . . . . . . . . . . . . . . . . . 293
21.18 Results from the FastSW function. . . . . . . . . . . . . . . . . . . . . 295
21.19 A local alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
21.20 An example alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . 296
21.21 Returned alignments are considerably longer than 10 elements. . . . . . . 297
22.1 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . . . . 302
22.2 The RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
22.3 Using the RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . 304
22.4 The GenVectors function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
22.5 The modified RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . 306
22.6 Using the RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . 306
22.7 An example with a decay that is too fast. . . . . . . . . . . . . . . . . . . . 306
22.8 Checking the answer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
22.9 The RandomSwap function. . . . . . . . . . . . . . . . . . . . . . . . . . . 310
22.10 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . 310
22.11 The AlphaAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . 311
22.12 Using the AlphaAnn function. . . . . . . . . . . . . . . . . . . . . . . 311
22.13 An alignment score. . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
22.14 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . 313
22.15 Examples of the cost function. . . . . . . . . . . . . . . . . . . . . . 313
22.16 The TestData function. . . . . . . . . . . . . . . . . . . . . . . . . . 314
22.17 The RandomLetter function. . . . . . . . . . . . . . . . . . . . . . . . 314
22.18 The RunAnn function. . . . . . . . . . . . . . . . . . . . . . . . . . . 315
22.19 Comparing the computed result to the original. . . . . . . . . . . . . . 315
23.1 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . . . . 320
23.2 Employing the CrossOver function. . . . . . . . . . . . . . . . . . . . . . . 321
23.3 Employing the CrossOver function. . . . . . . . . . . . . . . . . . . . . . . 321
23.4 The first elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
23.5 The DriveGA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
23.6 A typical run. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
23.7 Copying textual data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
23.8 The Jumble function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
23.9 Using the Jumble function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
23.10 The CostFunction function. . . . . . . . . . . . . . . . . . . . . . . . 326
23.11 The Legalize function. . . . . . . . . . . . . . . . . . . . . . . . . . 327
23.12 Using the Legalize function. . . . . . . . . . . . . . . . . . . . . . . 328
23.13 The modified Mutate function. . . . . . . . . . . . . . . . . . . . . . . 328
23.14 The DriveSortGA function. . . . . . . . . . . . . . . . . . . . . . . . . 329
24.1 The ChopSeq function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
24.2 Using the ChopSeq function. . . . . . . . . . . . . . . . . . . . . . . . . . . 334
24.3 Extracting a protein. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
24.4 Creating the segments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
24.5 Pairwise alignments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
24.6 Starting the assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
24.7 Use the ShiftedSeqs function. . . . . . . . . . . . . . . . . . . . . . . . . . 337
24.8 Using the NewContig function. . . . . . . . . . . . . . . . . . . . . . . . . 338
24.9 Finding the next largest element. . . . . . . . . . . . . . . . . . . . . . . . . 338
24.10 Creating a second contig. . . . . . . . . . . . . . . . . . . . . . . . . 338
24.11 Determining that the action is to add to a contig. . . . . . . . . . . . 339
24.12 Using the Add2Contig function. . . . . . . . . . . . . . . . . . . . . . 340
24.13 Do nothing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
24.14 The third contig. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
24.15 Adding to a contig. . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
24.16 Locating contigs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
24.17 Joining contigs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
24.18 Showing a latter portion of the assembly. . . . . . . . . . . . . . . . . 342
24.19 The Assemble function. . . . . . . . . . . . . . . . . . . . . . . . . . 344
24.20 Running the assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . 345
24.21 The commands for an assembly. . . . . . . . . . . . . . . . . . . . . . . 346
24.22 Using the BestPairs function. . . . . . . . . . . . . . . . . . . . . . . 346
24.23 Showing two parts of the assembly. . . . . . . . . . . . . . . . . . . . 347
24.24 The ConsensusCol function. . . . . . . . . . . . . . . . . . . . . . . . 348
24.25 The CatSeq function. . . . . . . . . . . . . . . . . . . . . . . . . . . 348
24.26 The InitGA function. . . . . . . . . . . . . . . . . . . . . . . . . . . 349
24.27 The CostAllGenes function. . . . . . . . . . . . . . . . . . . . . . . . 349
24.28 Using the CostAllGenes function. . . . . . . . . . . . . . . . . . . . . 350
24.29 Using the CostAllGenes function for the offspring. . . . . . . . . . . . 350
24.30 The RunGA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
24.31 Using the Assemble function. . . . . . . . . . . . . . . . . . . . . . . 351
25.1 A slow method to find a maximum value. . . . . . . . . . . . . . . . . . . . 356
25.2 Using commands to sort the data. . . . . . . . . . . . . . . . . . . . . . . . 357
25.3 Populating the dictionary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
25.4 Printing the results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
25.5 Initiating a tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
25.6 Creating data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
25.7 Making M and partially filling it with data. . . . . . . . . . . . . . . . . . . 365
25.8 Altering M after the creation of a new vector. . . . . . . . . . . . . . . . . . 366
25.9 The UPGMA function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
25.10 Using the Convert function. . . . . . . . . . . . . . . . . . . . . . . . 369
25.11 The FakeDtreeData function. . . . . . . . . . . . . . . . . . . . . . . . 374
25.12 Using the FakeDtreeData function. . . . . . . . . . . . . . . . . . . . . 375
25.13 Separating the data. . . . . . . . . . . . . . . . . . . . . . . . . . . 375
25.14 Concepts of the ScoreParam function. . . . . . . . . . . . . . . . . . . 376
25.15 The variable and function names in the Node class. . . . . . . . . . . . 377
25.16 The titles in the TreeClass. . . . . . . . . . . . . . . . . . . . . . . 379
25.17 Initializing the Tree. . . . . . . . . . . . . . . . . . . . . . . . . . 379
25.18 The information of the mother node. . . . . . . . . . . . . . . . . . . . 380
25.19 Making the tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
25.20 Comparing the patient to the first node. . . . . . . . . . . . . . . . . 381
25.21 The final node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
25.22 Running a trace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
26.1 The CData function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
26.2 The CompareVecs function. . . . . . . . . . . . . . . . . . . . . . . . . . . 384
26.3 Saving the data for GnuPlot. . . . . . . . . . . . . . . . . . . . . . . . . . . 385
26.4 The CheapClustering function. . . . . . . . . . . . . . . . . . . . . . . . . 386
26.5 The ClusterVar function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
26.6 Initialization functions for k-means. . . . . . . . . . . . . . . . . . . . . . . 388
26.7 The AssignMembership function. . . . . . . . . . . . . . . . . . . . . . . 389
26.8 The ClusterAverage function. . . . . . . . . . . . . . . . . . . . . . . . . . 389
26.9 The KMeans function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
26.10 A typical run of the k-means clustering algorithm. . . . . . . . . . . . 391
26.11 The MakeRoll function. . . . . . . . . . . . . . . . . . . . . . . . . . 392
26.12 The RunKMeans function. . . . . . . . . . . . . . . . . . . . . . . . . . 392
26.13 The GnuPlotFiles function. . . . . . . . . . . . . . . . . . . . . . . . 392
26.14 The GoPolar function. . . . . . . . . . . . . . . . . . . . . . . . . . . 394
26.15 Calling the k-means function. . . . . . . . . . . . . . . . . . . . . . . 394
26.16 The FastFloyd function. . . . . . . . . . . . . . . . . . . . . . . . . . 396
26.17 The Neighbors function. . . . . . . . . . . . . . . . . . . . . . . . . . 397
26.18 The AssignMembership function. . . . . . . . . . . . . . . . . . . . . . 398
26.19 A new problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
26.20 Cluster variances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
26.21 The Split function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
26.22 The final clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . 402
27.1 The Hoover function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
27.2 The AllWordDict function. . . . . . . . . . . . . . . . . . . . . . . . . . . 407
27.3 A list of cleaned words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
27.4 The FiveLetterDict function. . . . . . . . . . . . . . . . . . . . . . . . . . 408
27.5 A few examples that failed in Porter stemming. . . . . . . . . . . . . . 409
27.6 The AllDcts function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
27.7 The GoodWords function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
27.8 The WordCountMat function. . . . . . . . . . . . . . . . . . . . . . . . . 413
27.9 A few statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
27.10 The WordFreqMatrix function. . . . . . . . . . . . . . . . . . . . . . . 415
27.11 The WordProb function. . . . . . . . . . . . . . . . . . . . . . . . . . 415
27.12 The IndicWords function. . . . . . . . . . . . . . . . . . . . . . . . . 416
27.13 Using the IndicWords function. . . . . . . . . . . . . . . . . . . . . . 416
27.14 Scoring documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
29.1 An example query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
29.2 Connecting to MySQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
29.3 Creating a table in MySQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
29.4 Uploading a CSV file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
29.5 Using mysqldump. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
29.6 An example query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
30.1 Creating a database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
30.2 Creating a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
30.3 Showing a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
30.4 Describing a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
30.5 Dropping a table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
30.6 Inserting data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
30.7 Multiple inserts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.8 Altering data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.9 Updating data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
30.10 Granting privileges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
30.11 The basic query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
30.12 Selecting movies in a specified year. . . . . . . . . . . . . . . . . . . . . . 458
30.13 Creating a table with a default value. . . . . . . . . . . . . . . . . . . . . 460
30.14 Creating an enumeration. . . . . . . . . . . . . . . . . . . . . . . . . . . 461
30.15 Example of CAST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
30.16 Example of a math operator. . . . . . . . . . . . . . . . . . . . . . . . . 463
30.17 Example of a math function. . . . . . . . . . . . . . . . . . . . . . . . . 463
30.18 Selecting movies from a grade range. . . . . . . . . . . . . . . . . . . . . 466
30.19 Selecting movies from a year range. . . . . . . . . . . . . . . . . . . . . . 467
30.20 Selecting years with a movie with a grade of 1. . . . . . . . . . . . . . . . 467
30.21 Returning the number of actors from a specified movie. . . . . . . . . . . 468
30.22 The average grade of the movies in the 1950’s. . . . . . . . . . . . . . . . 468
30.23 A demonstration of AS. . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
30.24 Statistics on the length of the movie name. . . . . . . . . . . . . . . . . . 469
30.25 Finding the Keatons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
30.26 Finding the Johns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
30.27 Finding the actors with two parts to their first name. . . . . . . . . . . . 472
30.28 Finding the actors with identical initials. . . . . . . . . . . . . . . . . . . 472
30.29 Example of the LIMIT function. . . . . . . . . . . . . . . . . . . . . . . 473
30.30 Sorting a simple search. . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
30.31 The movies with the longest titles. . . . . . . . . . . . . . . . . . . . . . 474
30.32 Sorting actors by the location of ‘as’. . . . . . . . . . . . . . . . . . . . . 474
30.33 Determining the average grade for each year. . . . . . . . . . . . . . . . 475
30.34 Sorting the years by average grade. . . . . . . . . . . . . . . . . . . . . . 475
30.35 Restricting the search to years with more than 5 movies. . . . . . . . . . 476
30.36 Using CURDATE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.37 Right now. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
30.38 Casting data types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
30.39 Using CASE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
30.40 Using IF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
30.41 Using IFNULL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
30.42 The FULLTEXT operator. . . . . . . . . . . . . . . . . . . . . . . . . . 482
30.43 Load data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
30.44 Using MATCH-AGAINST. . . . . . . . . . . . . . . . . . . . . . . . . . 482
30.45 Using QUERY-EXPANSION. . . . . . . . . . . . . . . . . . . . . . . . . 483
31.1 A query using two tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
31.2 A query using three tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
31.3 The average grade for John Goodman. . . . . . . . . . . . . . . . . . . . . . 488
31.4 Movies in French. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
31.5 Languages of Peter Falk movies. . . . . . . . . . . . . . . . . . . . . . . . . 489
31.6 Movies common to Daniel Radcliffe and Maggie Smith. . . . . . . . . . . . 491
31.7 Radcliffe’s aid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
31.8 Radcliffe’s mid. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
31.9 Radcliffe’s mid with renaming. . . . . . . . . . . . . . . . . . . . . . . . . . 492
31.10 The mids with both Smith and Radcliffe. . . . . . . . . . . . . . . . . . . 493
31.11 The aid of other actors. . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
31.12 Unique actors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
31.13 Actors common to movies with Daniel Radcliffe and Maggie Smith. . . . 495
31.14 The mids for Cary Grant. . . . . . . . . . . . . . . . . . . . . . . . . . . 496
31.15 The titles with ‘under’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
31.16 Inner join with multiple returns. . . . . . . . . . . . . . . . . . . . . . . 498
31.17 Left join with multiple returns. . . . . . . . . . . . . . . . . . . . . . . . 499
31.18 Left excluding joins. [Moffatt, 2009] . . . . . . . . . . . . . . . . . . . . 500
31.19 The movie listed with each actor. . . . . . . . . . . . . . . . . . . . . . . 501
31.20 The use of a subquery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
31.21 Assigning an alias to a subquery. . . . . . . . . . . . . . . . . . . . . . . 502
31.22 The top 5 actors in terms of number of appearances. . . . . . . . . . . . 503
31.23 The actors with the best average scores. . . . . . . . . . . . . . . . . . . 503
32.1 Creating the connection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
32.2 Sending a query and retrieving a response. . . . . . . . . . . . . . . . . . . . 506
32.3 Committing changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
32.4 Sending multiple queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
32.5 Sending multiple queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
32.6 Sending multiple queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
32.7 The DumpActors function. . . . . . . . . . . . . . . . . . . . . . . . . . . 510
32.8 The MakeG function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
32.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
32.10 The RemoveBadNoP function. . . . . . . . . . . . . . . . . . . . . . . . 512
32.11 The path from Hanks to Sheen. . . . . . . . . . . . . . . . . . . . . . . . 513
32.12 The Trace function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
Preface
This textbook is designed for students who have some background in the biological sciences but very little in computer programming. Students are expected to have a beginner’s knowledge of how to use a computer, which includes moving files, understanding file structures, rudimentary office software skills, and a cursory understanding of core computer terms. Python scripting, however, is definitely not a prerequisite.
This book considers three main tools by which computations and data analysis of
biological data may be performed. These core competencies are the use of a spreadsheet,
the use of a computer language, and the use of a database engine. This text assumes
that the reader has very little experience in using a spreadsheet and no experience with a programming language or a database engine.
Advanced readers might find this text a bit frustrating as many of the examples
do not use the most efficient coding possible. The purpose of this text is to relay an
understanding of how algorithms work and how they can be employed. While coding
efficiency is an admirable competency, it is not the aim of this text since often the most
efficient codes are more difficult to understand.
Finally, it should be noted that biology is a vast field with many different areas of research. This text only touches on a few of those areas. It would be nice to write a comprehensive tome on computations in the field of biology, but the author simply does not have that many decades left on this planet.
Part I
Chapter 1
Mathematics Review
Algebra, geometry and trigonometry concepts will be used throughout this book. This
chapter reviews the basic concepts and establishes the notation that will be used in the
following chapters.
1.1 Algebra
Variables allow an equation to describe a general relationship rather than one specific case. For example, the area of a rectangle is,
A = w × h = wh,   (1.1)
where w is the width and h is the height of the rectangle. Frequently, the multiplication sign is omitted as shown. This equation applies to any rectangle instead of just one specific rectangle. Thus, the use of variables is more descriptive of the problem.
Variables are usually a single letter, and subscripts are used to help delineate similar variables. Consider the case where there are two squares of different sizes. They are named
Square 1 and Square 2. The widths of the two objects are then described as w1 and
w2 . Thus, w still represents the width and the subscripts associate the variable to the
respective square.
Power terms are used to indicate repetitive multiplications. The square of a value is defined as,
x² = x × x = (x)(x),   (1.2)
and the cube as,
x³ = x × x × x.   (1.3)
1.1.2 Calculator
Computer operating systems, such as Microsoft Windows, Apple OS X and the various
flavors of UNIX, all come with a calculator. Most of these have various modes including
a scientific mode which contains the trigonometric and algebraic functions.
The calculator from Microsoft Windows is shown in Figure 1.1(a) in its default mode. However, there are other modes available, as shown with the pulldown menu in Figure 1.1(b). The selection of the scientific mode presents a different calculator, which is displayed in Figure 1.1(c). This offers trigonometric and algebraic functions.
1.1.3 Polynomials
A common polynomial is the second order polynomial,
y = ax² + bx + c,   (1.5)
where a, b and c are coefficients. The independent variable is x and the dependent variable
is y. Figure 1.2 shows the graph for the case of a = 0.02, b = −0.02 and c = 1. The
variable a controls the amount of bend in the graph, the variable b controls the horizontal location and affects the vertical location, and the variable c affects the vertical location. A description
of this plot is “y vs. x” which displays the dependent variable versus the independent
variable.
The input space is not restricted to a single independent variable. In the case of,
z = x + y − 6xy, (1.6)
the two independent variables are x and y while the dependent variable is z. This creates
a surface plot as shown in Figure 1.3. The x and y axes are the horizontal axes and the z axis corresponds to the vertical axis.
1.1.4 Quadratic Solution
A common task is to solve a second order polynomial that has been set equal to zero,
0 = ax² + bx + c,   (1.7)
and in many instances the values of a, b and c are known but the value of x is unknown.
There are actually two solutions to this equation since it is a second order polynomial. A third order polynomial, such as
0 = ax³ + bx² + cx + d,   (1.8)
has three solutions.
Figure 1.3: The graph of a second order polynomial with two inputs.
The solutions to Equation (1.7) are computed by the quadratic formula,
x = (−b ± √(b² − 4ac)) / (2a),   (1.9)
where the ± can be either + or -. One of the two solutions uses + and the other uses -.
It is also possible that a solution to Equation (1.7) does not exist. In this case b² − 4ac is negative and the square root can not be computed.
Example.
Consider a = 0.303, b = 0.982 and c = −0.552. The solutions to Equation (1.7) are:
x = (−0.982 ± √((0.982)² + 4(0.303)(0.552))) / (2(0.303)).   (1.10)
1.2 Geometry
Figure 1.4 shows a rectangle with two linear dimensions. In this example there is a width and a height. Both of these are lengths, and the units for length are commonly inches, feet, miles, centimeters, meters or kilometers.
The area of a rectangle is the width times the height. For the example in Figure 1.4
the area is,
A = hl. (1.11)
The triangle shown in Figure 1.5 is a right triangle which subtends half of the area
of a rectangle of the same height and width. Therefore the area is
A = ½hl.   (1.12)
Since this is a right triangle the lengths of the sides are related by the Pythagorean
theorem. In this example,
p² = h² + l².   (1.13)
Actually, any triangle subtends half of the area of the enclosing rectangle. Figure
1.6 shows a triangle that does not have any right angles but the area of this triangle is
still half of the area of the enclosing rectangle as in,
A = ½ab.   (1.14)
Figure 1.6: A non-right triangle.
This property is easy enough to demonstrate with Figure 1.7. The original triangle is sections II and III. However, within the enclosing rectangle there are two other triangles with areas equal to the originals. Regions I and II form a rectangle, and since a right triangle covers half of the area of its enclosing rectangle, regions I and II must have equal areas. The same applies to regions III and IV. Thus, the total area of II and III must be the same as the total area of I and IV, and therefore the area of II and III must be half of the area of the enclosing rectangle.
The area of a circle is,
A = πr²,   (1.15)
where r is the radius.
The area is just the outside of the object. The volume includes the interior. The
volume of a cube, such as the one shown in Figure 1.9, is the product of the three linear
dimensions,
V = abc, (1.17)
and the volume of a sphere is,
V = (4/3)πr³.   (1.18)
The volume of the cube shown in Figure 1.9 could also be considered as the area of
one face (A = ab) multiplied by the linear dimension that is perpendicular to the face,
V = Ac = abc. (1.19)
The same logic is applied in computing the volume of a cylinder. This is the area of the circle multiplied by the length of the perpendicular side. The volume of the cylinder shown in Figure 1.10 is,
V = Az = πr²z.   (1.20)
1.3 Trigonometry
1.3.1 Coordinates
Figure 1.11 shows a data point plotted on a graph. There are two common methods of referencing this point. In rectilinear coordinates the point is denoted by the horizontal and vertical distances (x, y). In polar coordinates the point is referenced by its distance to the origin and the angle to the horizontal axis, (r, θ).
There are other coordinate systems as well, but they all have one feature in common.
Since this point is in R2 (two-dimensional space) the representation of this point requires
two numerical values.
1.3.2 Triangles
A triangle is formed from the data point and the origin in Figure 1.11. This is a right triangle, which has several convenient properties. Figure 1.12 displays a right triangle with
side lengths of a, b and c. The Pythagorean theorem relates the length of the sides by
c² = a² + b².   (1.21)
Figure 1.12: A right angle triangle.
The trigonometric functions relate the angle θ in Figure 1.12 to ratios of the side lengths,
sin(θ) = a/c,   (1.22)
cos(θ) = b/c,   (1.23)
and
tan(θ) = a/b.   (1.24)
Likewise, the inverse functions are:
θ = sin⁻¹(a/c),   (1.25)
θ = cos⁻¹(b/c),   (1.26)
and
θ = tan⁻¹(a/b).   (1.27)
Figure 1.13 shows a different triangle that does not have a right angle. The sides and the angles are related by two laws. An angle is related to the sides by the law of cosines,
c² = a² + b² − 2ab cos(γ),   (1.28)
and the law of sines,
a/sin(α) = b/sin(β) = c/sin(γ).   (1.29)
Figure 1.13: A triangle.
1.4 Linear Algebra
A vector is depicted as an arrow from the origin to a designated point in space such as the one shown in Figure 1.14. Numerically, a vector is a one dimensional array of numerical values. A matrix is a two dimensional array of numerical values. This section reviews the basic processes associated with vectors and matrices.
Figure 1.14: A vector.
1.4.1 Elements
The vector in Figure 1.14 can be written as
v = 1x̂ + 3ŷ,
v = 1î + 3ĵ, or
v = ⟨1, 3⟩.
The î and x̂ are the same and just mean that this dimension is in the x, or horizontal, direction. Likewise, ŷ and ĵ are the same and represent the vertical dimension. In the case of a three dimensional vector, either ẑ or k̂ is used.
1.4.2 Length
The length of the vector is the hypotenuse of the right triangle that it forms with the axes. Thus, Equation (1.21), the Pythagorean theorem, is used to compute the length of a vector.
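A minimal sketch in Python, assuming the NumPy package is available, is:

import numpy as np

v = np.array([1.0, 3.0])
length = np.sqrt(np.sum(v**2))   # Pythagorean theorem applied to the elements
print(length)                    # identical to np.linalg.norm(v)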
1.4.3 Addition
The addition of two vectors is simply the addition of the respective elements. Given two vectors w and v, the addition is,
z = w + v,   (1.30)
zᵢ = wᵢ + vᵢ, ∀i = 1, ..., N,   (1.31)
where N is the length of the vectors, and the ∀ symbol means “for all”. Thus, the addition is applied to all of the elements in the vector. Subtraction is similar except that the plus sign is replaced by the minus sign.
Geometrically, the addition is shown as the correct placement of vectors. The addition of two vectors is shown in Figure 1.15. The tail of vector x is placed at the tip of vector w. The summation is z, which starts at the tail of w and ends at the tip of x. Figure 1.16 shows the subtraction z = w − x. The vector x is now reversed in direction since it carries the negative sign. The result is still the vector from the tail of w to the tip of the reversed x.
1.4.4 Multiplication
Vectors can be multiplied in several different ways:
Elemental multiplication,
Inner (dot) product,
Outer product, or
Cross product.
Elemental multiplication multiplies the respective elements of the two vectors,
zᵢ = wᵢxᵢ, ∀i.   (1.32)
The inner product (also called a dot product) creates a scalar value, as
f = w · x,   (1.33)
f = Σᵢ wᵢxᵢ, i = 1, ..., N.   (1.34)
The outer product creates a matrix from the two input vectors,
Mᵢ,ⱼ = wᵢxⱼ.   (1.35)
The cross product creates a vector that is perpendicular to both input vectors,
z = w × x.   (1.36)
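All four products are available in NumPy; a minimal sketch with two hypothetical vectors is:

import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])
print(w * x)           # elemental multiplication, Equation (1.32)
print(np.dot(w, x))    # inner product, Equations (1.33) and (1.34)
print(np.outer(w, x))  # outer product, Equation (1.35)
print(np.cross(w, x))  # cross product, Equation (1.36)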
1.5 Problems
3. What is the value of x³ if x = 3?
4. What is the value of √x if x = 49?
11. If the radius of a circle doubles, does the area also double?
17. Given the triangle in Figure 1.12 with a = 1 and c = 3. What is the angle θ?
18. Given the triangle in Figure 1.12 with a = 1 and b = 3. What is the angle θ?
19. Given the triangle in Figure 1.12 with a = 1 and θ = 30◦ . What is c?
Chapter 2
Scientific Writing
2.1 Content
Most authors understand that the written document needs to follow basic language guide-
lines. The ensuing sections review some of the guidelines that are unique to scientific
writing.
2.1.1 Presentation
Except for rare occasions, scientific documents should be written in the third person. The
author is the observer and not the participant in the experiment. Therefore, the point of
view should not include words such as “I” or “we.”
2.1.2 Figures
Figures are common in documents and there are a few rules that should be heeded. First,
a figure should never be isolated in the document. The text must have a reference to
every figure. Second, the reader should not be required to interpret the figure to draw the
conclusions that the writer wishes to relay. The author must describe why the figure is
important and what is in the figure that proves their contentions.
All figures need a caption and a figure number. This caption is below the figure.
An example is shown in Figure 2.1 which shows the effect after a particular type of mint
is dropped into a particular bottle of soda.
19
Figure 2.1: A delightful experiment with soda and a mint.
2.1.3 Tables
Tables are treated in a similar manner to figures except for the location of the caption
which is at the top of the table. Once again, the text must have a reference to the table, and its content as well as its importance must be discussed. It is improper to state “the
table shows that the experiment is validated.” Instead, the author needs to explain how
the contents of the table validate their point.
An example is shown in Table 2.1 which shows the results from three experiments.
Table 2.1: The results from three experiments.
Experiment   Result
1            3.423
2            6.432
3            9.243
2.1.4 Equations
An equation may be placed inline with the text or centered on its own line with an equation number, as in,
E = mc².   (2.1)
In all cases, the equation is treated as part of the sentence. Thus, if an equation is the
last component of a sentence then a period must follow it.
All variables must be defined near the location where they are first encountered. For
example, in Equation (2.1), m is mass, c is the speed of light, and E is the rest energy of
that mass. Variables are presented in italics both in the equation and in the text. The
major exception is that matrices and tensors tend to be presented in bold, upright fonts.
Units, on the other hand, are presented in upright characters. For example, the mass of
an experimental object is written as m = 1.5 kg.
The derivative symbol, d, in calculus equations is upright as in,
g(x) = df(x)/dx
or
f(x) = ∫ g(x) dx.
2.2 Word Processing
There are several software packages that can be used to create scientific documents. This section highlights the advantages and disadvantages of the different choices.
2.2.1 MS - Word
Microsoft-Word is the most popular program used in writing documents. The disadvantages are:
It is expensive.
Inline equations can not be made to look exactly like centered equations.
2.2.2 LaTeX
The LaTeX program is a layout manager and not a word processor. Using LaTeX basically requires learning a computer language. The advantages are:
It makes professional looking documents. (This book was written using this software.)
It has been around for decades and so there is a massive reserve of libraries to typeset almost anything. There are libraries for IPA (international phonetic alphabet), music, chemical reactions, etc.
Most serious scientific journals prefer LaTeX over MS-Word and do provide a template.
It has a fantastic equation editor. Many websites (including Piazza.com) that allow users to create equations are using LaTeX.
Websites like Overleaf.com allow multiple authors to work simultaneously on the same document.
Excellent management of citations using bibtex. Many journals provide bibtex formatted citations.
The disadvantages are:
Most editors are not WYSIWYG; users must compile documents to see how they will appear.
LaTeX distributions are freely available for every platform:
MS-Windows: MikTex
OSx: MacTex
UNIX: Use the package manager to download the compiler. The Kile editor is very popular.
Since LaTeX is a layout editor, a few lines are required for any document. Code 2.1 shows the basic commands. Line 1 declares the document to be an article. Other options include book, slides, letter, etc. Many other templates are available and most journals or universities provide LaTeX templates. Line 2 is a comment field that is not used in compiling. Line 3 begins the body of the document. Line 4 is the text that is actually seen in the document and Line 5 ends the document. Anything that is added to the file after \end{document} is not considered by the compiler.
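A minimal sketch consistent with that line-by-line description is:

1 \documentclass{article}
2 % This comment is ignored by the compiler.
3 \begin{document}
4 This text appears in the document.
5 \end{document}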
2.2.2.1 Packages
LaTeX has many packages that can be loaded to create the correct type of document. Popular packages are:
fullpage : Allows the document to fill the page with smaller margins
subfigure : Allows for the inclusion of multiple files in a single image in the document.
listings : Tools to include source code from many languages including color coding
and line numbering.
These are usually placed at the top of the document as shown in Code 2.2. There
are thousands of packages freely available that manage various types of typesetting.
23
Code 2.2 Inclusion of packages.
11 \begin{document}
12 ...
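As a sketch, the package inclusion lines at the top of such a file take the form:

\usepackage{fullpage}
\usepackage{subfigure}
\usepackage{listings}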
2.2.2.2 Title
A title is easily created as shown in Code 2.3. The title usually contains the title name,
the author and the date. In this example these three are established in Lines 3, 4 and 5.
Line 8 places the title information at this location in the document with the \maketitle
command.
7 \begin{document}
8 \maketitle
9 ...
The font size is established in line 1. The command \small starts the use of a
smaller font. The command \em creates italic text. The \copyright creates the ©
symbol. The \today command inserts the date when the file is compiled.
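A sketch of such a title block, with a hypothetical title and author, is:

\documentclass[12pt]{article}
\title{\em An Example Title}
\author{\small A. Author}
\date{\copyright\ \today}
\begin{document}
\maketitle
\end{document}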
2.2.2.3 Headings
Headings are easily created using several commands depending on the heading level. Examples are shown in Code 2.4. Line 5 creates a new chapter heading. This book uses the default styles for chapter headings. Chapter headings are available only if the document class is a book. If the document class were an article then line 5 would not be allowed. Line 6 starts a section heading and lines 7 and 8 create subheadings. The headings are automatically numbered, including the chapter number if the document is a book. Line 9 uses the asterisk to suppress heading numbering for this section.
4 \begin{document}
5 \chapter{Chapter Name}
6 \section{Section Name}
7 \subsection{Sub Section Name}
8 \subsubsection{Interior Section Name}
9 \section*{Section without Number}
10
11 ...
2.2.2.4 Cross References
Cross references are links within a document to another location. It is possible to link to a figure, table, equation, heading or other parts of the document. LaTeX uses the \label command to identify locations that can be referenced and \ref to link to that reference. For example, the goal is to create a link in the text that refers to a different section in the document. Line 2 in Code 2.5 creates a section heading and attaches the label se:title1 to it. Later in the document this is referenced as shown in Line 7. When the document compiles, the text will replace the reference with the section heading number.
1 ...
2 \section{Title 1}\label{se:title1}
3 Text inside of this section
4
5 \section{Title 2}
6 Text inside of this section that needs to refer to
7 Section \ref{se:title1}.
8 ...
LaTeX is a two pass compiler. In the first pass the labels are found and stored in an auxiliary file. In the second pass this file is then used to connect to the references in the text. So, it is necessary to compile the document twice to make all of the connections. Some environments, such as MikTex, perform both passes without user intervention. Most other user interfaces require that the user compile the document twice. The presence of two question marks at the location of a reference indicates that a cross reference was not made. This means that the partner label does not exist, there is a typo in either the label or the reference, or the user needs to run the compiler again.
2.2.2.5 Figures
Figures can be added to LaTeX documents in two fashions. The first is to use a package such as Tikz, which allows the user to make drawings with programming commands. While this is an extremely powerful tool, it also has a steep learning curve. The second method is to load an image file that was created through any other means and stored on the hard drive. An image can be inserted using the \includegraphics command. An example is shown in Line 4 of Code 2.6. This has the additional argument of reducing the image size by a factor of 2. This command inserts the image from the file myfile.jpg.
The code shown does more than just insert an image. Line 2 begins a figure region, which is dedicated real estate for this image. It is a floating object and so LaTeX will place it in the optimal location so that there are no large blank regions in the document. The argument [htp] controls this placement, indicating that the figure should be placed here if possible, otherwise at the top of a page, and finally on a separate page of floats. Line 3 will center the figure horizontally on the page. Line 5 creates the caption and Line 6 creates the cross reference label. There are many more options that can be used to place the figure, wrap text around the figure and create subfigures.
1 ...
2 \begin{figure}[htp]
3 \centering
4 \includegraphics[scale=0.5]{mydir/myfile.jpg}
5 \caption{My caption.}
6 \label{mg:myimage}
7 \end{figure}
8 ...
2.2.2.6 Equations
The most powerful feature of LaTeX is the ability to create professional looking equations. The language used in creating equations is the standard in the industry. Many websites now use LaTeX scripting to create equations. Websites such as http://www.sciweavers.org/free-online-latex-equation-editor allow the user to generate equations with pull down menus and see the LaTeX coding. Packages such as MathJax allow websites to generate equations as the user views them.
Inserting an equation is very easy. An inline equation is surrounded by single dollar signs (or \( \)). Centered equations are surrounded by double dollar signs (or \[ \]). Numbered equations use \begin{equation} and \end{equation} as shown in Code 2.7. This equation is
E = mc².   (2.2)
1 ...
2 \begin{equation}\label{eq:emc2}
3 E = m c^2
4 \end{equation}
5 ...
LaTeX will automatically number the equations. For a book document the numbering will also include the chapter number, as does this example.
The library of possible symbols is enormous so only a few items are listed here.
Lower case Greek letters use a backslash and spell out the symbol’s name. Example
\alpha produces α.
Upper case Greek letters use the same method but the first letter of the Greek letter
is capitalized. Example \phi produces φ and \Phi produces Φ.
The capability of LaTeX to create equations is enormous. Beginners will benefit from using the pulldown menus on the Sciweaver website to create equations and, in doing so, learn the LaTeX language.
Table 2.2: My Table
A B C
1 4 4
3 6
2.2.2.7 Tables
There are two keywords used in creating tables. The tabular keyword is used to construct
the grid and contents and the table keyword is used to place the contents in a nice table
perhaps centered on the page with a caption.
An example is shown in Code 2.8. In Line 5 the tabular command is used. Following that is a code that indicates that there are three columns (three letters); the first column is centered, the second is left justified and the last is right justified. The vertical lines indicate that there will be vertical line separators before and after each column. The table begins in Line 6 with \hline. This creates a horizontal line. Line 7 creates the first line of items with each column separated by & and the final entry followed by two backslashes. Line 8 produces a double horizontal line. The following lines finish the table and it is shown in Table 2.2.
1 ...
2 \begin{table}[htp]
3 \centering
4 \caption{My Table} \label{ta:mytable}
5 \begin{tabular}{|c|l|r|}
6 \hline
7 A & B & C \\
8 \hline\hline
9 1 & 4 & 4 \\
10 3 & 6 & \\
11 \hline
12 \end{tabular}
13 \end{table}
14 ...
This may seem to be a very cumbersome method of creating a table compared to the point and click methods used in word processors. However, the truth is just the opposite. If a program is written to generate data that needs to be put into a table, then the program can also be made to generate a text string that is the LaTeX coding for a table. In other words, the user writes a program to make the computations, and it also produces a string such as the text shown in Code 2.8. Then the user can simply copy this string into their LaTeX file. If the user is generating several tables then this method can be considerably faster than placing items into cells, one at a time, by a mouse.
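As a sketch of this idea in Python (the function name and the example data are hypothetical):

def latex_table(rows):
    # Build the LaTeX tabular string for a list of equal-length rows.
    ncols = len(rows[0])
    lines = ['\\begin{tabular}{|' + 'c|' * ncols + '}', '\\hline']
    for row in rows:
        lines.append(' & '.join(str(v) for v in row) + ' \\\\')
    lines.append('\\hline')
    lines.append('\\end{tabular}')
    return '\n'.join(lines)

print(latex_table([['A', 'B', 'C'], [1, 4, 4], [3, 6, '']]))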
2.2.2.8 Bibliography
LaTeX also has a very nice method of generating a bibliography. Citations are placed in a single text file and bibtex is used to generate the citations and their links. Many journals provide bibtex formatted citations on their websites. Code 2.9 shows the entry for a journal article. This file should be named with a .bib extension; for example, the file that contains the citations is named cites.bib. It should be noted that there is no indication of how the citations are to be presented in the document; this file is merely the citation information.
1 @article{Hodgkin52,
2 author = {A. L. Hodgkin and A. F. Huxley},
3 title = {A Quantitative Description of Membrane Current and
4 its Application to Conduction and Excitation in Nerve},
5 journal = {Journal of Physiology},
6 volume = {117},
7 pages = {500--544},
8 year = {1952}
9 }
In the LaTeX document, usually at the end, the bibliography is created. Code 2.10 shows the two lines that are used. The first line indicates the style, which is named alpha in this case. Many other styles are available and some journals even provide a template for their style. The user simply replaces alpha with the desired style. Line 2 actually places the bibliography in the document at this location. The word cites indicates that the information is in a file named cites.bib.
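Consistent with that description, the two lines are the standard bibtex commands:

\bibliographystyle{alpha}
\bibliography{cites}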
Finally, it is necessary to place the reference to the citation in the text. The keyword cite performs this task as shown in Code 2.11. This citation references Hodgkin52, which is the name of the citation from Line 1 in Code 2.9. Only the citations that are cited in the text will be placed in the bibliography. So, it is possible to have a large file with all citations from many projects in the cites.bib file, but only those that have a cite reference will be printed in the back of the document.
Code 2.11 Creating the citation reference.
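A one-line sketch of such a reference, with hypothetical surrounding text, is:

... the membrane current model \cite{Hodgkin52} ...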
There are many citation managers, such as JabRef, which provide an easier interface for entering the citation data.
LaTeX is an exceedingly powerful tool for creating professional documents. The description provided here is merely the tip of the tip of the iceberg.
2.2.3 LibreOffice
LibreOffice provides an office suite at no cost. It is not quite as powerful as MS-Word but
does have advantages of its own.
The advantages are:
It is free.
It has a good equation editor and an add-on will allow for LaTeX equation editing.
It is available for any platform: Windows, OSx or UNIX.
It can read and write MS-Word documents, although complicated documents do not translate without problems.
The disadvantages are:
No journal accepts open documents, although some are accepting PDFs, which LibreOffice can generate.
Some features are missing in the program that makes slides (similar to PowerPoint).
2.2.4 Others
There are other document creation systems available, but they tend to lack the ability to make scientific documents.
2.2.4.1 Google Docs
Google Docs has the advantages of being free and allowing multiple writers to concurrently work on a single document. However, it does very poorly in creating equations and in managing headers, citations and cross references.
2.2.4.2 ABI Word
ABI Word is freely available for all platforms, but it has limited performance. The issues are similar to those in Google Docs.
2.2.4.3 Zoho
Zoho is a cloud based office suite that offers many features but has traditionally been slow to use.
2.2.4.4 WPS
WPS (formerly known as Kingsoft Office) is a Chinese office suite that has the look and feel of MS-Office. It runs on all platforms including smart devices.
Chapter 3
Spreadsheets
Spreadsheets have been a staple in office software for decades. They are excellent tools for organizing data, performing some computations and creating basic graphs. Microsoft Excel and LibreOffice Calc are two spreadsheets that have sufficient tools for the analysis tasks in this text. There are other packages but they tend to lack the ability to create plots and analyze the data therein.
This chapter will review some of the very basic aspects of performing computations
in a spreadsheet. MS-Excel and LO-Calc tend to behave similarly and so the examples
are shown only for MS-Excel.
The basic math operators available in a spreadsheet formula are:
+: Addition
-: Subtraction
*: Multiplication
/: Division
%: Modulus
Figure 3.1: A simple calculation.
Typing in values as in Figure 3.1, though, is not exceedingly useful as any calculator can perform such a function. Spreadsheets become more useful with the ability to reference a value in a cell. Consider the task of adding a value of 8 to the value in another cell. In order to perform this task the formula needs to reference the contents of this other cell.
This example is shown in Figure 3.2. In cell A1 there is a value of 29 and the goal is to
add 8 to this value and place the answer in cell B1. In B1 is the formula =A1 + 8. The A1
is the identity of the first cell and so the contents of that cell are used in the computation
of the value of B1. Once the ENTER key is pressed the value of 37 will appear in the cell
B1. However, if the value of A1 is changed then the value of B1 is automatically changed
to reflect the new computation.
A formula can reference many different cells. An example is shown in Figure 3.3 in
which the computation in cell C1 uses the values in A1 and B1. Again if either of these
values are changed then C1 is automatically updated.
Figure 3.3: Referencing the contents of multiple cells.
A formula with references can be copied to many different cells and the references will
automatically change. Consider the task of creating a list of incrementing values as shown in Figure 3.4. This is a small list, but if the task were to build a list 1000 cells long then typing the values in by hand would be too tedious. A more efficient manner is to use a formula
with a cell reference. In this case the value of 1 is typed into cell A1. Then in cell A2 the
formula = A1 + 1 is typed in. When ENTER is pressed the value of 2 will appear in A2.
The next step is to copy and paste the formula into cell A3. This can be done by either the
copy and paste routine or using the fill down option from the spreadsheet menu. When
the formula is copied in this manner the formula in cell A3 will automatically change to
= A2 + 1. This is called a relative reference.
To copy to multiple cells the user can copy a cell with a formula, paint many cells,
and then paste. The formula will be copied to all of the cells that were painted, and in each one the formula will adjust the cell reference. The second method is to paint all of the cells that are to receive the formula along with the cell that has the formula. In this case the mouse would be used to paint cells A2 to A15. Then the fill-down option (control-D) is used and the formula in A2 will be copied downwards into cells A3 to A15.
As seen in the example, there is a list of incrementing numbers. The cursor is placed
on cell A7 and the formula in the window above column E is shown. To create a column
of 1000 incrementing numbers the only difference is that the user would paint cells A2 to
A1000 before pasting or filling down. If the value in A1 is changed then all of the values
in the column are changed accordingly.
Consider a case that uses the same column A from the previous example and will multiply
every value by 10. A poor example is shown in Figure 3.5 in which the value of 10 is
copied into cells B1 to B15. Now, the task is to multiply the value in column A with the
value in column B. The formula = A1 * B1 is entered into cell C1 and then copied into
cells C2 through C15. The result is as shown and the goal is accomplished.
However, this is not a very efficient manner in which to perform this computation. If, for example, the value of 10 needed to be adjusted to a value of 9.8 then all of the cells in column B would need to be changed. With copy and paste this is not an impossible task, just an annoying one.
A better solution is to use an absolute reference. Consider the example shown in
Figure 3.6. There is only a single entry in the B column and the desire is to have all
formulas in column C reference that single cell.
Figure 3.6: A better way of creating several similar computations.
The dollar sign in a reference means that the reference can not change. Thus a
reference to cell B$1 would prevent the 1 from changing. All formulas in column C would
reference cell B1 as shown in Figure 3.7.
A dollar sign in front of the letter in a cell reference would prevent that from chang-
ing. Thus, $B1 would allow the 1 to change but not the B. Finally, $B$1 would prevent
either the column or the row designation from changing.
3.2.3 Cell Names
While referencing a cell by column and row designation is useful, it is possible to apply a
different name to a cell. Consider the task of computing the distance an object falls. The
equation for this is,
y(t) = ½gt²,   (3.1)
where y(t) is distance fallen as a function of time, t is time, and g is the gravitational
constant. The problem is set up the same way as the previous example. Column A is the
different times in which the computations are made measured in seconds. The gravitational
constant is g = 9.8 m/s2 (meters per second per second) and this value is placed in cell B1.
Before the computations are completed the name of the cell B1 is changed to ‘gravity’.
Above column A there is a window which normally has the designation of the cell such as
‘B1’. The user can override this designation by typing the new name in this window as
shown.
Column C will contain the values of y(t) for each time t in column A. The formula
= 0.5*gravity*A1^2 is typed into cell C1. The designation ‘gravity’ is used instead of
‘B1’. This formula is then copied to all of the cells needed in column C. These values are
the distance that the object has fallen (in meters) for each time in column A.
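For comparison, the same computation is a short Python script; the times here are hypothetical:

gravity = 9.8                                   # m/s^2, the value placed in B1
times = [0.5 * i for i in range(8)]             # times in seconds, as in column A
dist = [0.5 * gravity * t**2 for t in times]    # Equation (3.1) for each time
print(dist)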
Spreadsheets have a plethora of functions that can be applied to the data in the cells.
This section will only review a few of these functions, but users should be aware that the
Figure 3.9: Using the named cell in referenced computations.
library of functions is quite large and should be scanned so that the available functions become familiar to the user.
The SUM function adds up the values in a specified region. An example is shown in
Figure 3.10 which has a column of values from cell B1 to B16. The sum is to be computed
and placed into cell B17. The function is written as =SUM(B1:B16) which adds up all of
the values in the given range. When the ENTER key is pressed then the value of the sum
is shown in cell B17, and if any of the values in the data are changed then the sum is
automatically updated.
The most common statistical computations are the average and the standard deviation. The function for the first is AVERAGE and for the latter is STDEV. For the example, the user would type into cell B18 the formula = AVERAGE(B1:B16) and in cell B19 =STDEV(B1:B16). Again, if the data values are updated then the values of the computations will also be updated.
Consider the task of finding the data values that are greater than the average. The
average has already been computed and so this task merely needs to find the values that
exceed a threshold. This can be accomplished with the IF statement, which has three arguments. The first is the comparison. The second and third are the actions to be taken depending on whether the condition is true or false.
Figure 3.10: Computing the sum of a set of values.
Consider the example shown in Figure 3.12. The statement is constructed in cell
C1. If the value in B1 is greater than the average (which is in cell B18) then a 1 will be
placed in cell C1. If the condition is false then a 0 will be placed in cell C1. In this case,
the dollar sign is used because this function will be copied into cells C2 through C16 and
all will use the value in B18.
Figure 3.13 shows the result after this formula has been copied into cells C1 through
C16. Those cells with a 1 indicate that the corresponding value in the B column is greater
than the average. The formula in cell B17 is copied into C17 to compute the sum of
column C, which is also the number of data values that were greater than the average.
If the data is changed then the average is updated and so are all of the values in the C
column.
As in many cases, there is an easier way. The COUNTIF function will count the
number of cells that are true for a given condition. The example is shown in Figure 3.14
in cell B21. The COUNTIF function has two arguments. The first is the range of data values
to be considered, and the second is the condition, which is in quotes. This will count the number of cells in the range that have a value greater than 4.3125. When the ENTER key is
pressed the count of 6 will appear in the cell.
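The same count in Python is a one-line comprehension; the data values here are hypothetical:

values = [4.1, 5.2, 3.9, 6.0, 2.8, 4.6]
avg = sum(values) / len(values)
count = sum(1 for v in values if v > avg)   # analogous to COUNTIF with the condition "> average"
print(count)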
Figure 3.13: Copying the formula to cells in column C.
3.4 Creating Basic Plots
Spreadsheets do come with the ability to create some types of charts and graphs. This
section will review the methods of creating a line graph and a scatter graph. The spread-
sheets offer several other types of graphs, but as the methods of creating the graphs are
all similar only two types are shown here.
The first example is a simple line plot as shown in Figure 3.15. Data to be plotted is
placed in column A. Then the tab named INSERT is selected (see the top of the image).
The 2-D Line option is selected and a menu appears that has a few selections. In this
case the first one is selected and the chart appears on the screen. The spreadsheet has
automatically determined the range for both axes. There is also a “Chart Title” which
can be changed by double clicking on the Title.
This chart assumes that the data are in order and represent the heights of data points that are equally spaced in the horizontal direction. There are cases in which the user has specific points to plot: a set of (x, y) values in which the values on the x axis are not
equally spaced. For this case a Scatter Plot is used. The example is shown in Figure 3.16.
Each row is a data point that is to be plotted with column A containing the x values and
column B containing the y values. In this case the Scatter Plot choice is selected and
again a menu appears which provides the user with several options. The one chosen here
creates a smooth curve through the data points.
The data does not fill the chart window. The spreadsheet has determined the ranges
for both axes and these may be changed by the user. In Figure 3.17 the horizontal range
is altered. The user double clicks on the horizontal axis and a new menu appears. At the
top of the menu are the choices for the beginning and ending of the horizontal range, and
these can be changed manually. In this case that range is changed and the graph is altered
accordingly.
Components of the chart can be altered by the user usually by double clicking on a
region in the graph. The title can be altered in this fashion. The appearance of the axes
can be altered as shown. The color and markers of the data plot can be altered by double
clicking on the plotted data (see Figure 3.18). The background and grid of the plot can
be changed by double clicking on the graph background.
Spreadsheets such as Excel and Calc have tools to estimate the functional form of a graph.
One tool is called Trendline, which can be used for functions following a basic form (such as a polynomial or log function), and the second is Solver, which can handle much more complicated functions.
Figure 3.17: Altering the x axis.
3.5.1 Trendline
The Trendline tool can estimate the parameters of a function as long as the function
is from a specific selection of formulas. Consider the graph in Figure 3.18 that shows
exponentially increasing data points. A right click on the graphed data will recall a popup
menu. There are several options and the one of interest is labeled “Add Trendline.” When
this is selected a new interface appears like the one shown in Figure 3.19. The first step
is that the user must select the correct functional form. If the data is linear then the user
should select the Linear option. The data in Figure 3.18 is not linear but instead is rapidly
rising as does an exponential function. Therefore, the Exponential option is selected.
At the bottom of this interface are two selections that are quite useful. The next to
last option will display the estimated function and the last option will display a measure
of the goodness of fit. These are both displayed in Figure 3.20. Trendline creates an
exponential function with estimated parameters. In this case, the estimated function is,
y = 10e^(0.0797x).
That is a perfect fit for this data and so R² = 1. If the fit were less than perfect then R² would be less than 1. There is also a thin blue dotted line that plots the estimated function, but as this lies exactly on top of the plotted data it is hard to see.
Figure 3.18: Accessing the Trendline tool.
Figure 3.20: Perfect fit trendline.
A second example is shown in Figure 3.21 which is a similar case except that noise
has been added to the data. Thus, the data is no longer a perfect exponential function.
The Trendline process estimates this data to follow the function,
y = 16.289e^(0.678x).
The R² value is less than 1 but still quite high, indicating that this function fits the data well. The blue dotted line is now visible and it displays the estimated function alongside the actual data (solid line).
Figure 3.19 shows that there are several functional forms available which are expo-
nential, linear, logarithmic, polynomial, power and moving average. The user is responsible
for selecting the correct form to match the behavior of the data. The incorrect selection
will result in a very poor fit.
3.5.2 Solver
Trendline does work well for the functions in the list, but does not work well for more
complicated functions such as a Gaussian (bell curve). For the more complicated functions
Excel and LibreOffice Calc offer a Solver function that can estimate the parameters of a
function that fits the data.
An example fits the data with a Gaussian function of the form,
y = Ae^(−(x−µ)²/(2σ²)),   (3.2)
where A is the amplitude, µ is the x location of the Gaussian peak, and σ controls the width of the peak. For this example A = 1 and so the only two parameters are µ and σ. The raw data is shown in Figure 3.22, which is created by using µ = 3, σ = 0.75, with some random noise added.
In an actual experiment the values of µ and σ are not known and it is the goal of
Solver to determine these two values that best fit this data. Using Solver requires a bit
more set up work than Trendline. A typical use is shown in Figure 3.23 where the raw x
and y values are in the first two columns. There are 70 rows of data and this image
only shows the first few rows. Column C contains the two variables µ and σ in cells C3
and C5 respectively. Initially, these values are not known and they are set to 1. Column
D shows the calculated results using equation (3.2) with the two values of µ and σ from
column C. The equation used in cell D2 is shown in line 1 of Code 3.1. Column E is the
squared error between the measured data (column B) and the calculated data (column
D). The Excel command used in cell E2 is shown in line 2 of Code 3.1. The difference
between the measured and calculated data is squared to remove any negative signs and to
accentuate those cases where the difference is large.
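A sketch of the three formulas in Code 3.1, assuming the cell layout just described and data running from row 2 to row 71, is:

1 =EXP(-((A2-C$3)^2)/(2*C$5^2))
2 =(B2-D2)^2
3 =SUM(E2:E71)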
Figure 3.23: The spreadsheet architecture for Solver.
Initially, this error is large because the correct values for µ and σ are not known.
The final cell is the sum of the errors which is in cell G2. The equation for this cell is
shown in line 3 of Code 3.1. Since all of the squared errors are positive values the only
way that cell G2 can be zero is if all of the squared errors are zero and this occurs if the
calculated and measured data match exactly. Since there is noise in the data a perfect
solution is not possible, so the Solver will attempt to minimize the error in G2 by changing
µ and σ.
It is possible for the user to manually change these values and keep the changes if
the value of G2 is decreased. Basically, Solver will do the same thing in a much faster
manner. The Solver is accessed by clicking on Data in the menu in the upper ribbon and
then Solver in the submenu. Figure 3.24 shows the dialog window that appears. In the Set
Objective window G2 is entered since this is the cell that is to be minimized. Furthermore,
the Min button is selected. The By Changing Variable Cells window contains the cells that are to be altered, in this case cells C3 and C5. Finally, the Solve button
at the bottom of the window is pressed and Solver computes new values for µ and σ.
The computed values are µ = 3.00581 and σ = 0.77699 which are very close to
the values used to generate the data. Had there been no noise then Solver would have
recovered the exact values for µ and σ. The final squared error is 0.140. Since the values
of µ and σ are now changed the values in columns D and E are also changed. Figure 3.25
shows the new values of column D plotted along with the original data. As seen there
is a fairly close match and thus the Gaussian function estimate of the measured data is
complete.
Solver is much better suited for problems that Trendline can not solve. It is im-
portant in each case to make sure that the answer provided by the algorithm matches
the data. The Solver will return an answer but in some cases the answer may not be
sufficiently correct. This is a common issue with these types of algorithms where they can
not home in on a solution or there is something in the data that prevents the algorithm
from finding an acceptable solution. If the solution is insufficient then the user needs to
identify if there are data points that violate mathematical rules (square root of a negative
number, divide by zero, etc.) and remove them. If there are a lot of data points then
another approach in finding an appropriate solution is to perform the curve fit on a subset
of the data.
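The same kind of fit can be scripted in Python; this is a minimal sketch assuming the NumPy and SciPy packages are available, with synthetic data standing in for the spreadsheet columns:

import numpy as np
from scipy.optimize import curve_fit

def gauss(x, mu, sigma):
    # Equation (3.2) with A = 1.
    return np.exp(-(x - mu)**2 / (2 * sigma**2))

x = np.linspace(0, 7, 70)                              # hypothetical x values
y = gauss(x, 3.0, 0.75) + 0.05 * np.random.randn(70)   # noisy synthetic data
params, _ = curve_fit(gauss, x, y, p0=(1.0, 1.0))      # both parameters start at 1
print(params)                                          # estimated mu and sigma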
Figure 3.24: The Solver interface.
Figure 3.25: Plots of the original data and the Solver estimate.
Problems
4. In a spreadsheet cell, convert the angle 45° to radians using the RADIANS function.
5. Create a column of 1000 numbers in which each number is the sum of the two
numbers directly above it. The first two numbers in the column should be 0 and 1.
6. In a spreadsheet column create 1000 random values (using the RAND function). In the 1001st cell compute the average of these values.
7. Using the example in Section 3.2.3, compute the y(t) values for an object on the moon, where the gravity is only 1.68 m/s².
8. The equation for a falling object that has an initial speed of v₀ is y(t) = v₀t + ½gt². Modify the example in Figure 3.9 so that it includes an initial speed of v₀ = 1.3 m/s. Place the value of v₀ in cell B2 and use that cell in the new computations in column C.
10. Create a scatter plot for the following data (0,0.01), (0.2, 0.034), (0.4,0.15), (0.7,0.5),
(0.9,1.0), (1.1,1.3), (1.4,2.0), (1.8,3.15), (2.0, 4.01).
11. Use Trendline to find the function that best fits the data in the previous problem.
Chapter 4
Gene Expression Arrays
Gene expression arrays are biological experiments that can gather information about the
content of a sample for thousands of genes. This data is collected and available as spread-
sheets from the NIH. The experiment used here gathers information about 800 genes
for healthy men and women. This chapter will use the tools in a spreadsheet to analyze
this data.
4.1 Data
A gene expression array is a small plate with samples of hundreds of genes attached in
an array of small spots. A sample with perhaps unknown DNA contents is washed over
the plate, and if the DNA attached to the plate is similar to the DNA in the input then
the two will adhere. The input has a dye attached to it that can be detected through
optical means. The quick description is that if the sample on the plate mates with the
input sample then the sample on the plate will also collect an amount of dye.
These are delicate experiments and so it is difficult to exactly replicate the same
experiment. The solution is that each experiment has two input samples each with different
dyes. Instead of analyzing the amount of dye at each sample spot, the researchers analyze
the ratio of the two dyes.
The data used in this chapter is obtained from https://www.ncbi.nlm.nih.gov/
geo/query/acc.cgi?acc=GSE6553. The code to the right of the colon indicates which
samples are used in the file and in which order. The ‘F’ indicates that the sample was
from a female and the ensuing numerical value is the ID. So, ‘F51’ is a sample from a
particular female. In the first file, the female F51 is the first sample and the male M58 is
the second sample.
GSM151669 : M57 M56
GSM151670 : M56 M55
GSM151671 : M55 F53
GSM151672 : F53 F52
GSM151673 : F52 F51
GSM151674 : F53 F51
GSM151675 : F52 M57
GSM151676 : M55 M58
There are three sections to this file. The first is less than 100 rows and provides
information about the experiment. The second section has a large number of rows with
each row associated with a single sample on the slide. This is the data after analysis
and usually what the investigator would use. However, the intent of this chapter is to
demonstrate how to load and analyze the data. Thus, this chapter will use the third
section of data in the spreadsheet. In the file GSM151667 there are 1600 genes and so
there are 1600 rows of analyzed data and about 50 rows of experiment information. Thus,
the raw data starts on row 1650. This section contains the detected data, and only a few of its columns will be used here.
The first column is the ID number which is unique to each row. The spots on the
plates are arranged in a rectangular array of blocks. Each block contains a rectangular
array of spots. The next four columns identify the vertical and horizontal position of the
block and the vertical and horizontal position of the spots within that block.
Column F shows the name of the gene. Columns G and H show the (x, y) location
of the spot on the original image obtained from the scanner. Column I is the measured
intensity of the spot. This corresponds to the amount of dye on that spot. However, there
is also a background value and this is stored in column J. This is the data for channel 1
which corresponds to the first item listed in the file name. So, for the first file, channel 1
corresponds to female F51. The intensity and background data for channel 2 (male M58) are shown in columns U and V. There are many other columns but they will not be used in this chapter.
The goal is to find genes that are turned on in one channel but not the other. Such a gene is called an expressed gene, and a general rule of thumb is to find those cases in which the intensity value of one channel is twice (or more) as much as in the other channel. However, there are many issues that confound this simple comparison. The dyes do not provide the same illumination for the same sample size, and there are many biological and optical issues that affect data collection. Thus, direct comparison is not readily possible.
The rest of this chapter demonstrates one method of performing the analysis. How-
ever, the first step is to copy the pertinent data to a new page in a spreadsheet. Figure
4.2 shows the new page with the six selected columns copied therein.
4.2 Background
The data is collected through an optical detector but the background is not zero. Further-
more, the background signal is not uniform across the plate that contains the samples.
The machine measures the intensity of each spot but also measures the intensity around
the spot and determines that this is the background signal.
The analysis begins with the subtraction of the background signal from the intensity.
This is repeated for every spot in both channels. There are often a few spots that misbehave, either in the biological process or in the detection process, and the background signal can be higher than the intensity signal. For those few cases the data will be discarded in this analysis.
The subtraction for channel 1 is placed in cell H2 and the command is =IF(C2>D2,C2-D2,1)
which places the subtraction value if the intensity is greater than the background. If that
is not the case then the computation inserts a value of 1. Later in the analysis the log of
the values will be computed and thus the 1 is used here knowing that it will become 0 in
the final steps. The first few rows are shown in Figure 4.3.
4.3 Visualization
Commonly, the two dyes are called red and green because those are the colors that are used
on the computer display to represent them. The dyes used are Cy3 and Cy5 as displayed
in C35 and C36 in the original data file. These dyes have peak responses near 570 nm and
670 nm respectively. In the visible spectrum these are wavelengths of yellow and red, but
green is visibly more pleasing and is used for display.
In this data, channel 1 used Cy3 and channel 2 used Cy5. Each spot of data then
has a green and red value. Figure 4.4 shows the R vs G plot which converts the Cy3 and
Cy5 data to (x, y) points.
There are a few issues with this type of display. The first is that most of the display is blank; when that is the case in a plot, resolution of the data is sacrificed. The second problem is that the data does not correspond well to a 45° line. It is expected that most genes have about the same response, as males and females share many genes. If that is the case then the R and G values should be about the same, and therefore the data should crowd around a 45° line, but it does not.
These issues will be addressed, but for now it is recognized that this is not the best way to display the data. The intensity of a spot with R and G values is now defined as their average, I = (R + G)/2.
Figure 4.5 shows the data for I and R/G. The plot of the data is shown in Figure
4.6. The x axis corresponds to the intensity of the spot and the y axis corresponds to the
ratio R/G.
This graph provides more resolution for the ratio of the responses of the two dyes.
Figure 4.5: The R/G vs I data.
An expressed gene is one in which one channel is at least twice as much as the other. Clearly there are several points that have a vertical value of more than 2. These are the spots in which channel 1 is more than twice as much as channel 2. The reverse, though, is more difficult to see. The cases in which channel 2 is twice as much as channel 1 are those in which the vertical value is less than 1/2. The nature of this graph does not allow those points to be easily seen.
The solution is to compute the log of the values. Consider that log2(2) = 1 and log2(0.5) = −1. In a log graph the ratios become linear values which will display expressed genes equally for either channel. Two new values are defined as

A = log2(I)

and

M = log2(R/G).
The spreadsheet function LOG( v, n ) can receive two values in which v is the input
value and n is the log base. Thus, log2 (x) is written as =LOG(x,2). The values of A and
M are computed in columns N and O and the first few are shown in Figure 4.7. The graph
is shown in Figure 4.8.
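The same two values can be sketched in Python with the math module (the R and G values here are illustrative):

from math import log

R, G = 1250.0, 640.0        # illustrative channel signals
I = (R + G) / 2             # spot intensity
A = log(I, 2)               # log base 2, like =LOG(x,2)
M = log(R / G, 2)
print(A, M)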
The horizontal axis corresponds to the log of the intensity and the vertical axis corresponds to the log of the ratio R/G. Values above 1 and below −1 are now considered as expressed genes.
If the ratio R/G were 1, which is expected for many genes, then the data points would lie at y = 0. However, the majority of the data points are not along this line. Instead, at
lower intensities there is a strong bias above that line. Again, collecting this data is not an
exact science and there are biases. One bias could simply be that the dyes react differently
to the illumination. This bias must be removed from the data before expressed genes can
be identified.
4.4 Normalization
LOESS normalization separates the data into small windows and then subtracts the av-
erage of the window for all of the data within it. For example, the data may be separated
into windows of 50 data points. The leftmost 50 points are the first window. The average
Figure 4.8: The M vs A plot.
of the points within this window are computed, and this average is subtracted from those
points. This will ensure that the average of each window is zero and will remove the
vertical bias that is currently inherent in the data.
There are two steps involved in employing this normalization in a spreadsheet. The first is that the data has to be sorted according to the A value. The second is that the average of a sliding window has to be subtracted from the values.
The first task is to sort the data. This is done in two steps. The first is to copy
three columns of data to a new location in the spreadsheet. This will allow the ability
to rearrange the data without disturbing calculations already performed. In this example there are 1600 genes and thus the calculations in the previous section consume slightly more than 1600 lines in the spreadsheet. The copied data needs to be at least 25 rows
below the last row of data. In this example, the data is placed in row 1630. Three columns
of data are needed. These are the gene number, the A and the M data that were just
calculated. This data is sorted on the A data and a portion of that is shown in Figure 4.9.
The gene number is required in order to resort the data in a later step.
The second step of the LOESS normalization is to divide the data into windows
of 50 points. This is a little time consuming in a spreadsheet and so the algorithm is
modified slightly. For each value a window of 50 points will be considered, but this is a
sliding window. The 100-th data point in this case is on row 1731 in the spreadsheet. The
window of 50 points will be the 25 points before and after. So, for this point the average
is calculated from rows 1706 to 1756.
The reason that there are at least 25 empty rows above this data is to make it easier
to perform this computation in the spreadsheet. The first row of data is on line 1631 and
Figure 4.9: Sorted data.
the average needs to be computed for the 25 points before and after this. However, there
are no points before this. The spreadsheet calculation of an average will not include cells
if they are empty, and so the calculation of the average for the 25 rows before and after
this first data point will not use the 25 rows before in computing the average. Thus, the
same equation can be used for all rows.
The result is shown in Figure 4.10. The equation placed in cell E1631 is =AVERAGE(C1606:C1656)
and the equation placed in cell F1631 is =C1631-E1631. The value in E1631 is the average
of the 25 rows before and after row 1631. The value in cell F1631 is the value of M with
the average subtracted.
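A sliding-window version of this computation can be sketched in Python, with the window clipped at the edges just as the spreadsheet ignores the empty rows (the M values are illustrative):

m_vals = [0.8, 0.7, 0.9, 0.2, 0.1, -0.1, 0.0]  # M values sorted by A (illustrative)

def window_mean(vals, i, half=25):
    lo = max(0, i - half)                # missing neighbors are excluded,
    hi = min(len(vals), i + half + 1)    # like empty spreadsheet cells
    window = vals[lo:hi]
    return sum(window) / len(window)

normalized = [m - window_mean(m_vals, i) for i, m in enumerate(m_vals)]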
Now the average data falls along y = 0. Genes expressed in channel 1 have a value
of y > 1 and genes expressed in channel 2 have a value of y < −1.
As seen, there are spurious points, usually at low intensities. Recall that x = 0 corresponds to the case in which the original intensity is the same as the background. Some
researchers simply discard the spurious data points since they occur at very low intensities
with the belief that it is not possible to detect them accurately or that something has
gone wrong with the spot on the plate. However, there are arguments that there is still
information within these points and discarding them may be throwing away important
information.
In any case, the LOESS normalization has removed the bias and now it is possible
to find the expressed genes.
Figure 4.11: Plot of the data with the average removed.
In this data set there are multiple files and finding expressed genes should consider all
pertinent trials. Consider the question of finding the genes that are expressed by males but not by females. In this case, only the files that had both a male and a female should be used. For this question there are only three qualified files in the set.
The first file has the male information in the second channel, and thus expressed genes would have a value of less than −1. The second file has the male in the first channel, and thus expressed genes should have a value of 1 or greater.
However, in comparing multiple files it must be considered that there are differences
in the experiments that will bias and scale the data.
The process begins with collecting the data. The process of the previous sections is
applied to all files that will be used. Each file is processed to obtain normalized data such
as in Figures 4.9 and 4.10. One of the issues is that this data is sorted differently for each
file and so it is necessary to resort the data according to the gene number.
Figure 4.12 shows part of this data: the data from the files after LOESS normalization, sorted again according to the gene number.
Figure 4.12: A partial view of data from all of the files after LOESS normalization.
Below each column of data the average and standard deviation are computed. The
first values of the first three files are shown in Figure 4.13. Most of the averages are
similar but the standard deviations are not. This means that each experiment had different
sensitivities.
Figure 4.13: The average and standard deviation of the first three files.
The process is to first subtract the average from each experiment. So, the average
of each column is subtracted from the values of that column. The equation in cell B1606
is =B2-B$1603. This is copied for all the files to the right and 1600 rows down to include
all of the genes. The first few values from the first three files are shown in Figure 4.14.
Subtracting the average will not alter the standard deviation. Thus, each file still
has a different range of sensitivity. Since most of the genes are not differentially expressed
it is expected that the standard deviations of the experiments should be the same. To
accomplish this, each value in an experiment is divided by the value of the standard
deviation. This is shown in Figure 4.15 for the first few rows of the first three files. The
formula in cell B3209 is =B1606/B$3207. Again this formula is copied to the right for each
file and copied down for each gene.
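A sketch of this standardization in Python, with illustrative values for one file, is:

data = [0.42, -0.10, 1.35, -0.87, 0.22]   # one file's M values (illustrative)
n = len(data)
mean = sum(data) / n
# population standard deviation; a spreadsheet STDEV divides by n-1 instead
std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
standardized = [(x - mean) / std for x in data]  # zero mean, unit deviation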
Now, each file has the offset and bias removed, allowing the files to be compared to one another. It is now possible to pursue a question such as: which genes are expressed in males but not in females?
Figure 4.15: The data after division by the standard deviation.
In this case the search is for genes expressed in males but not females. Since the first file put the male in the second channel, the search is to find values in that column that are less than -1. The search also wants values in the second column greater than 1 and values less than -1 in the third column.
A partial result is shown in Figure 4.17. The formula in cell G2 is =IF(C2<-1,1,0)
which tests for the values of less than -1 in column C. If this is True then the value of 1
appears in the cell. The other two values have appropriate tests in columns H and I.
In a perfect world, an expressed gene would appear as 1's in all three columns. In the real world, it is expected that the results will not be so clear. Column J sums the three testing columns, and any value of 2 or 3 can be considered as an indication of an expressed gene.
In this experiment, each gene had two spots on the plate. Notice that each gene name is repeated. So, the condition for an expressed gene is that it must be expressed in at least two of the files for each sample of the gene. Basically, the sum in column J must be 2 or 3 for both instances of the gene.
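The three tests and the column J sum can be sketched in Python; the channel placement follows the description above, and the rows of M values are illustrative:

# male channel: file 1 -> channel 2 (< -1), file 2 -> channel 1 (> 1),
# file 3 -> channel 2 (< -1); the rows of M values are illustrative
rows = [(-1.4, 1.2, -0.9), (0.1, 0.3, -0.2)]
for f1, f2, f3 in rows:
    score = (f1 < -1) + (f2 > 1) + (f3 < -1)   # booleans add as 0 or 1
    if score >= 2:
        print('candidate expressed gene:', (f1, f2, f3))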
There are 1600 rows and so this can be a tedious process. One solution is to use
conditional formatting. This will automatically change the format of cells depending on
the value in those cells. In turn, this will make it easier for the human viewer to spot the
few genes of interest.
Figure 4.18 shows the manner in which conditional formatting is accessed through
LibreOffice. The data that is to be formatted is painted and the user selects Format:
Conditional Formatting: Condition.
The selection creates a pop up window which is in the background of Figure 4.19.
The condition is set to change the formatting if the value is 2 or greater. A new formatting
style is selected and this produces the foreground pop up window which allows the user
to select font, size, and color. In this case the selection is to change the background color
to yellow for the cases where the condition is true.
A small portion of the file is shown in Figure 4.20. There are many locations in
which a cell is painted yellow, but that only occurs for one instance of a gene. In this
figure, there are two genes in which column J is 2 or 3. These are genes to be considered
as expressed in men and not women.
Well, this is a small test and of the three expressed genes none scored a value of 3
in column J. The genes of interest are:
Figure 4.19: Changing the format.
protein phosphatase 5, catalytic subunit
intercellular adhesion molecule 1 (CD54), human rhinovirus receptor
phospholipase C, beta 2
The NIH contains records for each gene and neither of the last two had any mention
of gender preference. The first one did mention that “elevated levels of this protein may be
associated with breast cancer development” and so the absence of this gene is preferred.[?]
This test was small in size with only a few participants and there was no guarantee that
there were male specific genes in the plates.
Part II
Chapter 5
Python Installation
Python is one of the fastest growing languages available and is pervasive in all fields
of science. The language is free to obtain and is one of the easiest languages to learn,
particularly for users that have very little programming experience.
The most important feature of Python, though, is that it is a very powerful language
that can perform a wide variety of tasks.
There are two versions of Python that are being used. Version 2.7 is the last in
the version 2 series and has a complete set of tools and is being widely used. Version 3.x
(where the x is still changing) is a newer version that is very slowly replacing 2.7. The
toolset is still catching up to 2.7.
This book uses Version 3.x and attempts to note the differences in places where 2.7 differs.
5.1 Repository
The main website is http://python.org. However, other repositories provide a large set
of third party tools that do not accompany the original installation. This section quickly
reviews methods to install Python on popular platforms.
5.1.1 Windows
Windows users should go to http://scipy.org and get one of the packages that is listed.
These packages tend to have a very large set of third party tools, some of which will be
used here. A no-cost repository will be more than sufficient for the work in this book.
Windows users still have the choice between 32 or 64 bit installations. If the user’s
computer is a 32 bit computer then that is the installation that must be used. A 32 bit
version of Python is sufficient for the work in this book, but will limit the user to only
4 GB of workable memory. Thus, if the user is planning on using Python for large scale
problems, then a 64 bit installation may be more appropriate.
5.1.2 MAC
MACs should come with Python installed. However, this installation may not have an
integrated development environment (IDE) or other required modules such as numpy.
Installation of third party components is possible if pip is installed on the computer.
Once pip is installed then the commands in Code 5.1 will install the third party modules
that are required. These commands are executed in the OS X terminal.
If pip is not installed then the commands in Code 5.2 may be used to install the
required modules.
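Assuming pip is available, the installs would look something like the following (the package names numpy, scipy, and pillow are assumptions based on the capabilities described below):

pip install numpy
pip install scipy
pip install pillow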
These three installations add onto Python the ability to manipulate vectors and
matrices. They also add a large scientific computing library and the ability to read and
write most types of images.
5.1.3 UNIX
Flavors of Linux should have Python installed. Some even have both versions of Python. However, installations may not include an integrated development environment (IDE) or other required modules. Use the Linux software manager to get the additional modules as needed.
Users need to install the same additions described above for the Mac: numpy, scipy, and an image handling library.
5.2 Setting up a Directory Structure
Figure 5.2 shows the content of a homework folder named HW1; inside its subfolders the appropriate files can be stored. There is a folder for the data and another for documents. The folder pysrc is where the Python source code will be contained. The folder results is where the results from the computation would be stored.
Failure to create these directories will eventually lead to lost files. Consider a case
where a student has several homework assignments and he decides to name the file created
by his programs by the name output.txt. If the student is using a single directory then
it becomes quite easy for one homework assignment to erase the results from a previous
assignment. Furthermore, when it is time to turn in the files for the assignment it is
possible to turn in files from a different assignment since all of the files are in the same
directory.
5.3 Online Python
There are several online Python resources. Most of these have the basic Python installation but do not have all of the capabilities that we need.
So far, only one resource has been identified for this course that has the following components:
Numpy
Scipy
Chapter 6
Python is a very powerful language that can perform an extensive variety of tasks. Only
those tasks pertinent to computations for biological applications are reviewed here. The
reader should be quite aware that Python is a far more extensive language than this book
presents.
6.1 Comments
Comments can be inserted into Python code with the # sign. All text following this sign
is not read by the Python interpreter. Most Python editors will color code a comment
line. An example is shown in line 1 of Code 6.1 in which everything to the right of the #
symbol is ignored by Python.
Comments are purely for the human to keep notes on what a program is doing or
what variables mean. Comments can consume several lines but each line must start with
#. Comments are highly recommended particularly if the script will be read by other
users or will be used for multiple purposes. It is easy to understand what a script is doing
just after it is written, but two years later the programmer may have forgotten the purpose
of some of the lines of code. Comments will help jog the memory.
As a rule: you don’t have enough comments in your script.
There are two main types of data: numerical and characters. This section will review
methods by which Python can represent numbers.
6.2.1 Assignment
Python, like all programming languages, has variables which can adopt a numerical or
string value. In Python the assignment of a numerical value is quite easy as shown in
Code 6.1. The variable name can have several letters and even numbers (as long as the
number is not the first character) as seen in line 2. Capital letters are treated as being different from lowercase letters. Printing a value to the console is performed using the print function as shown in the last lines of Code 6.1.
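A session of the kind Code 6.1 describes might look like this (the variable names are illustrative, with a = 5 so that the sums in Code 6.2 hold):

>>> a = 5          # everything right of the # sign is a comment
>>> b2 = 7         # digits may appear after the first character
>>> print(a)
5
>>> print(b2)
7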
Variables can be used in mathematical computations as shown in Code 6.2. Python uses
the standard math symbols as shown in Table 6.1.
1 >>> abc = 10
2 >>> a + abc
3 15
4 >>> a * abc
5 50
Table 6.1: The standard math symbols.
Function Symbol
Addition +
Subtraction −
Multiplication *
Division /
Power **
Modulus %
Numerical data can be stored in several formats. Most languages offer the ability to
store integer values or floating point values. The precision of these can also be specified.
Python does have a complex data type which is not very common among other languages.
In early computers with small amounts of memory the designation of precision was impor-
tant. In today’s modern 64-bit computers this designation is no longer a concern. Some
of the data types in Python are int, float, complex, and bool.
Very large values are presented in scientific notation. An example of scientific notation is 42300 = 4.23 × 10⁴. Computer languages use the 'e' or 'E' symbol to denote the exponent value. So, the number 42300 in Python can be entered as shown in Code 6.3.
1 >>> 423e2
2 42300.0
Complex numbers are represented in engineering notation where j = √−1. Line 1 in Code 6.4 creates a complex value. The real and imaginary parts are extracted as shown in the code.
1 >>> g = 3 + 1j
2 >>> g
3 (3+1j)
4 >>> g.real
5 3.0
6 >>> g.imag
7 1.0
Converting from one type to another requires the use of a keyword such as int,
float, complex, etc. It should be noted that the int typecast will simply eliminate the
decimal part of the number. In order to compute the rounded value the round function
can be used. These conversions are shown in Code 6.5.
Errors can occur in rounding if the variable is exactly halfway between two integers. Consider line 1 in Code 6.6 where the value of 4.5 is correctly rounded to 5. This was performed in Python 2.7. Line 3 shows the same operation in Python 3.x, and as seen the
Code 6.5 Type conversion.
1 >>> float ( 5 )
2 5.0
3 >>> int ( 6.7 )
4 6
5 >>> round ( 6.7 )
6 7.0
rounding function rounded to 4. Python 3.x rounds a half to the nearest even integer (banker's rounding), so this is documented behavior rather than a bug. Line 5 shows the case in which a very tiny bit is added to get the rounding function to round the value up to 5.
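A short Python 3.x session shows the behavior:

>>> round(4.5)          # Python 3.x rounds a half to the nearest even integer
4
>>> round(5.5)
6
>>> round(4.5 + 0.0001) # a tiny addition forces the round up
5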
The result of a computation tends to return a value whose data type is the same as
the most complicated data type in the computation. For example, if an integer is added
to a float then a float is returned. If a float is added to a complex number then a complex
number is returned.
The exception to this rule is integer division in Python 3.x. Integer division in
Python 2.7 returns an integer as shown in the first two lines of Code 6.7. Thus a division
such as 8/9 would return a 0. Python 3.x behaves differently and returns a floating point
value as seen in the last two lines.
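A short session illustrates both behaviors in Python 3.x:

>>> 1 + 2.5       # int + float returns a float
3.5
>>> 1.5 + 2j      # float + complex returns a complex
(1.5+2j)
>>> 8 / 9         # division of two integers returns a float in 3.x
0.8888888888888888
>>> 8 // 9        # floor division reproduces the 2.7 behavior
0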
Python evaluates operators in a standard order of precedence:
1. Power,
2. Multiplication, division, and modulus,
3. Addition and subtraction.
Consider Code 6.8 which shows a simple computation in line 1. If the process is
done in the order shown then 2 + 5 is 7 and that multiplied by 3 is 21. However, the
answer is 17. The reason is that the multiplications are performed before the additions.
1 >>> 2 + 5 * 3
2 17
3 >>> (2+5) * 3
4 21
Users can control which calculations are performed first by enclosing them in parentheses. Line 3 shows this by enclosing the 2 + 5 in parentheses and thus this is performed before the multiplication.
Python does come with a math module that contains basic functions. Code 6.9 shows the
import statement in line 1 that will read in all of the math functions. Not all of them are
shown here.
To raise a number to a power, such as x^y, the pow function is used as shown in line 2. This will perform 3^4 which produces the answer of 81. The opposite function is the square root which is called by sqrt as shown in line 4. This computation could also be performed with the pow function as shown in line 6. In fact, a function such as a cube root can be computed with the pow function by using 1/3 as the second argument.
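A session consistent with this description (assuming the math functions were imported with from math import *) is:

1 >>> from math import *
2 >>> pow(3, 4)
3 81.0
4 >>> sqrt(81)
5 9.0
6 >>> pow(81, 0.5)
7 9.0
8 >>> pow(8, 1/3)    # the cube root
9 2.0
10 >>> hypot(3, 4)
11 5.0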
The last function shown is hypot, which computes the hypotenuse of a right triangle with the lengths of the two sides as the arguments.
This module also contains several trigonometric functions such as sine, cosine, and tangent, along with their inverse functions and the hyperbolic versions of all of them. Code 6.10 shows
a simple example of computing the sine of the angle π/2. Like most computer languages
the computation assumes that the input argument is in radians and not degrees. However,
the module provides two conversion functions radians and degrees. Line 3 shows the
conversion of an angle in degrees to radians before the sine is computed.
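A session consistent with this description is:

1 >>> sin(pi / 2)
2 1.0
3 >>> sin(radians(90))
4 1.0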
Code 6.11 Exponential functions.
1 >>> e
2 2.718281828459045
3 >>> exp(1)
4 2.718281828459045
5 >>> log(100)
6 4.605170185988092
7 >>> log10(100)
8 2.0
9 >>> log2(100)
10 6.643856189774724
Python offers four methods of collecting items. Some of these collection types are unique to Python and are not found in many other languages:
tuple
list
dictionary
set
6.3.1 Tuple
A tuple is a collection of items that can not be changed. The tuple can contain almost any type of data such as floats, strings, other tuples, etc. A tuple is encased in parentheses as shown in Code 6.12. The following lines create a tuple and then print the contents to the screen.
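A tuple consistent with the retrievals shown in Codes 6.13 through 6.16 is created and printed as:

>>> a = (2, 4, 'howdy', 5, 'CDS 130 Rocks')
>>> a
(2, 4, 'howdy', 5, 'CDS 130 Rocks')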
To get single items from a tuple, square brackets are used as shown in Code 6.13.
The number inside of the square brackets is the item number from the tuple. Python,
like C and Java, starts counting at 0. Yes, it is weird, but it does make sense if one
understands how computers point to data in the memory. Anyway, the first item in the
tuple is retrieved in the first line of code. The second and third items are retrieved in
subsequent lines.
The last item in a tuple is retrieved with a negative index, as shown in line 1 of Code 6.14. Line 2 shows the retrieval of the next to the last item.
It is possible to get several consecutive items as shown in Code 6.15. Now there are
two numbers inside of the square brackets. The first is the starting point and the second is
the ending point. However, it should be noted that the returned data includes the starting
point but excludes the ending point. This command retrieves items a[0], a[1], and a[2]
but does not retrieve a[3].
Code 6.13 Accessing elements in a tuple.
1 >>> a [0]
2 2
3 >>> a [1]
4 4
5 >>> a [2]
6 ' howdy '
1 >>> a [-1]
2 ' CDS 130 Rocks '
3 >>> a [-2]
4 5
Some more examples are shown in Code 6.16. Line 1 retrieves the second, third and fourth items. Line 3 is the same as line 1 in the previous code; the 0 is not necessary in this case. Line 5 gets the last two items.
6.3.2 List
A tuple can not be altered. A list is similar to a tuple except that it can be altered. A list uses square brackets. Line 1 in Code 6.17 creates a list with four items in it. It should be noted that the last item in the list is the tuple defined above.
An item in a list can be replaced. Line 1 in Code 6.18 changes the first item in the
list.
A list can grow. The append command will attach a new item onto the end of the list as shown in Code 6.19.
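A session combining these operations (the initial first item and the appended string are illustrative) is:

>>> b = [1, 4.6, 'Hello All', a]   # the original first item is illustrative
>>> b[0] = -1                      # replace an item, as in Code 6.18
>>> b.append('one more')           # grow the list, as in Code 6.19
>>> b
[-1, 4.6, 'Hello All', (2, 4, 'howdy', 5, 'CDS 130 Rocks'), 'one more']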
6.3.3 Dictionary
A dictionary is similar to a hash table in other languages. The idea is similar to a word
dictionary which contains thousands of entries. Each entry is a word and its definition.
However, a person can only search on the word and can not do a search on the definition.
A dictionary uses curly braces. Line 1 in Code 6.20 creates an empty dictionary and
line 2 creates the first entry in the dictionary. The key is the item in the square brackets
and the value is the item(s) to the right of the equals sign.
The key can be an integer, a float, a tuple or a string. Line 3 in Code 6.20 uses a
string as the key and the value is a tuple. To retrieve an item from a dictionary the key
Code 6.15 Accessing consecutive elements in a tuple.
1 >>> a [0:3]
2 (2 , 4 , ' howdy ' )
1 >>> a [1:4]
2 (4 , ' howdy ' , 5)
3 >>> a [:3]
4 (2 , 4 , ' howdy ' )
5 >>> a [-2:]
6 (5 , ' CDS 130 Rocks ' )
1 >>> b[0] = -1
2 >>> b
3 [-1, 4.6, 'Hello All', (2, 4, 'howdy', 5, 'CDS 130 Rocks')]
1 >>> a = { }
2 >>> a [0] = ' my data '
3 >>> a [ ' John ' ] = ( ' 507 Main ' , ' Cincinnati ' , ' Ohio ' )
is used as shown in Code 6.21.
1 >>> a [0]
2 ' my data '
3 >>> a [ ' John ' ]
4 ( ' 507 Main ' , ' Cincinnati ' , ' Ohio ' )
6.3.4 Set
A set is just like the sets that were studied in elementary school. It is possible to perform
intersections and unions as shown in Code 6.22.
1 >>> c = set((1, 2, 3))
2 >>> d = set((3, 4, 5))
3
4 >>> c.union(d)
5 set([1, 2, 3, 4, 5])
6
7 >>> c.intersection(d)
8 set([3])
9
10 >>> g = (1, 2, 3, 4, 3, 2, 1, 2, 3)
11 >>> set(g)
12 set([1, 2, 3, 4])
6.3.5 Slicing
Slicing is the term used for the extraction of part of the information from a tuple, list, etc.
Line 1 in Code 6.21 is an example. Do further demonstrate slicing techniques consider
the tuple defined in Code 6.23. The number of items is extracted by the len command as
shown in line 3.
Several examples are shown in Code 6.24. Line 1 shows the retrieval of the first
item. Line 3 shows the retrieval of the second item. Line 5 shows the retrieval of the last
item. Line 7 shows the method by which the first 4 items are retrieved. Line 7 and 9 are
equivalent. Finally, the last five items are obtained using the command in line 11.
Further examples are shown in Code 6.25. Line 1 obtains every other item in the
tuple. It starts at location 0, ends at location 20, and steps 2 items. The latter number
Code 6.23 Length of a collection.
1 >>> a = (5, 6, 7, 4, 2, 4, 6, 'string', 'snow days', 3, 1, -1, 'GMU', 5, 6, 7, 8, 9, 0, 0)
2 >>> len(a)
3 20

Code 6.24 Retrieving single and multiple items.
1 >>> a [0]
2 5
3 >>> a [1]
4 6
5 >>> a [-1]
6 0
7 >>> a [:4]
8 (5 , 6 , 7 , 4)
9 >>> a [0:4]
10 (5 , 6 , 7 , 4)
11 >>> a [-5:]
12 (7 , 8 , 9 , 0 , 0)
indicates that it is getting every second item. Line 3 performs the same extraction for the
entire tuple.
Line 5 starts with the last item and ends with the first item. The step value is -1,
so this is stepping backwards through the tuple. It is getting items in reverse order. Line
8 performs the same function for the entire tuple.
Consider the case in Code 6.26 in which a tuple named a is inserted into a tuple
named b. The third item in b is obtained by line 3. This is the entire tuple a. To get
individual components from the inner tuple, two slicing components are required as shown
in line 5. Here b[2] is an entire tuple and thus (b[2])[-1] is the last item in that inner
tuple.
Code 6.27 shows a slightly different process in which the tuple a is inserted into a list named c. Subsequent lines append another item to the list and insert a string at position 1.
Items can be removed from a list in two ways. The first is to use the pop function
shown in Code 6.28. This will remove the item from the list and a variable can be assigned
the value that is removed. The argument to the function is the location of the data that
is to be removed. So, pop(0) means that the first item will be removed from the list and
the variable g will become that first item.
The second method, shown in Code 6.29, uses the remove function. This will remove an item from the list, but the argument must be the data that is to be removed.
Code 6.25 More slicing examples.
1 >>> a[0:20:2]
2 (5, 7, 2, 6, 'snow days', 1, 'GMU', 6, 8, 0)
3 >>> a[::2]
4 (5, 7, 2, 6, 'snow days', 1, 'GMU', 6, 8, 0)
5 >>> a[20:0:-1]
6 (0, 0, 9, 8, 7, 6, 5, 'GMU', -1, 1, 3, 'snow days', 'string',
7 6, 4, 2, 4, 7, 6)
8 >>> a[::-1]
9 (0, 0, 9, 8, 7, 6, 5, 'GMU', -1, 1, 3, 'snow days', 'string',
10 6, 4, 2, 4, 7, 6, 5)
1 >>> a = (1, 2, 3)
2 >>> b = ('hi', 'hello', a, 'guten tag')
3 >>> b[2]
4 (1, 2, 3)
5 >>> b[2][-1]
6 3
If there are two instances of the data then only the first one is removed. For example, if
the list c had two strings “guten tag” then the function in line 1 would have to be called
twice to remove them both.
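Illustrative uses of both functions (the list contents are assumptions) are:

>>> c = [1, 'hi', 'guten tag', 'hello']   # an illustrative list
>>> g = c.pop(0)            # remove by position; the removed item is returned
>>> g
1
>>> c.remove('guten tag')   # remove by value (first occurrence only)
>>> c
['hi', 'hello']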
6.4 Strings
In some cases the available data is represented as characters rather than numerals. For
example, DNA is represented as a string of letters. These long strings are then analyzed
by algorithms. Thus, it is necessary to understand how strings are managed within a
computer program.
A string is a collection of characters. Strings can be defined by using either single quotes
or double quotes as shown in Code 6.30.
Extracting characters from a string is performed through slicing, using the same rules as slicing in a tuple or list. A few examples are shown in Code 6.31. Line 1 retrieves the first item, line 3 retrieves the first 7 items, and line 5 retrieves the string in reverse order.
6.4.1.1 Special Characters
Code 6.32 shows a string in line 1 that has a \t and a \n. The first is the tab character
and the latter is the newline character. When using the print function the function of
these characters are shown.
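For example, with an illustrative string:

>>> s = 'one\ttwo\nthree'
>>> s
'one\ttwo\nthree'
>>> print(s)
one	two
three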
6.4.1.2 Concatenation
A string can not be changed, but it is possible to create a new string from the concatenation
of two strings. An example is shown in Code 6.33. Two strings are created and in line 3
the plus sign is used to create a new string from the two older strings.
6.4.2 Functions
Several functions are defined to manipulate strings and return information about their
contents. Only the functions used in the subsequent chapters are reviewed here.
The find command will find the location of a substring within a string. Three
examples are shown in Code 6.35. Line 1 finds the first occurrence of “is” in string st1.
The function returns 2 which means that the target is found starting at st1[2]. There
are two instances of “is” inside of st1 and this function only returns the first instance.
Line 5 starts the search at position 3 which is after the location of the first occurrence of
“is.” Thus, it finds the second occurrence which is at position 5. Line 9 starts the search
after position 5 and the return is a -1. This indicates that the search found no occurrence
of “is” from the given starting point.
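Any string with "is" at positions 2 and 5 reproduces the returns described above, for example:

>>> st1 = 'This is a test'   # an assumed string; 'is' occurs at 2 and 5
>>> st1.find('is')
2
>>> st1.find('is', 3)
5
>>> st1.find('is', 6)
-1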
The count function counts the number of occurrences of a target string. An example
is shown in line 1 of Code 6.36. The rfind function performs a reverse search, or in other
words finds the last occurrence of the target as seen in lines 3 and 4.
The case of a string can be forced by the commands upper and lower as shown in
Code 6.37.
The split function will split a string into substrings. This is shown in lines 1 through
3 in Code 6.38. Line 3 shows the result which is a list of strings. The string st1 was split
on a blank space and this blank space is not in any of the substrings in line 3. It is possible
to split on any character (or characters) by placing that character(s) as the argument to
Code 6.37 Conversion to upper or lower case.
the function. An example is shown in line 10. Notice that the splitting argument is not
included in any of the strings. The answer list has an empty string because there was
nothing between the first two instances of “is”.
The join function is the opposite of split. The first example is shown in line 4. The string 'X' is the glue, and as seen in line 6 the join function created a single string that consists of all of the strings in the list glued together by the string that was in front of the join command. Line 7 shows the second example; this time the glue is an empty string and, as seen in line 9, all of the strings from the list are put together with nothing in between them.
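Illustrative examples of both functions (the strings are assumptions, not the ones used in Code 6.38) are:

>>> st2 = 'isis rising'      # illustrative; note the empty strings in the result
>>> st2.split('is')
['', '', ' r', 'ing']
>>> parts = ['ACT', 'GAT', 'TAC']
>>> 'X'.join(parts)
'ACTXGATXTAC'
>>> ''.join(parts)
'ACTGATTAC'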
It is possible to replace a substring with another using the replace function. Consider
Code 6.39 which starts with the definition of a DNA string in lines 1 and 2. All lowercase
‘a’s are replaced by uppercase in line 3. The result is shown in line 5, and it is possible to
replace more than just single characters as shown in lines 7 through 10.
DNA is a double helix structure and the complement of one helix is contained on the
other strand. To create the complement string the ‘a’ and ‘t’ characters are exchanged.
So, a ‘t’ is located wherever there is an ‘a’ in the original sequence. The letters ‘c’ and ‘g’
are also swapped. Finally, the complement is in reversed order of the original.
Code 6.39 Using the replace function.
A swap requires three steps. It is not possible to just convert all 'a's to 't's because
the new string would have both the new and old ‘t’s. So, it is necessary to convert the
‘a’s to something other than the letters contained in the string. This was accomplished
in line 3 in Code 6.39. That was the first of the three steps. The next two are shown in
Code 6.40 where the ‘t’s are converted to ‘a’s and then the old ‘a’s are converted to ‘t’s.
Code 6.40 Creating a complement string.
The output in lines 4 and 5 show a string where the ‘a’s and ‘t’s have been swapped.
The same process then needs to be applied to swap the ‘c’s and ‘g’s. This is performed
in lines 6 through 8. Finally, the string is reversed in line 9 to finish the creation of the
complement DNA string.
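The three-step swap works the same on any sequence; with an illustrative string:

>>> dna = 'atcgttaagc'            # an illustrative sequence
>>> s = dna.replace('a', 'A')     # step 1: protect the a's
>>> s = s.replace('t', 'a')       # step 2: old t's become a's
>>> s = s.replace('A', 't')       # step 3: protected a's become t's
>>> s
'tacgaattgc'
>>> s = s.replace('c', 'C').replace('g', 'c').replace('C', 'g')
>>> s[::-1]                       # reverse to finish the complement
'gcttaacgat'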
In the previous section there were only two types of swaps that needed to be performed. Other applications may require a much larger array of substitutions, and for those the previous method will be cumbersome. A more efficient method uses a look up table. This process is shown in Code 6.41. Line 2 creates this table using the maketrans function. This creates a look up table in which each character in the first string is replaced by the respective character in the second string. Line 3 applies this table to the DNA using the
translate function. The output comp has replaced all of the characters with the new
characters and line 4 reverses the string.
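A sketch of this approach for the same illustrative sequence (in Python 3.x the function is str.maketrans) is:

>>> dna = 'atcgttaagc'                     # illustrative sequence
>>> table = str.maketrans('acgt', 'tgca')  # Python 3.x; 2.7 uses string.maketrans
>>> comp = dna.translate(table)
>>> comp[::-1]
'gcttaacgat'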
A string with numerical data can be converted to a numerical form using the appropriate
command. Examples are shown in Code 6.42. The first two lines convert strings to an
integer or a float. The third line converts a number into a string.
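For example:

>>> int('42')
42
>>> float('3.7')
3.7
>>> str(98.6)
'98.6'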
This section will show a few examples of string manipulation using the play Romeo and
Juliet.
The first question is: Which person is named the most frequently? The answer is
shown in Code 6.43. The data is loaded in lines 1 and 2. Line 3 counts the number of
occurrences of “Romeo” and line 5 counts the number of occurrences of “Juliet.” As seen
Romeo is mentioned significantly more times. Even Tybalt is mentioned more often than
Juliet.
The second question is which person is named first, Romeo or Juliet? Line 1 in Code
6.44 finds the first occurrence of “Romeo” and the returned result is 0. That means that
Code 6.43 Counting names in the play.
the very first word in the file is “Romeo.” This makes sense since the name of the play is
the first part of the file.
So, the search is modified to start after Shakespeare writes “SCENE I.” Line 3 finds
the location of this string and line 5 begins the search at that location thus excluding the
title from the search. As seen the first location of “Romeo” after the play starts is at
position 6570. “Juliet” appears much later than that, so Romeo is mentioned first.
The third question is: which name ends the most sentences with a period? The process is similar except that the search string includes a period. Code 6.45 shows the results and as seen Romeo wins again.
The fourth question is how many unique words did Shakespeare use. Now, this is
a bit tricky and the results shown are not exactly correct. Currently, the text includes
upper and lower case letters and thus would treat “The” and “the” as different words.
Furthermore, all punctuation is included so “Romeo” is counted differently than “Romeo.”
or “Romeo,”. So the results show the upper limit of the number of words and unique words
that were used. Line 1 in Code 6.46 shows that upper limit on total words to be 25,643.
The set command will eliminate duplicates and so this command can be used to find the
number of unique entries which in this case is 6338.
Problems
1. Assign variables aa the value of 4 and bb the value of 9. Compute cc which is the
addition of these numbers.
4. Write Python script to round the following values: 8.2, 4.5, 9.8.
5. Put the following items in a tuple: 5, 6.7, ’a string’, 4, 1+6j. Return the length of
the tuple.
6. In the tuple in the previous problem, retrieve every other item and print to the
console.
7. Convert the tuple created in problem 5 to a list. Append ’New String’ to that list.
10. Create a string that is 'aeiouAEIOU'. Using maketrans create a new string which is 'AEIOUaeiou'.
Chapter 7
Most programming languages have a few commands that control the flow of the program.
These are used to repeatedly perform the same computation or to make decisions. Python
is no exception and control is managed by the if, while and for commands.
The if command steers the program depending on the truth of a condition. For example, the program has two choices: if x > 5 then it would do one thing, but if x is less than or equal to 5 then it would do something else. This is a decision. A simple example is shown in Code 7.1 where the decision is made in line 1. If c > 5 then the program would do whatever is in lines 2 and 3.
1 if c >5:
2 command 1
3 command 2
Python is heavily reliant on the use of indentations. The if command ends with a
colon and then in this example the next two lines are indented. All of the lines that are
indented are inside of the if statement.
Python indentations must be consistent throughout. In the previous code the in-
dentations are 4 spaces. It is important that the commands have exactly the same number
of spaces. The program will not execute if line 2 has an indentation of 3 spaces and line
3 has an indentation of 4 spaces.
Editors such as IDLE have the default setting of inserting 4 white spaces when the TAB button is pressed. Other editors, however, may be set up to insert a TAB character when that same button is pressed. Even though the TAB indentation may look the same as a 4 space indentation, they are not the same, and Python will not accept the mixture. It is prudent to ensure that all Python editors a user employs use 4 spaces for indentations.
A working example of the if statement is shown in Code 7.2. The variable x is set
to 6 in line 1 and in line 2 the program checks to see if x is greater than 4. This is True
so it then executes line 3 and the result is shown in line 5.
Code 7.2 The if statement.
1 >>> x = 6
2 >>> if x >4:
3 print ( ' Yes ' )
4
5 Yes
This example used the greater than comparison. The common comparison operators are:
== (equal to)
!= (not equal to)
< (less than)
> (greater than)
<= (less than or equal to)
>= (greater than or equal to)
A second example is shown in Code 7.3 where two commands are executed if the if statement is true. If line 1 were changed to if x<4: then neither one of the print statements would be executed.
Code 7.3 Two commands inside of an if statement.
1 >>> if x >4:
2 print ( ' Yes ' )
3 print ( ' More yes ' )
4
5 Yes
6 More yes
Code 7.4 uses the else statement. This is used to execute commands if the if condition is false. So, in this case, if line 1 is true then lines 2 and 3 are executed. If line 1 is false then line 5 is executed.
Code 7.4 Using the else statement.
1 >>> if x >4:
2 print ( ' Yes ' )
3 print ( ' More yes ' )
4 else :
5 print ( ' No ' )
The condition for the if statement can include multiple tests as shown in Code 7.5. The
condition in line 3 uses the and and therefore both conditions must be true in order to
execute line 4. There are three words that are used in complex conditions:
and
not
or
1 >>> x = 6
2 >>> y = 5
3 >>> if x >4 and y >3:
4 print ( ' OK ' )
5
6 OK
Similar to other languages, Python has a particular order in which the conditions are tested. If the conjunctions are the same (perhaps two and statements) then they are evaluated in order of appearance. If the conjunctions are different (perhaps an and and an or) then Python employs a structured hierarchy.
Consider Code 7.6 which has three conditions. The first condition is c>2 which is True, and the second is a<0 which is False. The conjunction between them is or, and so this combination should result in True. The next condition is b<0 which is False, and the preceding conjunction is and, so the entire statement should be False. If that were the case then the word 'Yes' would not be printed to the console. Yet, it is printed.
Python uses a decision hierarchy in which all ands are considered before the ors. Thus in this case, the a<0 and b<0 is evaluated first. This is False. Then the next evaluation is c>2 or False which is True, and the statement is printed to the console.
Parentheses are used to control the order of evaluation, and Code 7.7 shows that c>2 or a<0 can be evaluated first.
Code 7.6 A compound statement.
1 >>> a =1
2 >>> b =2
3 >>> c =3
4 >>> if c >2 or a <0 and b <0:
5 print ( ' Yes ' )
1 >>> a =1
2 >>> b =2
3 >>> c =3
4 >>> if ( c >2 or a <0 ) and b <0:
5 print ( ' Yes ' )
The elif statement adds a second test that is checked only when the first condition is False, as shown in Code 7.8.

Code 7.8 Using the elif statement.
1 >>> if a <3:
2 print ( ' Yes ' )
3 elif b >0:
4 print ( ' No ' )
5 else :
6 print ( ' Maybe ' )
7
8 Yes
7.2 Iterations
Iterations are used to perform the same commands repeatedly. There are two main meth-
ods that this is accomplished. These are the while and for loops.
7.2.1 The while Loop
The while loop will repeatedly perform the same steps until a condition becomes False. Code 7.9 sets anum equal to 0 in line 1. Line 2 starts the while loop, and as long as anum is less than 4 it will execute lines 3 and 4. However, line 4 changes the value of anum and eventually it becomes equal to 4; the condition in line 2 is then no longer True, so Python does not execute lines 3 and 4 and goes on to any steps that are after the while loop. The condition statement can also use parentheses and the keywords and, or, or not.
1 >>> anum = 0
2 >>> while anum < 4:
3 print ( anum )
4 anum = anum + 1
5
6 0
7 1
8 2
9 3
The for loop performs iterations but over a finite list or tuple of items. Line 1 in Code
7.10 creates a list named blist. The for loop is created in line 2 and the variable i will
become each item in the list. So, line 3 is executed four times and i is a different item in
the list in each of the iterations.
1 >>> blist = [1, 'GMU', 'snow days', 2]
2 >>> for i in blist:
3 print(i)
4
5 1
6 GMU
7 snow days
8 2
The range command creates a list of integers, by default starting at 0 and going up to but not including the argument in the command. If two arguments are used (line 3) then they define the starting and ending parts of the list. If three arguments are used (line 5) then they represent start, stop and step.
1 >>> range(10)
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
3 >>> range(2, 10)
4 [2, 3, 4, 5, 6, 7, 8, 9]
5 >>> range(2, 10, 2)
6 [2, 4, 6, 8]
7 >>> for i in range(5):
8 print(i, end=' ')   # py 3.4
9 print i,            # py 2.7
10
11 0 1 2 3 4
12 >>> list(range(10))   # py 3.4
The range function is changed in Python 3.x and it no longer returns a list. However,
this is easily remedied by converting the return using the list function as shown in line
12.
The for loop in line 7 uses the range command. Thus, i will become integers 0
through 4. The comma after the print statement is used to prevent Python from printing
a new line with each iteration. Thus, the output appears together on line 10.
Consider Code 7.12 in which an if statement resides inside of a for loop. For a couple of iterations the if statement in line 3 is False. Eventually it becomes True and then line 4 is executed. The only command there is the break command, which immediately takes the program outside of the for loop. To demonstrate this there are two print statements. When i is 0, 1, or 2, the if statement is False and the print statement in line 5 is executed. However, when i = 3 then line 2 is printed, the if statement is evaluated to be True, and the break command is executed, immediately stopping the iterations in the for loop. As seen in line 11, the 'CCC' was not printed when i was 3. Furthermore, the value of i never becomes 4.
The continue command is related to the break command. Code 7.13 shows an
example. Line 4 contains the continue command, and this command will terminate the
current iteration but will allow subsequent iterations to proceed. As seen in line 10, the
value of i does become 4.
Code 7.12 Using the break statement.
1 >>> for i in range(5):
2 print(i, end=' ')
3 if i >= 3:
4 break
5 print('CCC')
6
7
8 0 CCC
9 1 CCC
10 2 CCC
11 3
Code 7.13 Using the continue statement.
1 >>> for i in range(5):
2 print(i, end=' ')
3 if i >= 3:
4 continue
5 print('CCC')
6
7 0 CCC
8 1 CCC
9 2 CCC
10 3 4
7.2.4 The enumerate Function
Data that comes in a list may need to have indexes to assist in further programming. This
is accomplished with the enumerate function. Line 1 in Code 7.14 creates a list of five
workdays by name. The for loop uses the enumerate function to return the index and
the value of the data from the original list.
1 >>> adata = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
2 >>> for a, b in enumerate(adata):
3 print(a, b)
4
5 0 Monday
6 1 Tuesday
7 2 Wednesday
8 3 Thursday
9 4 Friday
7.3 Examples
This section displays several examples using the combination of control statements.
This example is to compute the average from a set of random numbers. Python offers the
function random that will generate a random number between 0 and 1. This function will
generate random numbers that are evenly distributed, which means that there is an equal
chance of getting a random number near 0.1 as one near 0.5. Thus, the average of many
random numbers should be very close to 0.5. This function is in the random module and
is shown in Code 7.15.
Code 7.15 Generating random numbers (a different value appears on each call).
1 >>> import random
2 >>> random.random()
3 0.3415...
Code 7.16 creates an empty list and then fills it with 1000 random numbers. In line
3 a random number is generated and put in a variable named r, and Line 4 appends this
to the list.
Code 7.16 Collecting random numbers.
1 >>> coll = []
2 >>> for i in range(1000):
3 r = random.random()
4 coll.append(r)
Code 7.16 is not the most efficient manner in which this can be done, but it shows
the steps involved.
The average of a set of numbers is computed by

a = (1/N) ∑_{i=1}^{N} x_i. (7.1)

Here the individual values are x_i where i goes from 1 to N, and N is the number of samples. The computation of the average is shown in Code 7.17. In line 1 the variable sm
is set to 0. The loop in lines 2 and 3 computes the sum of all of the numbers in the list
coll. Line 5 divides by the total number of samples. As seen, the average is very close to
0.5. If a large data set is used then the average will be even closer to 0.5.
1 >>> sm = 0
2 >>> for i in coll:
3 sm = sm + i
4
5 >>> sm/1000.
6 0.49551410985107763
The codes that have been shown are not the most efficient method of implementation.
Functions from the numpy module can improve both coding efficiency and performance
speed. Line 2 in Code 7.18 creates a vector (an array) of 1000 random numbers and line 3
computes the average of that vector. Again the average is close to 0.5. These commands
are reviewed in Chapter 11.
7.3.2 Example: Text Search
In this example the task is to find all of the words that follow the letters 'the'. The text that will be used will be converted to lowercase. This search will look for the letters 'the' followed by a space. However, this process is not perfect and it will consider a word like 'bathe' to be a positive hit since it contains 'the ' (including a space at the end). Code 7.19 shows the script for loading the text file that contains the script from Romeo and Juliet.
The real work is done in Code 7.20. Line 1 starts with an empty list named answ. The for loop started in line 2 will set the variable to integer values up to 3 less than the length of the string. The if statement in line 3 then determines if the four characters starting at location i are 'the ' (including the space). If they are, then the next step is to isolate the word that follows that space. That word starts at location i+4, but where it ends is unknown. So, line 4 finds the location of the next space, and this location is stored in the variable end. Thus the word after 'the ' starts at location i+4 and ends at location end. This word is appended to the list answ.
1 >>> answ = []
2 >>> for i in range(len(data)-3):
3 if data[i:i+4] == 'the ':
4 end = data.find(' ', i+4)
5 answ.append(data[i+4:end])
6
Line 7 shows that there are 672 entries in this list and line 9 prints out the first 10 of these. These are some of the words that follow 'the ' in the play Romeo and Juliet.
There may be duplicates in this list, and they can be removed by using the set and list commands as shown in Code 7.21. The set command will remove the duplicates as it creates a set. The list command converts that result back into a list.
Code 7.21 Isolating unique words.
Figure 7.1 shows the sliding box problem in which a box slides (without friction) down
an inclined plane with an angle of θ to the horizontal. The acceleration that the box
experiences is
a = g sin θ, (7.2)
where g is the acceleration due to gravity. Thus, the speed of the box is computed by,
v = gt sin θ. (7.3)
where t is the time. In this example a Python script is written to calculate the velocity of
the box at specific times.
The process is shown in Code 7.22. For this problem there are two functions that are needed from the math module. The sin function computes the sine of an angle. However, Python, like most computer languages, uses radians instead of degrees for angles. Therefore, the radians function is used to convert the angle from degrees to radians. These two functions are imported in line 1. Line 2 sets the gravity constant to 9.8 and line 3 sets the angle theta to 20 degrees.
In this example, 10 time steps are printed and this process begins with the for loop
in line 4. The task is to compute the velocity for every quarter of a second. So, the variable
t is one-fourth of the integer i as computed in line 5. Line 6 computes the velocity and
prints it to the console. Four items are printed. The first two are the variables i and t.
The third item is a tab character which is used to make the output look nice. Finally, the
velocity at each individual time is printed.
Code 7.22 Computations for the sliding block.
1 >>> from math import sin, radians
2 >>> g = 9.8
3 >>> theta = 20
4 >>> for i in range(10):
5 t = 0.25 * i
6 print(i, t, '\t', g * t * sin(radians(theta)))
7
8 0 0.0 0.0
9 1 0.25 0.837949351148
10 2 0.5 1.6758987023
11 3 0.75 2.51384805344
12 4 1.0 3.35179740459
13 5 1.25 4.18974675574
14 6 1.5 5.02769610689
15 7 1.75 5.86564545804
16 8 2.0 6.70359480918
17 9 2.25 7.54154416033
In this example the value of π is calculated using random numbers. Consider Figure 7.2
which has a square that has a length and height of 2. Inside of this square is a circle with
a radius of 1.
The area of the square is

A1 = 2 × 2 = 4, (7.4)

and the area of the circle is πr², but since r = 1 in this case the area is just

A2 = πr² = π. (7.5)
Now consider that a dart is thrown at Figure 7.2 and it lands inside of the square.
There is also a probability that it will land inside of the circle. The probability of the dart landing inside of the circle is

p = A2/A1 = π/4. (7.6)

Thus p = π/4, or in other words, π = 4p. So, if the value of p can be determined then the
value of π can be determined. Now consider the idea of throwing thousands of darts at
the image. The probability p is the total number of darts that land inside of the circle
divided by the total number of darts. The question is then, how can we write a program
to throw these darts?
A dart is a random location inside of the square. This can be defined by a point
(x, y) where both x and y are random numbers between -1 and +1. Any dart that is inside
of the circle has a distance of less than 1 to the center of the circle. The distance from the
center of the circle to a point (x, y) is determined by

d = √(x² + y²). (7.7)
Code 7.23 shows the process to perform these computations. Line 1 creates the variable total, which will count the total number of darts. Line 2 creates the variable incirc, which will count the number of darts inside of the circle. Both of these are initialized to 0.
1 >>> total = 0
2 >>> incirc = 0
3 >>> for i in range ( 1000000 ) :
4 x = 2 * random . random () - 1
5 y = 2 * random . random () - 1
6 d = sqrt ( x * x + y * y )
7 if d < 1:
8 incirc = incirc + 1
9 total = total + 1
10
11 >>> 4 * incirc / total
12 3.14...
The for loop starts in line 3 and will iterate one million times. Lines 4 and 5 create the random (x, y) point from two random numbers between -1 and +1. The distance to the center is computed in line 6. If this value is less than one then the value incirc is increased by one, counting this particular dart as being inside of the circle. Every dart gets counted in line 9. Finally, π = 4p is computed in line 11. As seen, the result in line 12 is quite close to the value of π.
7.3.5 Example: Summation Equations
This section will demonstrate the process of converting a summation equation into Python
scripts. Consider the case where the initial data is a small tuple as shown in line 1 of Code
7.24. The len function returns the length of the tuple as seen in lines 2 and 3. The range
function returns a sequence that starts at 0 and increments up to, but does not include,
the given number, as seen in lines 4 and 5.
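The listing itself can be sketched from this description and from Table 7.1 (the tuple contents are taken from the x[i] column of the table):
x = ( 1, 2, 5, 6 )         # the small tuple of Code 7.24
len( x )                   # returns 4
list( range( len( x ) ) )  # [0, 1, 2, 3]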
The summation of all of the elements, s = Σᵢ xᵢ, is accomplished by a for loop as shown
in Code 7.25. The answer is placed in a variable named answ which is initialized to 0 in
line 1. Lines 3 and 4 are inside the for loop. Table 7.1 shows the value of each variable
through each iteration. The final answer is shown in the console in line 6.
1 >>> answ = 0
2 >>> for i in range ( len ( x ) ) :
3 temp = answ + x [ i ]
4 answ = temp
5 >>> answ
6 14
Code 7.26 shows the same process with the necessary modification in line 3. Note that
line 1 is required since the variable answ was changed in Code 7.25.
Table 7.1: Values of the variables during each iteration
Line 3 Line 4
i x[i] answ temp answ
0 1 0 1 1
1 2 1 3 3
2 5 3 8 8
3 6 8 14 14
Consider now the average, a = (1/N) Σᵢ xᵢ, where N is the number of elements. In this
case the 1/N is outside of the summation. Therefore, the loop is completed before the sum
is multiplied by the fraction, as shown in Code 7.27. Lines 1 through 3 are the same as
in Code 7.25. The loop is completed before line 5 is executed. Now the summation is
complete and the for loop is finished. Line 5 performs the multiplication by 1/N, and
the numerator is a floating point value so that the fraction is also floating point.
1 >>> answ = 0
2 >>> for i in range ( len ( x ) ) :
3 answ = answ + x [ i ]
4
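A sketch consistent with the description of Code 7.27, continuing with the same tuple:
answ = 0
for i in range( len( x ) ):
    answ = answ + x[ i ]          # lines 1-3: the summation
answ = ( 1.0 / len( x ) ) * answ  # line 5: multiply by 1/N after the loop ends
# for x = (1, 2, 5, 6) the result is 3.5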
7.4 Problems
1. Write a Python script that sets x = 9 and y = 10. The script then prints YES to the
console if x > y and NO otherwise.
3. Create a while loop that starts with x = 0 and increments x until x is equal to 5.
Each iteration should print to the console.
4. Repeat the previous problem, but have the loop skip printing x = 5 to the console
while still printing the values of x from 6 to 10.
6. Create a list of 10 data points in the form of (x, y). The values of these points can
be randomly assigned. Write a Python script in which both the x and y values are
used as indexes in the for loop. Print the values for each iteration.
7. Using the random dart method show that the area of a right triangle is half of the
area of the bounding box.
8. Using the random dart method show that the area of any triangle is half of the area
of the bounding box. The user should define the triangle by defining the corners as
three points in space.
9. Section 7.3.4 uses a circle that is inside of a square. Using the random dart method
compute the area of a square that is inside of a unit circle with all four corners
touching the circle.
Chapter 8
This chapter reviews methods in which Python can read and save text files.
There are three basic steps to reading data from a file on the hard drive. These are:
1. Open a file,
2. Read the data, and
3. Close the file.
Consider a case in which the text file “mydata.txt” already exists on the hard drive
and the goal is to read this data into Python. The three steps are shown in Code 8.1.
Line 1 opens the file using the file command. Newer versions of Python use the open
command instead. The variable fp is a file pointer and contains information about where
the file exists on the hard drive and the position of the reader. When the file is opened
the position is at the beginning of the file but this can be altered by the user as shown in
Section 8.3.
Code 8.1 Reading a file.
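A minimal session matching this description, using the open command, might be:
fp = open( 'mydata.txt' )   # open the file; fp is the file pointer
data = fp.read()            # read the entire file into one string
fp.close()                  # close the file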
Line 3 reads the entire file into Python. The variable data is a string. If the data is
numerical in nature then it will be necessary to convert the string into a numerical value.
This is discussed in Section 6.5. Line 4 closes the file. It is good practice to close files
when access is finished.
Code 8.1 assumes that the data file is also in the current working directory. If that
is not the case then it is necessary to include the file structure when reading a file. The
example using a full file structure is shown in line 1 in Code 8.2. Line 2 shows the case of
reading data that is in a subdirectory of the current working directory.
Code 8.2 uses forward slashes to delineate the directories in the structure. This
is the style used in UNIX and OSx systems. Windows uses backslashes to delineate the
subdirectories. However, backslashes are also used to denote special characters such as a
tab (\t) or newline (\n). The Python solution is that the two backslashes can be used to
delineate the directory structure, or the forward slashes will still work in Windows.
It is possible to open, read and close a file in a single command. This shortcut is
shown in line 4 in Code 8.2.
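The commands of Code 8.2 follow this pattern; the directory names here are only placeholders:
fp = open( '/home/user/project/mydata.txt' )   # a full directory path
fp = open( 'subdir/mydata.txt' )               # a subdirectory of the working directory
data = open( 'mydata.txt' ).read()             # open, read and close in a single command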
Storing data in a file follows a similar process in that the protocol is: open the file
for writing, write the data, and close the file.
These steps are shown in Code 8.3. Line 1 opens the file, and it should be noted that
if a file with the name “output.txt” previously existed then line 1 will eliminate that file.
There is no warning; if line 1 is executed then the previous file is gone for good. The argument
’w’ is the flag that indicates that this file is open for writing.
Line 2 writes the string indata to the file and line 3 closes the file. The only data
that can be written by this method is a string. If the data is numerical then it must be
converted to a string before it can be saved.
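A sketch of the writing steps of Code 8.3; indata is assumed to already be a string:
indata = 'some text to store'    # the string to be written
fp = open( 'output.txt', 'w' )   # 'w' destroys any previous output.txt
fp.write( indata )               # only strings can be written this way
fp.close()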
The methods shown save the data as a text file. The advantage is that the data
can be read on any platform. So, a file can be stored on Windows and then read on a
Mac. The disadvantage is that the files can become large. The alternative is to store data
in binary format, which has just the opposite features. Files are not easily transferred
from one platform to the next because Windows and Mac store binary data differently.
However, the files can be smaller particularly for a lot of numerical data. Code 8.4 shows
the lines for opening a file for writing and reading binary data.
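The binary modes described for Code 8.4 simply add a 'b' to the flag:
fp = open( 'mydata.bin', 'wb' )   # open for writing binary data
fp.close()
fp = open( 'mydata.bin', 'rb' )   # open for reading binary data
fp.close()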
Modern biological labs rely on robots and computers to process and collect the data. The
experiments can process a large array of data and store it all in a single file. The files will
have several components such as information about the protocol, date, users, experiment,
data locations within the file and the raw data.
Reading such a file can be done in two ways. One is to load the entire file, which
can be several megabytes, and then process the data. The second is to move about the file
stored on the disk and extract the pertinent components. For example, early sequencers
would produce a file that had header information (date, etc.) which included the location
of the information about the data. This was at the end of the file. So, it was necessary
to jump towards that section of the file and then read the information about where the
raw data was kept in the file. Then the program needed to move backwards in the file to
where the raw data was stored.
So, it is necessary to have the ability to move about a file so that specific components
can be read. This is accomplished with the seek command. Code 8.5 shows an example.
Line 1 opens the file and line 2 moves the position to the 6th byte in the file. The read
command in line 3 has an integer argument which is the number of bytes to read. In this
case, only 1 byte is read. Line 4 repositions the file to the 3rd byte from the end and then
another single byte is read. This is just a very simple example, but these are the steps
that are used to move about a file and read in a specific number of bytes.
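A session consistent with this description, assuming a file named 'mydata.txt', is:
Code 8.5 Using the seek command.
fp = open( 'mydata.txt', 'rb' )
fp.seek( 6 )        # move the position to the 6th byte
b1 = fp.read( 1 )   # read a single byte
fp.seek( -3, 2 )    # move to the 3rd byte from the end
b2 = fp.read( 1 )   # read another single byte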
The current position in the file is returned by the fp.tell() command.
8.4 Pickle
Python offers terrific collections such as tuples and lists. Saving this data would be difficult
if every component was required to be converted to a string. The pickle module offers the
ability to store multiple types of data in a single command. The process is shown in Code
8.6, which starts with the creation of a tuple that contains another tuple. Line 3 opens the
file for writing in the same manner as a binary file (required since Python 3.x). Whenever
a file is opened for writing with the open command it will destroy any file with the same name. There
is no warning and no Ctl-Z that can reverse the deed. Once the command is executed the
previous file with the same name is gone.
Line 4 imports the pickle module and line 5 shows the single dump command that
stores everything in a single file. Code 8.7 shows the method to read in a pickled file. The
file is opened in the normal manner but the reading is performed by the load command.
As seen data is the nested tuple that was created in Code 8.6.
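A sketch of the dump and load steps described in Codes 8.6 and 8.7:
import pickle
mytuple = ( 1, 2, ( 3, 4 ) )        # a tuple that contains another tuple
fp = open( 'mydata.pickle', 'wb' )  # binary mode is required in Python 3
pickle.dump( mytuple, fp )          # one command stores everything
fp.close()
fp = open( 'mydata.pickle', 'rb' )
data = pickle.load( fp )            # data is the nested tuple again
fp.close()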
8.5 Examples
8.5.1 Sliding Window in DNA
In this example a DNA string will be analyzed to compute the percentage of ‘t’s within
a sliding window. The first step is to load the DNA data. Line 1 in Code 8.8 shows the
command to read in the file as a single long string. In this case fname is the name of
the file that contains the data. Line 3 shows that this is a very long string with 4 million
characters.
Long strings should NEVER be printed in the IDLE shell. The process used for
printing will take an extremely long time. It is possible to print out a small portion of the
string as shown in Line 4.
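A sketch of Code 8.8; the file name is hypothetical:
fname = 'dnadata.txt'          # hypothetical name of the DNA file
dna = open( fname ).read()     # line 1: one very long string
len( dna )                     # about 4 million characters in the example
dna[100:110]                   # a small portion is safe to print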
In this task the goal is to compute the percentage of ‘t’s in a window of 10 characters.
This window is then moved to the next 10 characters and the ‘t’ percentage is calculated
for this new window. Line 1 in Code 8.9 shows the command to print the first ten
characters. Line 3 counts the number of ‘t’s in this small string. Line 5 computes the
percentage of ‘t’s in this small string. Line 7 computes the percentage for a different set
of 10 characters starting at position 200.
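A session consistent with the description of Code 8.9:
dna[:10]                            # line 1: the first ten characters
count = dna[:10].count( 't' )       # line 3: the number of 't's in the window
pct = count / 10.0                  # line 5: the percentage of 't's
dna[200:210].count( 't' ) / 10.0    # line 7: the window starting at position 200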
The goal is to consider 10 characters in one window which starts at position k. Then
this window is moved to the next position which is at k + 10. Code 8.10 shows this task.
First an empty list is created in line 1 which will catch the answers as they are generated.
The pct is the percentage of ‘t’s for that window. Note that the percentage uses a floating
point 10 instead of an integer. This percentage is appended to the list in line 5 and the
answer is shown. These are the percentages of ‘t’s in a sliding window of length 10.
Code 8.10 A sliding window count.
1 >>> answ = []
2 >>> for i in range ( 0 , 100 , 10 ) :
3 count = dna [ i : i +10]. count ( ' t ' )
4 pct = count /10.0
5 answ . append ( pct )
6 >>> answ
7 [0.3 , 0.2 , 0.2 , 0.2 , 0.2 , 0.3 , 0.0 , 0.3 , 0.0 , 0.2]
The DNA string, however, is longer than 100 characters. So, a small modification is
needed in order to compute the percentages for the sliding window for the entire string.
The value of 100 needs to be replaced by the length of the string. The change is shown
in Code 8.11. Line 2 replaces the end value with len(dna). The answer is now a list of
over 400,000 numbers. It is highly recommended that the entire list NOT be printed to
the console.
1 >>> answ = []
2 >>> for i in range ( 0 , len ( dna ) , 10 ) :
3 count = dna [ i : i +10]. count ( ' t ' )
4 pct = count /10.0
5 answ . append ( pct )
6 >>> len ( answ )
7 440384
This example shows a method by which a spreadsheet page can be read and parsed in
Python. The first step, of course, is to save the page from the spreadsheet as a tab
delimited file. The sample data is shown in the spreadsheet in Figure 8.1. This data is
saved as a tab delimited text file named sales.txt.
Code 8.12 shows the command to read in the file. The variable sales is a string
that has 1152 characters. The first 100 characters are printed and as seen this is the top
row of the spreadsheet. Each column is separated by a tab (\t) and each row is separated
by a newline character (\n).
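A sketch of Code 8.12:
sales = open( 'sales.txt' ).read()   # the tab delimited file from Figure 8.1
len( sales )                         # 1152 characters in the example
sales[:100]                          # the top row of the spreadsheet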
Code 8.13 shows the steps in parsing the first line of data. Line 1 uses the split
command to separate the data into a list of strings. Each string is one row from the
spreadsheet. So lines[0] is the first row of the spreadsheet as shown in line 3. Line
4 splits that single string on the tab characters and thus the first row becomes a list of
Figure 8.1: Data in a spreadsheet.
strings where each string is one cell in the spreadsheet. This is shown in line 6.
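A sketch of Code 8.13's two splits:
lines = sales.split( '\n' )      # line 1: one string per spreadsheet row
lines[0]                         # the first row of the spreadsheet
cells = lines[0].split( '\t' )   # line 4: one string per cell of that row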
The data starts in the second line of the spreadsheet and line 1 in Code
8.14 splits this line into its constituents. Note that in line 2 all of the items in the list are
strings.
Code 8.15 shows the method by which the data is read and converted to floats for a
single line. The list temp is created in line 1. Line 3 splits lines[1] into its constituents
which is the same as line 2 in Code 8.14. The first item a[0] is the string Bath Towels
and therefore the conversion to numerical starts with a[1]. The for loop starts at 1 and
line 4 converts each of the numerical items to a float.
1 >>> temp = []
2 >>> a = lines [1]. split ( ' \ t ' )
3 >>> for j in range (1 , 4) :
4 temp . append ( float ( a [ j ] ) )
5 >>> temp
6 [6.95 , 5.0 , 319.0]
The entire process is shown in Code 8.16. It should be noted that the text file has
one line at the end of the file that is empty. This is normal when a spreadsheet page
is saved as a text file. Line 1 creates an empty list that will hold all of the numerical
data. Line 2 starts the for loop which excludes the first line since it has header data
and excludes the last line since it is known to be empty. Line 3 is the same process as in
Code 8.15 except that the process is applied to all rows as the outer loop goes through its
iterations.
Code 8.16 Converting all of the data.
1 >>> answ = []
2 >>> for i in range ( 1 , len ( lines )-1 ) :
3 temp = []
4 for j in range ( 1 , 4 ) :
5 a = lines [ i ]. split ( ' \ t ' )
6 temp . append ( float ( a [ j ]) )
7 answ . append ( temp )
Problems
2. In the DNA string there are regions that have a repeating letter. What is the letter
and length of the longest repeating region?
4. In Romeo and Juliet retrieve all of the capitalized words that do not start a sentence.
Use set and list to remove duplicates from this list.
7. What is the largest distance (number of characters) between two consecutive in-
stances of the word “Juliet”? (The previous problem will be of assistance.)
8. What is the most common word in Romeo and Juliet that is at least 5 letters long?
Chapter 9
There are now many online archives of biological data and often this data is available
in the form of a spreadsheet. This chapter will review the different methods by which
spreadsheets can be read by Python.
In the first method the user would save the spreadsheet page as a tab delimited text
file and then use Python to read the file and parse the data. The second method reads
that same file using the csv module. When the spreadsheet is saved as a text file only the
data is saved. Plots, charts, equations and formatting are all lost. There are modules that
allow the user to read and write spreadsheets including these features. The third section
in this chapter reviews the xlrd module which can read a spreadsheet file directly. The
final method uses the openpyxl module, which can read and write the .xlsx format. While these
latter two methods can write to spreadsheets, this chapter only reviews the methods of
reading the data. There are many aspects of these modules which are not covered here.
The first method requires that the user save the spreadsheet page as a tab delimited text
file. This saves only a single page and only the data therein. It is important to use the tab
delimited option instead of the comma delimited option because some of the fields like a
gene name may contain commas.
A spreadsheet can be saved in many different formats. In LibreOffice the user selects
the File menu and the Save As option. At the bottom of the dialog there is an option
to change the format of the file to be saved. The selection is changed to Text CSV. The
first pop up dialog is shown in Figure 9.1(a) and the “Use Text CSV Format” should be
selected. This creates a second dialog that is shown in Figure 9.1(b). Here the user needs
to make the correct choices as shown in the first two fields. UTF-8 is standard text format
and Tab is selected as the delimiter.
The output is a text file which contains the data from the spreadsheet. Each cell is
Figure 9.1: The two dialogs presented when saving as Text CSV: (a) the first dialog and (b) the dialog with the field options.
separated by a Tab and each row is separated by the newline character, which appears as
‘\n’ when displayed. Figure 9.2 shows two parts of a very large spreadsheet and Code 9.1
loads the data in line 2.
Figure 9.2: Two parts of a very large spreadsheet.
The data is almost 700,000 characters and this is far too much to print to the console,
so only a portion is printed in line 6. The first number is 0.993095 which corresponds to the
cell highlighted in Figure 9.2(a). Following that cell is a cell with the number 1.688044 and
the two are separated by a tab character which is denoted by \t. Each of the remaining
cells in that row are also separated by tabs. The row ends with a cell containing the value
of 0 and after that is the newline character ‘\n’. The number 883 in the last line in Code
9.1 begins the next row in the spreadsheet
which is shown in Figure 9.2(b). This is the nature of the tab delimited spreadsheet.
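A sketch of Code 9.1; the file name is hypothetical and the values are those described in the text:
data = open( 'bigsheet.txt' ).read()   # line 2: load the tab delimited file
len( data )                            # almost 700,000 characters
data[:60]                              # begins '0.993095\t1.688044\t...'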
Each row in the spreadsheet can be isolated by the split command. Code 9.2 shows
the manner in which each row of data is separated. The output is a list named lines that
contains strings. Each string is a row of data from the spreadsheet. The cells in each row
can be separated by splitting on tab characters.
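A sketch of Code 9.2:
lines = data.split( '\n' )   # a list of strings, one per spreadsheet row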
In this large spreadsheet there are three sections of data and the third section has
the raw data. This portion is shown in Figure 9.3. The task demonstrated here is to
extract six columns: Number, Name, ch1 Intensity, ch1 Background, ch2 Intensity and ch2
Background.
Figure 9.3: The portion of the spreadsheet at the beginning of the raw data.
The first step is to find the location of “Begin Data” in the original string. This
is done in line 1 of Code 9.3. Only the data after that is important to this application
and so in line 4 that portion of the spreadsheet data is split. The output, lines, is a
list of strings with each string being a row from the spreadsheet. The first item in this
list, lines[0], is row 1650 in Figure 9.3. The second row is the list of column names of
which only a few are shown in the figure. The rest of the lines in Code 9.3 find out which
columns are those of interest.
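A sketch consistent with Code 9.3's description; the column labels follow Figure 9.3:
k = data.find( 'Begin Data' )     # line 1: locate the start of the raw data
lines = data[k:].split( '\n' )    # line 4: split only that portion into rows
names = lines[1].split( '\t' )    # the row of column names
names.index( 'ch1 Intensity' )    # the column number of one name of interest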
The final step is to collect the data. In this case, there are 1600 lines of data and
the real data starts in lines[2]. So the loop in line 2 of Code 9.4 is over those lines.
Line 3 splits the cells on tabs and line 4 extracts only those columns that are of interest.
It also converts strings to integers or floats as necessary. Each of these lists of data are
appended to the big list gsmvals. The first row is shown. From here the user can perform
the analysis on the data.
Python installations come with the csv module that has the ability to read files saved in
the CSV format. The advantage of this module over the previous method is that it can
Code 9.4 Collecting the data.
1 >>> gsmvals = []
2 >>> for li in lines [2:1602]:
3 temp = li . split ( ' \ t ' )
4 tlist = [ int ( temp [0]) , temp [5] , float ( temp [8]) , float (
temp [9]) , float ( temp [20]) , float ( temp [21]) ]
5 gsmvals . append ( tlist )
6 >>> gsmvals [0]
7 [1 , ' phosphodiesterase I / nucleotide pyrophosphatase 2 ( autotaxin ) ' ,
3077.651611 , 1083.671875 , 1107.415527 , 374.328125]
handle multiple formats in which the data is saved.
Code 9.5 shows the use of this module on the same file that was used in the previous
section. The file is opened in a normal manner as shown in line 2. Line 3 defines a new
variable as a csv reader. In this case the delimiter is clearly defined as the tab character.
Without that declaration, commas will also be treated as a delimiter and as there are
commas in some gene names this will cause incorrect reading of the data.
Line 4 creates an empty list that is populated in lines 5 and 6. These two lines
convert the data so that each row from the spreadsheet is a list of strings. Each cell is a
string in that list.
To replicate the extraction of data performed in the previous section the first task
is to find which row contains the phrase “Begin Data”. This is performed in lines 7 and 8
and the result indicates that this is in ldata[1649]. The last two lines extract the same
six columns of data as in the previous method.
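A sketch of the csv steps described for Code 9.5:
import csv
fp = open( 'bigsheet.txt' )
rdr = csv.reader( fp, delimiter='\t' )   # tab is declared as the delimiter
ldata = []
for row in rdr:
    ldata.append( row )              # each row becomes a list of cell strings
for i in range( len( ldata ) ):
    if 'Begin Data' in ldata[i]:
        print( i )                   # 1649 for the example file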
9.3 xlrd
There are two modules that are reviewed here that have the ability to read an Excel file
directly. These modules have several functions, but only those necessary for reading a
file are shown here. It should be noted that the two previous methods could only read
the data of a single page, while these next two modules can also read all pages, formulas
and other entities in the spreadsheet. Neither of these modules comes with the native
version of Python and users may have to download and install them. They are, however,
included with packages such as Anaconda.
The first module is xlrd which can read the older style of Excel files that come with
the extension “.xls.” The example is shown in Code 9.6. The file is opened in line 2.
Lines 3 and 4 show the list of page names. In this case, the spreadsheet has only one page
named “Export”.
Line 5 extracts the data from the specified page and line 6 shows the extraction of a
single row of data. This is a list and in this case this list has 34 items. There is one item
for each cell in that spreadsheet row. The last two lines show how to retrieve the content
of the first cell in the first row.
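A sketch of Code 9.6 using xlrd; the file name is hypothetical:
import xlrd
wb = xlrd.open_workbook( 'export.xls' )
wb.sheet_names()                  # ['Export'] for this file
ws = wb.sheet_by_name( 'Export' ) # line 5: the data from the one page
row = ws.row_values( 0 )          # line 6: a single row as a list (34 items here)
ws.cell_value( 0, 0 )             # the content of the first cell in the first row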
Code 9.7 shows the use of this module. Lines 1 and 2 indicate that the sheet has 3258
rows. Lines 3 through 7 find the one row with the string “Begin Data”. The rest of the
lines convert the data to a list for further processing. Note that numbers are automatically
converted to floats in this process.
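Code 9.7 can be sketched as follows, continuing with the ws object:
ws.nrows                                   # 3258 rows in this sheet
for i in range( ws.nrows ):                # find the row holding 'Begin Data'
    if 'Begin Data' in ws.row_values( i ):
        k = i
vals = []
for i in range( k+2, ws.nrows ):           # skip the row of column names
    vals.append( ws.row_values( i ) )      # numbers arrive already as floats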
9.4 Openpyxl
The openpyxl module offers routines to read the XLSX file format. Code 9.8 shows the
process of loading the file and getting the page names in lines 1 through 3. Access to the
cells is shown in lines 4 through 7. Use of the active sheet is shown in the last lines.
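A sketch of the openpyxl steps; the file name is hypothetical:
import openpyxl
wb = openpyxl.load_workbook( 'export.xlsx' )
wb.get_sheet_names()                 # the list of page names
ws = wb[ wb.get_sheet_names()[0] ]   # access one page
ws.cell( row=1, column=1 ).value     # the content of a single cell
ws = wb.active                       # the active sheet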
Code 9.9 shows that the variable ws.rows is just a tuple and that a single row such
as ws.rows[0] is also a tuple. Therefore, they can be accessed through numerical indexes
as shown in the final lines.
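A sketch matching Code 9.9's description; in the openpyxl version used by the text, ws.rows behaves as a tuple:
rows = ws.rows       # a tuple of rows in this version of the module
row0 = rows[0]       # a single row is also a tuple of cells
row0[0].value        # numerical indexes reach an individual cell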
9.5 Summary
This chapter demonstrated four possible ways of accessing data contained in a spreadsheet
from Python. The first two required that the user save the spreadsheet information as a
CSV file and the last two read directly from the spreadsheet.
Chapter 10
When it became possible to sequence parts of the genome several companies created se-
quencing machines. One such sequencer was made by ABI and it had the ability to run a
few dozen experiments at one time. This machine produced a data file that had several
components and this chapter will explore the methods needed to read this file. All of the in-
formation about this file was obtained from the published ABI documentation.[ABI, 2016]
A single sample of DNA contained a large number of DNA strands, each starting at the
same location in the genome. However, the lengths varied. One of four dyes was
attached to the end of each strand depending on the last base in the strand. Thus, if the
dye could be detected then it is possible to determine the last base in a strand.
The next step is to separate the strands by length. This was performed by sending
the strands through a gel. Longer strands encountered more resistance and therefore
traveled slower through the gel. The gel was kept between two plates of glass and oriented
vertically so that the sample went from the top of the gel to the bottom by the aid of an
electric potential and gravity. At the bottom of the gel was a laser that would illuminate
the dyes as they passed through and an optical detector that would receive the fluorescence
from the dye. In these machines the gel was wide enough to run a few dozen samples at the
same time. Each set of samples ran down a lane and the laser could illuminate all of the
lanes.
The information for each lane was saved in a separate file. This file contains infor-
mation about the experiment as well as the data from the experiment. Since there are four
nucleotides in DNA there were four dyes and therefore each experiment had four channels.
One channel from one experiment is shown in Figure 10.1. The x axis corresponds to time
and each peak is the presence of this dye at a given time. There is also a bias as this
sample does not go to 0 when the dye is not present.
Figure 10.1: One channel from one lane.
This experiment also collected almost 8000 data points. Concurrently, it was col-
lecting data points for the other three channels. Figure 10.2 shows a small segment of
the experiment with all four channels. Dyes react to a range of optical frequencies and
therefore activity in one channel can also be seen in another. This occurs at locations where
two channels have a peak at the same time. Also noticeable is that each channel has its
own baseline and as seen in Figure 10.1 this baseline changes in time.
A deconvolution process is applied to clean the data. The same segment of signal is
shown in Figure 10.3 after the deconvolution process was applied. As seen, each channel
has a baseline at y = 0 and only one peak is present at any time.
Figure 10.3: The same signal after deconvolution.
The final step was to call the bases. In this case the red line is associated with
G, green with A, blue with T and violet with C. Thus, in this segment the calls are
ACTATAGGGCGAATTCGAG.
The data files contain a lot of information, but the intent of this chapter is to
demonstrate how to read data files. Thus, extracting all of the information will not be
performed. Instead the only retrievals will be the raw and cleaned data as well as the base
calls. The rest of the information can be retrieved in manners very similar to those shown
in this chapter.
10.2 Hexadecimal
Before the data is extracted it is necessary to understand two forms of numerical represen-
tation. People use a base 10 system. A number of 10 or greater requires two digits. One
digit uses the ones column and the other uses the tens column. Numbers of 100 or greater
use the hundreds column and so on.
While this system is natural for humans, who mostly have ten fingers, it is not
well suited for computer use. Computers actually can only store information in a binary
format. Each bit of memory is represented by either a 0 or 1.
A byte of memory is eight bits and therefore can represent 256 different values.
A word is two bytes or 16 bits. A word can represent 65,536 different values. Modern
computers are 32 or 64 bits.
Writing long binary values is cumbersome and easily prone to errors, so the hexadecimal
system is often employed. This is a base-16 system and the conversion is shown in Table
10.1. The digits 0 through 9 are the same in hexadecimal as in decimal, and so they are
not all repeated in the table.
Table 10.1: Hexadecimal Values.
Hexadecimal Decimal
0 0
9 9
A 10
B 11
C 12
D 13
E 14
F 15
The ABI file is rather large and contains a plethora of information. Programs such as
hexdump can show the raw contents of a file easily. This program comes with UNIX
and OSx operating systems and is called from the command line with the command hd
filename. Hexdump programs for Windows are available but care should be used when
downloading executable programs from websites.
Figure 10.4 shows the beginning of the hexdump for an ABI gel file. The left column
is the location in the file represented in hexadecimal notation. So, line 1 starts at 00000000
and line 2 starts at 00000010, which is location 16d. The next segment of a line shows
sixteen bytes in hexadecimal notation. The last column is the ASCII notation for the file.
A computer can only store numerical values and there is an ASCII table which associates each
letter with a numerical value. The last column in the display shows the ASCII equivalents
for each byte. Not all bytes have an associated character and so periods are used.
Like many file formats the first few bytes denote the type of file. In this case the
first four bytes are 41 42 49 46. These four bytes represent the characters ABIF from the
ASCII table.
Python offers two functions that can convert between characters and numerical
values. It also provides tools that convert between hex and decimal notations. These are
shown in Code 10.1. Line 1 uses the hex command to convert the decimal value 65 to its
hex equivalent which is 41h. Python represents a hex number with 0x. In line 3 the user
enters the hex value with this notation and the decimal value is returned. The first byte
in the file was 41h and line 5 uses the chr command to return the associated character
which is a capital ‘A’. The second byte in the file was 42h and that is returned as ‘B’. The
ord function finds the decimal value for a given letter.
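A session matching the description of Code 10.1:
>>> hex( 65 )      # line 1: decimal 65 to hexadecimal
'0x41'
>>> 0x41           # line 3: a hex value entered by the user
65
>>> chr( 0x41 )    # line 5: the character for byte 41h
'A'
>>> chr( 0x42 )    # the second byte of the file
'B'
>>> ord( 'A' )     # the decimal value for a given letter
65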
The first four bytes in the file are ABIF and the next two bytes represent the version
identifier. This is an unsigned integer. Code 10.2 shows that this file used version 101.
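A sketch of reading the version identifier; the file name is hypothetical and struct is explained later in this chapter:
import struct
fp = open( 'mygel.fsa', 'rb' )
fp.seek( 4 )                                    # skip the four ABIF bytes
ver = struct.unpack( '>H', fp.read( 2 ) )[0]    # an unsigned 16 bit integer
print( ver )                                    # 101 for this file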
There are many bytes filled with FF and the hexdump program will place an asterisk
in a line to show that this is just the same set of bytes in all of the missing rows.
This file relies on the use of a record which was defined by ABI. A record consists of 28
bytes in the format shown in Table 10.2.
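The layout, following the published ABIF documentation [ABI, 2016], is:
Table 10.2: ABI Record.
name                4 bytes   characters
number              4 bytes   unsigned integer
element type        2 bytes   unsigned short
element size        2 bytes   unsigned short
number of elements  4 bytes   unsigned integer
data size           4 bytes   unsigned integer
data offset         4 bytes   unsigned integer
data handle         4 bytes   unsigned integer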
The first record starts just after the ABIF and version number which is location six.
Code 10.3 shows the steps to read the bytes for the first record. The file is opened in line
1 and the file pointer is moved to location 6 which is the beginning of the first record.
Line 4 reads the 28 bytes of the record but does not interpret them. The first four bytes
are the name of the record and are shown in line 6. The name of the record is “tdir” and
the ‘b’ that precedes them is the Python indication that this is a series of bytes.
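A sketch of Code 10.3:
fp = open( 'mygel.fsa', 'rb' )   # line 1: open for binary reading
fp.seek( 6 )                     # move to the first record
a = fp.read( 28 )                # line 4: the 28 bytes of the record
a[:4]                            # b'tdir': the name of the record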
Python has a module named struct which can conveniently convert bytes read from
a file to the desired format. This module has two main functions pack and unpack. The
latter is applied to the record in line 2 of Code 10.4. The second argument is a[4:] which
uses all of the bytes of the record except for the first 4. These have already been used to
return the name of the record. As shown in Table 10.2 the rest of the data is either 32 or
16 bit integers. The letter for an unsigned 16 bit integer (also called an unsigned short) is ‘H’
and for an unsigned 32 bit integer it is ‘I’. Thus the string ‘IHHIIII’ interprets the data as shown
in Table 10.2. The symbol ‘>’ indicates that the data is big endian. The unpack function
has many symbols that can be used and the reader is encouraged to view these options in
the Python manual pages at https://docs.python.org/3.5/library/struct.html.
The unpack function returns 7 numbers accordingly. The information that is im-
portant here is that there are 56 records and the starting location in the file of these
records is 128,947d which is also 01 F7 B3 in hex.
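A sketch of the unpack call of Code 10.4; the field positions follow Table 10.2:
import struct
fields = struct.unpack( '>IHHIIII', a[4:] )   # big-endian: one I, two H, four I
nrec = fields[3]   # the number of elements: the 56 records
loc = fields[5]    # the data offset: 128947, or 01 F7 B3 in hex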
Figure 10.5 shows this location in the hexdump. The second row begins with 00 01
F7 B0 and so the starting location is the third byte in from the left. As seen in the right
column this corresponds to the record named AUTO. Every 28 bytes there is a new record
and some of their names are visible in this figure.
There are 56 records in this file and only a few are of interest here. There are 12
records named DATA. The first four are the four raw data channels as shown in Figures
10.1 and 10.2. The next four contain the information used in the deconvolution process
and the last four contain the cleaned data such as shown in Figure 10.3. The other record
of interest is named PBAS which contains the base calls.
10.3.2 Extracting the Records
Code 10.5 shows the ReadRecord function, which reads and interprets a single
record following the previous prescription. This function is called 56 times for this file
gathering information from all the records.
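The abigel module is the author's own; a minimal sketch of what its ReadRecord function does, according to the description, is:
import struct
def ReadRecord( fp, loc ):
    # move to the record, read its 28 bytes, and interpret them per Table 10.2
    fp.seek( loc )
    a = fp.read( 28 )
    return ( a[:4], ) + struct.unpack( '>IHHIIII', a[4:] )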
10 >>> recs = []
11 >>> k = 128947
12 >>> for i in range ( 56 ) :
13 recs . append ( abigel . ReadRecord ( fp , k ) )
14 k += 28
15 >>> recs [35]
16 ( b ' PBAS ' , 2 , 2 , 1 , 576 , 576 , 128317 , 19912464)
The records are in alphabetical order and recs[35] is the PBAS record. This record
indicates that the data starts at location 128317 and that there are 576 bytes of data.
Code 10.6 shows the movement of the file pointer and line 3 reads the ensuing 576 bytes.
These are bases as called by the ABI software. The ‘N’ letters indicate that a base exists
but there was not enough information to properly call the base.
The first two records for the data are shown in Code 10.7. The first number indicates the
record number, so these are 1 and 2. The second value is 4 and this indicates that the
data is a signed 16 bit integer (see the ABI manual starting on page 13). The next value
is 2, which indicates that a 16 bit integer is 2 bytes. The next number is 7754, which is
the number of data samples and since each sample is 2 bytes the total number of bytes
Code 10.6 The base calls.
1 >>> fp . seek ( 128317)
2 128317
3 >>> calls = fp . read ( 576 )
4 >>> calls
5 b' TNNGAATTGCATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCTC
6 TAGAGTCGACCTGCAGGCATGCAAGCTTGAGTATTCTATAGTGTCACCTAAATAGCTTGG
7 CGTAATCATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACA
8 ACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCA
9 CATTAATTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGC
10 TTAATGAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGCTCTTCCGCTTC
11 CTCGCTCACTGACTCGCTGNGCTCGGTCGTTCGGCTGCGGCGAGCGGTATCAGCTCACTC
12 AAAGGCGGGTAATACGGGTTATCCACAGGAATCAGGGGATAACGCAGGAAAGACATGTGA
13 GCAAAAGGGCAGCAAAAGGGCAGGAACCCTAAAAAGGCCGCGTTGGTGGGNTTTTCCATA
14 GGGTCCCCCCCCTGANGAGATAAAAAANCGAGGTCAC '
is 15508. The next numbers are the locations of the data. So, the first channel starts at
location 453 in the file and the second channel starts at location 15961.
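Following the description, the first two DATA records would appear roughly as below; the final handle values are illustrative:
>>> recs[0]
(b'DATA', 1, 4, 2, 7754, 15508, 453, 0)
>>> recs[1]
(b'DATA', 2, 4, 2, 7754, 15508, 15961, 0)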
Code 10.8 shows that the file pointer is moved to the location 453. Line 3 reads in
the bytes and line 4 converts them all to big endian, signed 16 bit integers. The use of
‘7754h’ indicates that there are 7754 signed 16 bit integers to be decoded. The last lines
show the first ten values which are the first ten values in the plot in Figure 10.1.
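A sketch of Code 10.8:
fp.seek( 453 )                        # the start of the first raw channel
st = fp.read( 15508 )                 # 7754 samples of 2 bytes each
chan = struct.unpack( '>7754h', st )  # big-endian signed 16 bit integers
chan[:10]                             # the first ten values in Figure 10.1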
The data can be saved using the Save function from the gnu module. Then either
GnuPlot or a spreadsheet can plot the data.
10.3.3 Cohesive Program
Already the ReadRecord function has been presented but other functions are needed to
automate the reading of the gel file. Code 10.9 shows the ReadPBAS function,
which is used to search the records for the one named PBAS and then extract the called
bases. The inputs are the file pointer and the records which are read by ReadRecord.
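A minimal sketch of such a ReadPBAS, using the record layout from Table 10.2:
def ReadPBAS( fp, recs ):
    # find the PBAS record and read its base calls
    for r in recs:
        if r[0] == b'PBAS':
            fp.seek( r[6] )          # the data offset
            return fp.read( r[4] )   # one byte per called base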
The ReadData function shown in Code 10.10 reads the data arrays that were shown in
the plots. The inputs are the file pointer and the previously read records.
There are 12 entries named DATA and the first four and last four are the desired
arrays of data. Thus, line 4 creates an integer k. It is incremented after each iteration in
line 14. The data is extracted only if k is less than 4 or greater than 7, as in line 7. The
file pointer is moved and the data retrieved in lines 8 and 9. The first four data channels have the
same length, but this length is different than the length of the last four channels. Thus,
the instruction to struct.unpack must be created. The variable g in line 11 is a string
that is the instruction for unpack. Then line 12 executes that command converting the
data to signed integers. The list data contains four channels of raw data and four channels
of cleaned data.
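A sketch of a ReadData under the same assumptions; k counts the DATA entries so that only the raw and cleaned channels are kept:
def ReadData( fp, recs ):
    data, k = [], 0
    for r in recs:
        if r[0] == b'DATA':
            if k < 4 or k > 7:          # skip the deconvolution entries
                fp.seek( r[6] )         # move to this channel's data
                st = fp.read( r[5] )
                g = '>%dh' % r[4]       # the instruction, e.g. '>7754h'
                data.append( struct.unpack( g, st ) )
            k += 1
    return data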
The SaveData function shown in Code 10.11 saves the channels in eight different files. The
input is the data returned from the previous function and the pname is the partial file
name which has a dummy default value. In this case the files will be stored as dud0.txt,
dud1.txt etc. The file name is created in line 4 and the Save function from the gnu module
is called to save the file in a text format that is readable by GnuPlot or a spreadsheet.
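The gnu module is also the author's; assuming its Save function takes a file name and an array, SaveData can be sketched as:
import gnu   # the author's plotting-helper module
def SaveData( data, pname='dud' ):
    for i in range( len( data ) ):
        fname = pname + str( i ) + '.txt'   # dud0.txt, dud1.txt, ...
        gnu.Save( fname, data[i] )          # assumed signature: name then data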
The final function is named Driver which performs all of the tasks. Line 3 opens
the file for reading binary data and the first record is read in line 4. This indicates the
location of the other records, loc, and the number of records, nrec. These are read in
line 9 and appended into a list in line 10. Then the functions that read the calls and data
are accessed. The final step is to save the data to the disc for viewing and to return the
data to the user.
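Putting the pieces together, a sketch of Driver:
def Driver( fname ):
    fp = open( fname, 'rb' )
    rec0 = ReadRecord( fp, 6 )       # the first (tdir) record
    nrec, loc = rec0[4], rec0[6]     # number of records and their location
    recs = []
    for i in range( nrec ):
        recs.append( ReadRecord( fp, loc + 28*i ) )
    calls = ReadPBAS( fp, recs )
    data = ReadData( fp, recs )
    SaveData( data )                 # save to the disc for viewing
    fp.close()
    return calls, data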
Problems
2. Show that there are no bytes between the end of the raw data for the first channel
and the beginning of the data for the second channel.
Chapter 11
Python Arrays
The original Python has several powerful packages but was missing the ability to efficiently
handle vectors, matrices and tensors. Two third party packages, numpy and scipy, offer
these abilities along with an extensive scientific library. This chapter will review some
of the basics but will fall woefully short of covering the extensive library of available
functions.
Python uses the word array to mean a collection of elements of the same data type. This includes
vectors, matrices and tensors. This text, though, may delineate these mathematical
entities even though Python simply refers to them as arrays.
11.1 Vector
The zeros function creates a vector in which every element is 0. There are three
other methods by which vectors can be created. Line 1 in Code
11.2 uses the ones command to create a vector where all of the values are 1 instead of 0.
Line 4 uses the array command to create a vector from user defined data. Line 7 uses
the random.rand command to create a vector of random numbers with values between
0 and 1.
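A sketch of the three methods of Code 11.2:
import numpy as np
v1 = np.ones( 4 )                  # line 1: every value is 1
v2 = np.array( [1.0, 2.5, 3.7] )   # line 4: user defined data
v3 = np.random.rand( 5 )           # line 7: random values between 0 and 1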
Often the values of arrays printed to the console show more digits than are useful,
and so the print precision can be controlled as shown in Code 11.3. This uses the
set_printoptions function to set the nature of the output. This only affects the printing
of the values and not the precision used in computations.
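A sketch of Code 11.3:
np.set_printoptions( precision=3 )   # arrays now print with three decimals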
11.2 Matrix
The same functions that create a vector can be used to create a matrix. The zeros
function is shown in Code 11.4. In this case the argument to the zeros function is the
tuple (2,3). This defines the vertical and horizontal dimension of the matrix.
There is a difference in generating a random matrix in that the ranf function is used
instead of the rand function. This is shown in Code 11.5.
Code 11.4 Creating a matrix.
1 >>> M = np . zeros ( (2 ,3 ) )
2 >>> M
3 array ([[ 0. , 0. , 0.] ,
4 [ 0. , 0. , 0.]])
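A sketch of the ranf call described for Code 11.5:
M = np.random.ranf( (2,3) )   # a random matrix; the dimensions are a tuple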
11.3 Slicing
Slicing of a vector behaves in the same way as does slicing for tuples, lists and strings.
Slicing for a matrix is different since there are multiple dimensions. Line 1 in Code 11.6
extracts the value from the first row and the first column. Line 3 extracts all of the values
from the first row. Line 5 extracts all of the values from the second column.
1 >>> M [0 ,0]
2 0.18948171379575485
3 >>> M [0]
4 array ([ 0.189 , 0.736 , 0.668])
5 >>> M [: ,1]
6 array ([ 0.736 , 0.449])
Code 11.7 gets a sub-matrix from the original. In this example, the command
extracts the rectangle that includes rows 1 & 2 and columns 2 & 3.
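Such a slice can be sketched as follows; rows 1 & 2 are indices 0:2 and columns 2 & 3 are indices 1:3:
sub = M[0:2, 1:3]   # the rectangle of rows 1 & 2 and columns 2 & 3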
The nonzero function returns locations of values that are not zero in an array. In
the example in Code 11.8 line 3 compares each value in the vector to 0.5. The answer
shown in line 4 places a True or False in the elements where the condition was met or failed.
In line 5 the nonzero function is added. This will return the positions in which the True
value was returned. For vectors it is necessary to put the [0] at the end of the function.
The answer in line 6 is a vector that indicates that the comparison from line 3 was True in
positions 1 and 2.
The return from the nonzero function is a vector and that can be used to slice an
array. Line 1 in Code 11.9 is the same as line 5 in Code 11.8 except that the answer is
returned as a variable x. This x is a vector. In line 2 this vector is used as the index for
Code 11.8 Using the nonzero function.
1 >>> vec
2 array ([ 0.033 , 0.659 , 0.958])
3 >>> vec > 0.5
4 array ([ False , True , True ] , dtype = bool )
5 >>> ( vec > 0.5) . nonzero () [0]
6 array ([1 , 2] , dtype = int64 )
the original data vector. The result is that this command returns the data that was
at positions 1 and 2.
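A sketch of Code 11.9, continuing with the vec shown above:
x = ( vec > 0.5 ).nonzero()[0]   # the qualifying positions, array([1, 2])
vec[x]                           # array([ 0.659, 0.958])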
This feature is called random slicing because it has the ability to extract the elements
from an array in any specified order. Consider the first 6 lines in Code 11.10. Each one
extracts one of the elements from the array named P. Lines 7 and 8 create lists which are
the coordinates that were used in lines 1, 3 and 5. Line 9 uses those coordinates to pull
out the same three values in a single command.
The advantage of arrays is the speed in which the computations can be performed. Con-
sider the simple task of adding the values of two matrices to create a third matrix. Lines
1 and 2 in Code 11.11 create 2 matrices. Lines 3 through 6 show the method of adding
Code 11.10 Random slicing.
1 >>> P [0 ,1]
2 0.83306186238236724
3 >>> P [1 ,1]
4 0.044929981120311102
5 >>> P [3 ,0]
6 0.63136473831275342
7 >>> v = [0 ,1 ,3]
8 >>> h = [1 ,1 ,0]
9 >>> P [v , h ]
10 array ([ 0.833 , 0.045 , 0.631])
the two matrices together. The only problem with this approach is that it is slow. Line
8, on the other hand, is simpler to write and has a much faster execution time.
1 >>> m1 = np . random . ranf ( (100 ,100) )
2 >>> m2 = np . random . ranf ( (100 ,100) )
3 >>> m3 = np . zeros ( (100 ,100) )
4 >>> for i in range ( 100 ) :
5         for j in range ( 100 ) :
6             m3 [i , j ] = m1 [i , j ] + m2 [i , j ]
7
8 >>> m3 = m1 + m2
Code 11.12 creates two vectors in lines 1 and 2. Line 7 shows that with one simple
command several additions are performed.
Without arrays, the Python programmer would be required to perform this addition
with a for loop. The for loop behind line 7 does exist, but it is in the compiled code that is
called when two arrays are added together. Of course, vectors can be subtracted and
multiplied as shown in Code 11.13.
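A sketch of Code 11.12's addition:
a = np.random.rand( 3 )
b = np.random.rand( 3 )
c = a + b    # line 7: several additions performed in one simple command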
The multiplication shown in line 4 is an element-by-element multiplication, meaning that each
element is multiplied by the respective element in the other vector. There are actually
four ways that two vectors can be multiplied together. The others are dot product, outer
product and cross product.
The dot product is also called the inner product and the answer is a single scalar
value. The notation is,
v = a · b.
The Python script that performs this operation is shown in line 1 of Code 11.14.
Code 11.13 Subtraction and multiplication of arrays.
1 >>> c = a -b
2 >>> c
3 array ([ 0.036 , 0.503 , 0.648])
4 >>> c = a * b
5 >>> c
6 array ([ 0.435 , 0.113 , 0.274])
Numpy also provides a function named dot that performs the same computation. Actually,
this function can also compute the outer product, the matrix-vector product, and the
vector-matrix product.
1 >>> ( a * b ) . sum ()
2 0.8210798233839392
3 >>> a . dot ( b )
4 0.8210798233839392
Code 11.16 The transpose of a matrix.
1 >>> M
2 array ([[ 0.671 , 0.058 , 0.095 , 0.359] ,
3 [ 0.287 , 0.644 , 0.793 , 0.962] ,
4 [ 0.501 , 0.279 , 0.294 , 0.557]])
5 >>> M . transpose ()
6 array ([[ 0.671 , 0.287 , 0.501] ,
7 [ 0.058 , 0.644 , 0.279] ,
8 [ 0.095 , 0.793 , 0.294] ,
9 [ 0.359 , 0.962 , 0.557]])
10 >>> M . T
11 array ([[ 0.671 , 0.287 , 0.501] ,
12 [ 0.058 , 0.644 , 0.279] ,
13 [ 0.095 , 0.793 , 0.294] ,
14 [ 0.359 , 0.962 , 0.557]])
Line 1 of Code 11.17 creates a square matrix (same dimension in the horizontal and vertical) and line 2 computes the inverse of
that matrix. The matrix-matrix multiplication of a matrix with its inverse produces the
identity matrix which has ones down the diagonal and zeros everywhere else.
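A sketch of Code 11.17:
M = np.random.ranf( (4,4) )   # line 1: a square matrix
Minv = np.linalg.inv( M )     # line 2: its inverse
M.dot( Minv )                 # approximately the identity matrix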
There are also a large set of standard math functions that can be applied to a matrix.
In all cases these are applied to each element of the array. Examples are shown in Code
11.18. There are many more functions than shown in the code.
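A few of the functions described for Code 11.18, each applied to every element:
np.sqrt( M )   # the square root of each element
np.exp( M )    # the exponential of each element
np.cos( M )    # the cosine of each element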
11.5 Information
There are several functions that extract information from an array. Some of these are
shown in Code 11.19. Line 1 computes the sum over the vector, line 3 computes the
average of the vector values, and line 5 computes the standard deviation of the values.
There are also functions to compute the max, min and mode.
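A sketch of Code 11.19:
v = np.random.rand( 10 )
v.sum()    # line 1: the sum over the vector
v.mean()   # line 3: the average of the values
v.std()    # line 5: the standard deviation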
A matrix has the same functions but there are choices that are available. For
example, the sum function can be used to compute the sum of all of the elements in the
matrix as shown in line 1 in Code 11.20. It is also possible to compute the sum of the
columns as shown in line 3. Line 5 sums across the rows. The argument to the sum
function is the axis of the array. For a matrix the first axis is the vertical dimension and
the second axis is the horizontal dimension and thus they are assigned the values 0 and 1.
This logic applies to all of the functions that are shown in Code 11.19.
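A sketch of Code 11.20:
M = np.random.ranf( (3,4) )
M.sum()      # line 1: the sum of all elements
M.sum( 0 )   # line 3: the sums of the columns (axis 0, the vertical)
M.sum( 1 )   # line 5: the sums across the rows (axis 1, the horizontal)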
The max function returns the maximum value in an array but it does not indicate
where the maximum value is. The argmax function is used to get that information.
Consider the example in Code 11.21 where line 1 creates a vector of random numbers.
Line 4 returns the maximum value. Note that line 5 shows more precision than line 3
because the set_printoptions command applies to arrays whereas line 5 is just a float.
The argmax function is applied in line 6 and this indicates that the maximum value is
at location 4. Line 8 prints this element.
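A sketch of Code 11.21:
v = np.random.rand( 8 )   # line 1: a vector of random numbers
v.max()                   # the maximum value
v.argmax()                # line 6: the location of the maximum
v[ v.argmax() ]           # line 8: the element at that location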
There is also an argmin and an argsort function. The argmin function behaves
just as the argmax function except that it seeks the minimum value. The argsort
function returns the sort order of the data as seen in Code 11.22. This result indicates
that the lowest value is w[1], the next lowest value is at w[0], and so on. The highest
value is at w[4].
The argmax function for a matrix requires a bit of decoding. It returns a single
value as shown in line 5 of Code 11.23. This value is the cell position in the matrix and
can be decoded to reveal the row and column position of the max. The row number is the
division of the argmax value by the number of columns. In this case 5 (the number of
columns) goes into 6 (the argmax) 1 time. Thus, the max is on row 1 (the second row).
Code 11.22 Using argsort.
1 >>> w . argsort ()
2 array ([1 , 0 , 2 , 3 , 4] , dtype = int64 )
3 >>> w [1]
4 0.37750555674191966
5 >>> w [2]
6 0.79997899612310597
The remainder of this division (6 ÷ 5) is also 1, so the location of the max is in column
1. In this case, the max is at Q[1,1]. Both the division and remainder can be computed
by the divmod function as shown in line 6. This one command returns both the division
and the remainder which are also the vertical and horizontal position of the max.
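A sketch of Code 11.23, using a matrix with 5 columns:
Q = np.random.ranf( (2,5) )
Q.argmax()                # the flat cell position, 6 in the example
divmod( Q.argmax(), 5 )   # (1, 1): the row and column of the max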
The task is to gather all of the random numbers that are above the value of 0.5. Line 1 in
Code 11.24 performs half of the work. The array P is compared to a value of 0.5. All of
the elements that pass that test are set to True and the nonzero function extracts their
positions. Lines 2 through 7 show these positions. Line 8 performs the other half of the
work. The positions which were stored in v and h are used as indexes and the values at
those locations are captured by the variable vals. The numbers in the vector vals are all
of the numbers in P that were greater than 0.5.
Code 11.24 Extracting qualifying values.
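A sketch consistent with the description:
P = np.random.ranf( (4,5) )
v, h = ( P > 0.5 ).nonzero()   # line 1: the positions of qualifying values
vals = P[v, h]                 # line 8: every value of P greater than 0.5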
11.7 Indices
Consider a case where the task is to examine pixels that surround a face in an image. There
are several steps required to accomplish this task. First the image is converted to a matrix
(next chapter) and then a face-finding algorithm is applied. The face-finding algorithm is
not perfect and will have false positives and therefore it is necessary to analyze the pixels
that surround the suspected face. This section considers the problem of extracting just
those pixels as shown by the circle in Figure 11.1.
The indices function creates two matrices as shown in Code 11.25. One of the
matrices has increasing values down the rows and the other has increasing values across
the columns.
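A sketch of Code 11.25:
a, b = np.indices( (5,5) )   # a increases down the rows, b across the columns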
This is an extremely useful function that can be the foundation of isolating elements
in a matrix. Consider Code 11.26 which subtracts an integer from each matrix. This shifts
the row and column that contain 0 to new locations. The values in the first matrix are
the distances from the 0 row and the values in the second matrix are the distances to the
0 column.
Recall that the equation to compute a linear distance is,
d = √(x² + y²). (11.2)
There is a single element position at which both matrices contain a 0. This is the
defined center, and the distance from that point to any other point is
computed by the Euclidean distance equation above.
All of these distances can be computed in a single command as shown in line 1 of
Code 11.27. The output d is a matrix and each element contains the distance to the center.
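A sketch of Code 11.27, continuing with the shifted matrices above:
d = np.sqrt( (a-2)**2 + (b-2)**2 )   # each element holds its distance to the center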
Figure 11.1: Isolating the pixels about the face.
Code 11.26 Shifting the arrays.
1 >>> a-2
2 array ([[-2 , -2 , -2 , -2 , -2] ,
3 [-1 , -1 , -1 , -1 , -1] ,
4 [ 0, 0 , 0 , 0 , 0] ,
5 [ 1, 1 , 1 , 1 , 1] ,
6 [ 2, 2 , 2 , 2 , 2]])
7 >>> b-2
8 array ([[-2 , -1 , 0, 1, 2] ,
9 [-2 , -1 , 0, 1, 2] ,
10 [-2 , -1 , 0, 1, 2] ,
11 [-2 , -1 , 0, 1, 2] ,
12 [-2 , -1 , 0, 1, 2]])
The purpose of this code is to compute the average of the values that are within a
distance of 10 from a defined central point.
This example uses a very small matrix, but now consider a much larger matrix that
goes through the same process. In the next chapter, images will be loaded and the pixel
values will be converted to a very large matrix. As an example, the programmer wants to
gather all pixels within a specified distance to a defined point. Defining those points can
be done by the method that was just described.
A smaller version is shown in Code 11.28 where the input data is created in line 1.
The desire is to define the central point at (50,40) which is not the center of the matrix.
However, it is from this point that we wish to gather all of the elements that are within a
distance of 10. Lines 2 through 4 create the two matrices that will be used to calculate the
distances as in line 5. The matrix dist contains distances from each element to the defined
point (50,40). Any pixel that is a distance less than 10 is one that is to be gathered. The
matrix d computed in line 6 has elements that are True if the distance to the central point
is less than 10.
Line 7 collects the coordinates of these points and line 8 collects the values of the
pixels and computes the average.
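A sketch of Code 11.28; the size of the input matrix is an assumption:
data = np.random.ranf( (256,256) )        # line 1: the input data
a, b = np.indices( data.shape )
dist = np.sqrt( (a-50)**2 + (b-40)**2 )   # distances to the point (50,40)
d = dist < 10                             # line 6: True within a distance of 10
v, h = d.nonzero()                        # line 7: coordinates of those points
avg = data[v, h].mean()                   # line 8: the average of those values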
11.8 Example: Simultaneous Equations
Consider the two simultaneous equations
3.1x + 2.8y = −1
and
1.2x − 0.9y = 3.
The task is to find the values of x and y that satisfy both equations. This can be
solved by a matrix inverse. The matrix M holds the coefficients (the numerical values on
the left):
[ 3.1   2.8 ] [ x ]   [ −1 ]
[ 1.2  −0.9 ] [ y ] = [  3 ]
The unknowns are the x and y and so the task is to isolate them from all other
components. This is accomplished by left-multiplying both sides of the equation by the
inverse of M. Then the equation becomes,
[ x ]         [ −1 ]
[ y ] = M⁻¹   [  3 ]
Thus, the solution to x and y can be obtained by computing the inverse of the matrix
and then performing a matrix-vector multiply. The result is a vector and those elements
are x and y. The solution is shown in Code 11.29. The values of x and y are 1.22 and
-1.707. Line 9 checks the result using 3.1x + 2.8y = −1.
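A sketch of Code 11.29:
M = np.array( [[3.1, 2.8], [1.2, -0.9]] )
v = np.array( [-1.0, 3.0] )
xy = np.linalg.inv( M ).dot( v )   # array([ 1.22 , -1.707])
3.1*xy[0] + 2.8*xy[1]              # line 9: the check, approximately -1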
Figure 11.2: The electric circuit.
This has very practical uses. Consider the electronic circuit shown in Figure 11.2.
The problem gives the values for the resistors and the voltages. The task is to solve for
the current that goes through each resistor.
The solution for this follows Kirchhoff's laws, which produce three equations,
I1 + I2 − I3 = 0
−I1 R1 + I2 R2 = −E1 + E2
and
I2 R2 + I3 R3 = E2 .
Here there are three equations and three unknowns (the currents I). Thus, a 3 × 3 matrix
is constructed from the coefficients.
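With the component values taken from Figure 11.2, the currents follow from the same inverse method; the values below are only placeholders:
R1, R2, R3 = 100.0, 220.0, 330.0   # placeholder resistances
E1, E2 = 5.0, 9.0                  # placeholder voltages
A = np.array( [[1.0, 1.0, -1.0],
               [-R1, R2, 0.0],
               [0.0, R2, R3]] )
b = np.array( [0.0, -E1 + E2, E2] )
I1, I2, I3 = np.linalg.inv( A ).dot( b )   # the three currents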
Problems
1. Create a vector of 1000 random numbers. Compute the average of the square of this
vector.
2. Create a 5 × 5 matrix of random numbers. Compute the inverse of the matrix. Show
that the multiplication of the inverse with the original is the identity matrix.
3. Create a vector of 1000 elements ranging from 0 to π. Compute the average of the
cosine of these values. This should be performed in two lines of Python script.
4. Create a 5 × 4 matrix of random values from ranging from -1 to +1. Compute the
sum of the rows.
5. Create a 100 × 100 matrix using a random seed of your choice. Using divmod find
the location of the maximum value. Print the random seed, the location of the max
and the value of the max.
6. Given 1.63x − 0.43y = 0.91 and 0.64x + 0.87y = 0.19, write a Python script that uses
the method of simultaneous equations to determine the values of x and y.
Chapter 12
A function is used to contain steps that are used repeatedly. Instead of writing each
individual line of code, the user only needs to call on the function. Several functions
have already been used, but this chapter will demonstrate how functions and modules are
created.
12.1 Functions
Code 12.1 shows a bare-bones function. Line 1 uses the def keyword to declare the
creation of a function. The name of the function in this case is MyFunction. It does not
receive any inputs (empty parenthesis) and the declaration is followed by a colon. Line
2 is indented and it is the first command inside of the function. Line 3 is also indented
and therefore it is also a command inside the function. In most editors simply typing two
Returns will end the indentation and thus end the creation of the function.
The function has been created but has not been executed. Line 5 is at the command
prompt in the Python shell and the function is now called. Lines 6 and 7 show that the
commands inside of the function are executed. It is required to have the parenthesis after
the call to the function as shown in Line 5. If these are omitted then Python will return
information about the function but will not run its commands.
1 def MyFunction () :
2     print ( ' Inside ' )
3     print ( ' the function ' )
4
5 >>> MyFunction ()
6 Inside
7 the function
Consider Code 12.2 which defines a variable inside the function in Line 2. The function
is called in Line 4 and in Line 5 there is an attempt to access the variable ab. However,
an error is created. The variable ab is a local variable since it is declared inside of the
function. It only exists inside of the function and is not accessible outside of the function.
1 def Fun7 () :
2 ab = 9
3
4 >>> Fun7 ()
5 >>> ab
6 NameError : name ' ab ' is not defined
A global variable is defined in Line 1 of Code 12.3. This is defined outside of the
function and is visible inside of the function (Line 3) as well as in the Python shell.
It is possible to declare a global variable inside of a function as shown in Code 12.4.
Line 2 uses the global function to create the global variable abc. It is available outside of
the function as shown in Line 6. The global function must be the first command inside
of the function.
Code 12.3 Executing a function.
1 >>> b = 9
2 >>> def Fun8 () :
3 print ( 7 + b )
4
5 >>> Fun8 ()
6 16
1 def Fun9 () :
2 global abc
3 abc = 10
4
5 >>> Fun9 ()
6 >>> abc
7 10
12.1.3 Arguments
Inputs to a function are called arguments. Code 12.5 shows a new function which receives
a single input, which in this case is the variable ab. The function is called in Line 4 and
this time the user is required to give the function an argument. The variable ab becomes
the integer 5. The function is called again in Line 6 and this time the argument ab is the
string “hi there”.
1 def Fun1 ( ab ) :
2 print ( ab )
3
4 >>> Fun1 ( 5 )
5 5
6 >>> Fun1 ( ' hi there ' )
7 hi there
Some languages are strictly typed, which imposes several restrictions including the dec-
laration of a variable type when it is used as an argument to a function. In creating a function like this
in a language like C++ or Java the programmer would be required to declare the data
type for ab. If it is declared as an integer then it would not be possible to pass a string
to the function. Python is loosely typed and so the variable ab assumes the data type
of the argument that is passed to it. There are advantages and disadvantages to these
philosophies. It is easier to have errors in a loosely typed system as the language will
allow the passing of a variable that is other than the programmer intended. However, in a
strictly typed system the programmer may need to write more functions to accommodate
multiple types of arguments that could be passed to a function.
Code 12.6 shows a function that receives two arguments that are separated by a
comma. Line 5 calls this function and as seen there are two arguments sent to the function.
Code 12.7 attempts to call this function with two different arguments. In Line 1 the two
arguments are strings and instead of adding two integers this function now concatenates
two strings. Line 2 in Code 12.6 is the command that is used to concatenate two strings.
See Code 6.33.
1 def Fun2 ( a , b ) :
2 c = a + b
3 print ( c )
4
5 >>> Fun2 ( 5 , 6 )
6 11

Code 12.7 Calling Fun2 with different types of arguments.
1 >>> Fun2 ( ' hi ' , ' there ' )
2 hithere
3 >>> Fun2 ( ' hi ' , 5 )
4 Traceback ( most recent call last ) :
5   File " < pyshell #1 > " , line 1 , in < module >
6     Fun2 ( ' hi ' , 5 )
7   File " < pyshell #0 > " , line 2 , in Fun2
8     c = a + b
9 TypeError : Can ' t convert ' int ' object to str implicitly
Line 3 attempts to call the same function and now the arguments are a string and
an integer. Python does not add an integer to a string and so an error is caused. Note
that this error indicates that the problem is in Fun2 and that it occurs on Line 2 in that
function. It even shows the offending line and provides a clue as to what the problem is.
A default argument has a default definition that can be overridden by the user. An example
is shown in Code 12.8. In Line 1 the function has two arguments and the second uses an
equals sign to give the variable a default value. The function is called in Line 4 and the inputs to the function are a = 9 and b = 5 as the default value. Line 6 gives the function
two arguments and in this case b = −1. Default arguments have already been used. See
Code 7.11 in which the range function is shown with different numbers of arguments.
1 def Fun5 ( a , b =5 ) :
2 print ( a , b )
3
4 >>> Fun5 ( 9 )
5 9 5
6 >>> Fun5 ( 9 , -1 )
7 9 -1
A default argument must be the last argument in the input stream. It is possible
to have multiple defaults as shown in Code 12.9. Here both b and c have default values.
If a function call has only 2 inputs then the second will be assigned to b. Line 10 shows a
case in which the default value for b is used and the value for c is overridden.
1 def Fun6 ( a , b =5 , c =9 ) :
2 print ( a , b , c )
3
4 >>> Fun6 ( 2 )
5 2 5 9
6 >>> Fun6 ( 2, 3 )
7 2 3 9
8 >>> Fun6 ( 2 , 3 , 4)
9 2 3 4
10 >>> Fun6 ( 2 , c =-1)
11 2 5 -1
Figure 12.1 shows an interaction in the IDLE shell. The user has typed in a command
and the left parenthesis. If the user waits then a help balloon appears. This provides terse
information on the arguments that can be used in the function.
The help function provides even more information on a function as shown in Code
12.10. To create help balloons and descriptions for a function, the first item inside the function is a documentation string as shown in Code 12.11. Line 2 starts with three double-
Figure 12.1: A help balloon.
quotes. In this example there are three lines of instructions and the last line ends with
three double-quotes.
4 range (...)
5 range ([ start ,] stop [ , step ]) -> range object
6
1 def Fun2 ( a , b ) :
2 """ First line
3 Second line
4 Third line """
5 c = a + b
6 print ( c )
Now, when the function is typed with the first parenthesis, the first help line appears
in the balloon as shown in Figure 12.2. The help function will print out all of the lines.
12.1.6 Return
The return command returns values from the function. This command is usually at the end of the function because when it is executed the call to the function ends. An example
Figure 12.2: A help balloon.
1 >>> help ( Fun2 )
2 Help on function Fun2 in module __main__ :
3
4 Fun2 (a , b )
5 First line
6 Second line
7 Third line
is shown in Code 12.13. Line 3 has the return statement. Line 5 shows the call to the
function and this time the function will return a value which is placed into d.
1 def Fun3 ( a ) :
2 c = a + 9
3 return c
4
5 >>> d = Fun3 ( 3 )
6 >>> d
7 12
One of the unusual properties of Python is that it can essentially return multiple
items. Consider Code 12.14 which shows the return function with two variables in Line
4. This function is called in Line 6 and as seen two items are returned. In reality, the
function is only returning one item, which is a tuple that contains two variables. Line 11 receives only one item, and the following lines show that z is actually a tuple.
Functions can be designed to perform numerous tasks and creating such a function can be
difficult. The best idea is to plan the function before writing code. An example is shown in
Code 12.15. Here a function is declared followed by several comment statements. These
are the jobs that the function will eventually accomplish. For now, though, these are
merely ideas. The last line uses the pass statement which does nothing. A function must
Code 12.14 Returning multiple values.
1 def Fun4 ( a , b ) :
2 c = a + b
3 d = a - b
4 return c , d
5
6 >>> m , n = Fun4 ( 5 , 6)
7 >>> m
8 11
9 >>> n
10 -1
11 >>> z = Fun4 ( 5 , 6 )
12 >>> type ( z )
13 < class ' tuple ' >
14 >>> z
15 (11 , -1)
have at least one command and so the pass command is put here as a placeholder. Once
the real commands are entered the pass command can be removed.
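A sketch of such a planned function, using the task comments that later appear in Code 12.16 (the pass statement keeps the empty function valid):

def WordList ( fname ) :
    # load
    # convert to lowercase
    # remove punctuation
    # split
    # return
    pass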
Now that the function is planned it is possible to start writing Python commands.
A good practice is to perform one task at a time and then test the code. Code 12.16 shows
this by example. Line 3 is an actual Python command that will load the file. Line 9 is
then called to test the new function. No errors are returned which is one requirement for
correct code.
The commands for each idea are then placed in the function and tested. The final
result is shown in Code 12.17.
Now that the function is created it is easy to apply all of the commands therein to
separate inputs. Consider Code 12.18 which calls the function WordList in line 1. The
argument is the file that contains the text for Romeo and Juliet. It returns a list of 25,640 words. The function is called again in Line 4 and the only difference is the name of the
Code 12.16 Adding a command.
1 import string
2 def WordList ( fname ) :
3 # load
4 data = open ( fname ) . read ()
5 # convert to lowercase
6 data = data . lower ()
7 # remove punctuation
8 table = str . maketrans ( " ! ' &= ,.;:?[]-" , " XXXXXXXXXXXX " )
9 data2 = data . translate ( table )
10 data2 = data2 . replace ( ' X ' , ' ' )
11 # split
12 words = data2 . split ()
13 # return
14 return words
input file. This time 18,092 words are found in Macbeth.
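A hypothetical session (the file names here are assumptions; the word counts are those reported above):

>>> wl = WordList ( 'data/romeoandjuliet.txt' )
>>> len ( wl )
25640
>>> wl2 = WordList ( 'data/macbeth.txt' )
>>> len ( wl2 )
18092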
12.2 Modules
A module is a Python file that can be created by the user. This file can contain Python
definitions, commands, declarations and functions. Basically, anything that can be typed
into a Python shell can be placed in a module. The module file is then stored for future
use.
Before modules are created it is prudent to create a proper working directory. An
example is shown in Figure 12.3. At the top it is seen that the file manager is in the
C:/science/ICMsigs directory. Inside of this directory are several subdirectories shown as
icons. This is a standard set of subdirectories for a working directory. For this discussion
the important subdirectory is named pysrc. It is in this directory that the researcher working on the ICMsigs project will place their Python modules.
When Python is started it is necessary to move it to the working directory and then
to include the pysrc subdirectory in the search for modules. The steps are shown in Code
12.19. Line 1 imports two modules. Line 2 moves Python to this researcher’s working
directory. Line 3 includes the pysrc subdirectory in the search path. Now, when the user
employs the import command it will also search the pysrc directory for modules.
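A minimal sketch of these steps, using the working directory shown in Figure 12.3:

>>> import os , sys
>>> os.chdir ( 'C:/science/ICMsigs' )
>>> sys.path.append ( 'pysrc' )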
The IDLE environment does have a code editor and new files can be created by
selecting File:New as shown in Figure 12.4. The new file is blank and ready for editing.
Python commands can be entered into the editor as shown in Figure 12.5. In this
case there is a variable definition, a function definition, and the execution of the function.
This file is stored in the pysrc directory and the extension “.py” is required.
Now, when the import function is called the module residing in the pysrc directory
is loaded as shown in Code 12.20.
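Assuming the module file is named first.py (the name used in the discussion below), the import might look like:

>>> import first
>>> first.vara        # the variable defined inside the module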
The module can be altered as shown in Figure 12.6. If the module is changed then in Python 2.7 the reload command is used to load the new code. In Python 3.x this was modified and now it is necessary to import the importlib module and from it call the reload function.
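A sketch of the Python 3.x reload, again assuming a module named first (the reported path will vary):

>>> import importlib
>>> importlib.reload ( first )
<module 'first' from 'pysrc/first.py'>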
An alternative method for loading a module is to use the from ... import command as shown in Code 12.22. In this case it is not necessary to type first.vara to access the variable. However, if this method is used then later changes to the module are not picked up, since the imported names were copied into the current namespace.
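For example:

>>> from first import vara
>>> vara              # accessible without the module prefix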
Figure 12.5: Contents of a module.
Code 12.21 Reloading a module.
The final method of loading a module is to execute the file. Python 2.7 offers the
execfile command as shown in line 1 of Code 12.23. This is equivalent to typing all of the
lines in the file myfile.py into the Python shell. This function does not use the search path
and so it is necessary to define the directory location and to use the extension “.py” as
shown. This command is not available in Python 3.x and so the alternative is to read the
file using open...read and then to use the exec function to execute all of the commands
in the file.
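A sketch of the two approaches, using the directory and file name mentioned above:

# Python 2.7
execfile ( 'C:/science/ICMsigs/pysrc/myfile.py' )

# Python 3.x equivalent
exec ( open ( 'C:/science/ICMsigs/pysrc/myfile.py' ).read () )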
These commands are useful when developing code that needs to be constantly tested
during development. However, once the code is in good shape the import statement should be used.
12.3 Problems
1. Create a function named Aper that receives a single argument named indata. This
function should print to the console the string “The input is: ” followed by the value
of indata.
2. Create a function like the previous but it prints the value of indata three times, each
on a separate line.
3. Create a function named Larger which receives two arguments, adata and bdata.
The function should return the larger of the two values.
4. Create a function named Complement that receives a DNA string and returns the
complement of that string.
5. Create a function that has as its argument a default filename (such as the file for
Romeo and Juliet). The function should return the length of the file (number of char-
acters in the file). Run the function again with a different filename which overrides
the default filename.
6. Create two functions. The first is BF which receives a string and converts the letters
to all capitals. The second is BA which receives a string and removes the spaces.
Then it passes that string to BF and receives the result. The function BA should
then return the resultant string which should be all caps and have no spaces.
Chapter 13
13.1 Justification
A class is an entity that can contain data and related functions. The common example is
that of creating an address book. An entry in the address book would contain a person’s
name, address, telephone numbers, birth date and so on. The class would also contain
functions that manipulate this data. These functions could be as simple as entering data
or storing the data on a disk. The functions should be those that operate on a single
address book entry and not on a group of entries.
An object is an instance of a class. In the example of an address book, one entry
is for a person named Matthew Luedecke and another for Aimee Harper. Each of these
persons requires their own instance of an address card. So, in this example there are two
objects of the address book class.
Classes can also be built on other classes. Thus, if a company was putting together
a database of employees and customers then it is possible to use the address book class
as a building block. Both employees and customers have the information of an address
book but they also have information that is unique. Employees could have information on
their pay rate and rank, whereas the customers could have information on their purchase
history. Both, though, would need the information from the address book. In this case, an
address book class is created. Then an employee class is created that inherits the address
book class. The programmer creating the employee class would not need to program the
functions that deal with an address book. These classes are building blocks for a larger program which is easier to create and much easier to maintain than traditional coding.
There are drawbacks to the use of classes particularly in Python. Objects tend to
run slower than other methods of programming. In a scripting language like Python there
is also the inconvenience of persistence. Consider a case in which a programmer writes
function F1 which produces data for a second function F2. However, after running F2
the user decides that there is an error that needs to be fixed and then F2 needs to be run
a second time. In Python this is easily accomplished without requiring that F1 be run
again. If all of these functions and data were contained within a class then the situation
is different. The function F2 in a class is rewritten but then the entire object will need
to be reloaded which will eliminate any data stored in the previous instance of the class.
That means that F1 would have to be rerun to generate the data stored in the new instance
of the object. So, during the code development stage, a Python programmer may find
more convenience in developing the functions without using object-oriented skills. Once
the functions are bug-free then a class can be created.
Data and functions are the two basic components of a class. Both should be dedicated to
the purpose of the class. In the example of the address book, both the data and functions
should be dedicated to the contents and manipulation of a single entry in the address book.
Functions that deal with multiple address entries or the use of address information
for analysis using non-address data should exist elsewhere. Adherence to this restriction
is paramount in keeping large programs organized.
A very simple class is shown in Code 13.1. Line 1 shows the keyword class which indicates
that a class is being defined. This is followed by the user defined name for that class.
Following that are definitions of data and functions. In this case there is only a single
function which starts on line 2. Note that this is indented thus indicating that the function
is a part of the class.
The first argument in every function is a variable named self which is discussed
in the next section. Following that are the input variables. This function does very little
except that it sets a variable named self.a to the value of the input variable ina. Line
5 shows the creation of an instance of the class. The variable v1 is not a float or integer,
but rather it is a MyClass. It contains a single function which is called in line 6. Note
that there is only one argument in this call. The variable self does not receive an input
Code 13.1 A very basic class.
1 class MyClass :
2     def Function ( self , ina ) :
3         self . a = ina
4
5 >>> v1 = MyClass ()
6 >>> v1 . Function (4)
7 >>> v1 . a
8 4
9 >>> v2 = MyClass ()
10 >>> v2 . Function (-1)
11 >>> v2 . a
12 -1
argument from the user. Line 7 shows that v1 now contains a variable named a which
has a value of 4. Starting in line 9 is the creation of a second instance of MyClass.
Actually, this type of usage has been seen before. Code 13.2 shows the string find
function. The string a has data and associated functions such as find.
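For example, a string object carries its data together with methods such as find; a hypothetical session (the string is an assumption):

>>> a = 'acgtacgt'
>>> a.find ( 'ta' )
3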
13.2.2 Self
Perhaps the most confusing aspect of object-oriented programming is the concept of self
(or *this in C++ or this in Java). Since a class may have several instances it is important
to delineate the variables inside of a function. Consider the class shown in Code 13.3 in
which Line 2 defines a variable that belongs to the class. Line 3 defines a function that will
receive a second instance of the class and add their two variables. Line 7 creates the object m1 and line 8 sets its variable to a value of 5. Line 9 creates a second object
and line 10 sets its variable to 9. Line 11 calls the function. This function belongs to m1
and the input to the function is m2. In line 4 the self.a is the variable for m1 because this
call to the function belongs to m1 (from line 11). The variable mc.a in line 4 is associated
with m2. So, in this example, self.a = 5 and mc.a = 9.
Code 13.3 Demonstrating the importance of self.
1 class MyClass :
2     a = 0
3     def Function ( self , mc ) :
4         print ( self . a + mc . a )
5
6
7 >>> m1 = MyClass ()
8 >>> m1 . a = 5
9 >>> m2 = MyClass ()
10 >>> m2 . a = 9
11 >>> m1 . Function ( m2 )
12 14
A local variable is one that exists only inside of a function and a global variable is one that
can be seen outside of the function. Consider Code 13.4 which has a function inside of
the class. The variable c is a global variable and is accessible to all functions inside of the
class as well as accessible outside of the class. As shown in line 5 the access inside of the function uses self.c. Line 16 shows access to the variable in the object.
Code 13.4 Local and global variables in a class.
1 class MyClass :
2     c = 0
3     def Function ( self , ina ) :
4         b = ina
5         self . a = b + self . c
6
7 >>> v1 = MyClass ()
8 >>> v1 . Function (4)
9 >>> v1 . a
10 4
11 >>> v1 . b
12 Traceback ( most recent call last ) :
13 File " < pyshell #148 > " , line 1 , in < module >
14 v1 . b
15 AttributeError : ' MyClass ' object has no attribute ' b '
16 >>> v1 . c
17 0
The variable self.a is also a global variable since it has self. in its declaration.
The variable b, on the other hand, is a local variable. It is used inside of the function
and once the program exits the function the variable ceases to exist. As seen in line 11
an attempt to access this variable results in an error because it was destroyed when line
8 finished its execution.
There are several predefined operators in Python. For example, the addition of two floats
uses the plus sign which is an operator. Somewhere along the line the computer must have
a definition of what to do when it sees the combination of a float, a plus sign, and a float.
It is possible to define the operator for a class. Consider the case of the address book
entries. One entry was for Aimee and another for Matthew. Theoretical code (code that
does not exist) is shown in Code 13.5. The address book entries for Aimee and Matthew
are created and in line 6 they are added. The programmer would have to define what is
meant by the addition of two addresses. Perhaps the function will create a new person
taking the first name from one person and the last name from the other. In fact, line 7 relies on an overload of the __str__ function which is used by print.
Code 13.5 Theoretical code showing implementation of a new definition for the addition
operator.
1 # theoretical code
2 >>> person1 = AddressBookEntry ( )
3 >>> person1 . SetName ( ' Aimee ' , ' Harper ' )
4 >>> person2 = AddressBookEntry ( )
5 >>> person2 . SetName ( ' Matthew ' , ' Luedecke ' )
6 >>> clone = person1 + person2
7 >>> print ( clone )
8 Aimee Luedecke
A simple example is shown in Code 13.6 with the function __add__ which has two underscores before and two after the name add. This function will define the addition
operator for the class. This operator must receive one argument besides self which is the
data from the right side of the plus sign. Line 5 creates the class and line 6 changes the
value of the variable. Line 7 uses the plus sign. The value of 6 is to the right of the plus
sign and so d = 6 in line 4. Since v1 is to the left of the plus sign, the self.a will be
v1.a.
There are many different operators that can be overloaded. Table 13.1 shows a
subset of the possibilities.
Code 13.6 Overloading the addition operator.
1 class MyClass :
2     a = 0
3     def __add__ ( self , d ) :
4         return self . a + d
5 >>> v1 = MyClass ()
6 >>> v1 . a = 5
7 >>> v1 + 6
8 11

Code 13.7 shows four more overloaded functions that are not in the above table. The first one is __init__ which is the constructor function. This function is automatically called when an object is created. Line 14 creates an object and that line will also execute line 3 which creates a list with N entries that are all 0.

Code 13.7 Overloading __init__, __setitem__, __getitem__ and __str__.
1 class MyVector :
2     def __init__ ( self , N ) :
3         self . vals = [ 0 ] * N
4     def __setitem__ ( self , k , v ) :
5         self . vals [ k ] = v
6     def __getitem__ ( self , k ) :
7         return self . vals [ k ]
8     def __str__ ( self ) :
9         st = ' Values : '
10         for v in self . vals :
11             st = st + ' : ' + str ( v )
12         return st
13
14 >>> v1 = MyVector (5 )
15 >>> v1 [1] = 9
16 >>> v1 [1]
17 9
18 >>> print ( v1 )
19 Values : : 0 : 9 : 0 : 0 : 0
Line 4 overloads the __setitem__ function. This function is used to set the value of an item in a list, tuple or array. Line 15 calls this function. Line 6 defines the __getitem__ function which retrieves the value of an element in a tuple, list or array. Line 16 calls this function. Finally, line 8 defines the __str__ function which creates the string that the print function uses. This function must return a string (line 12). The contents of that string are defined by the programmer. A call to this function occurs with line 18.
13.2.5 Inheritance
Inheritance is the ability of one class to adopt the data and attributes of other classes.
An example is shown in Code 13.8. Lines 1 through 6 create a class named Human. This
has a first and last name and the ability to nicely print that information as shown in lines 7
through 11. Line 12 starts the definition of a new class named Soldier which has Human
in parenthesis. This means that all of the data and functions defined in Human are also
in Soldier. Basically, Soldier is a Human. The programmer need only to write code in
Soldier for those variables and functions that are unique to a soldier. In this case, only
the rank variable is added. Lines 14 through 16 declare a new soldier and line 17 calls the
function defined in line 4.
Code 13.8 An example of inheritance.
1 class Human :
2     firstname = ' '
3     lastname = ' '
4     def PrintName ( self ) :
5         print ( self . firstname , self . lastname )
6
7 >>> h = Human ()
8 >>> h . firstname = ' Aimee '
9 >>> h . lastname = ' Harper '
10 >>> h . PrintName ()
11 Aimee Harper
12 class Soldier ( Human ) :
13     rank = ' '
14 >>> s = Soldier ()
15 >>> s . firstname = ' Matthew '
16 >>> s . lastname = ' Luedecke '
17 >>> s . PrintName ()
18 Matthew Luedecke
Inherited classes are particularly useful for very complex programs. Each class is a
building block and the ability to inherit allows building blocks to be stacked on top of
others.
A class may inherit multiple classes by separating them with commas in the declara-
tion. For example class NewClass( Class1, Class2) would be used to allow NewClass
to be built from both Class1 and Class2.
One of the features of Python is that it has the ability to add new variables to a class once
the instance has been created. This is shown in Code 13.9 which continues from Code
13.8. Line 1 creates a new variable ssn and attaches it to the current instance of Soldier.
Lines 2 and 3 confirm that this was acceptable. Line 5 creates a new instance of Soldier
and as seen by the error generated from line 6, this instance does not have ssn.
The good news is that new variables can be attached to objects after the object has been created. The bad news is identically the same. In languages like C++ a variable must be declared inside of the object before an instance is created. Thus, the coding shown
in Code 13.9 is not possible. However, this is a good way to catch typos during coding.
In the case of Python there is no safeguard. For example, the variable lastname already exists and the case may arise that after marriage the person needs to change their last name in the database. The programmer could write s.lastname = 'Kershaw' but they
Code 13.9 Creating new variables after the creation of an object.
1 >>> s . ssn = ' 123-45-6789 '
2 >>> s . ssn
3 ' 123-45-6789 '
4
5 >>> t = Soldier ()
6 >>> t . ssn
7 Traceback ( most recent call last ) :
8 File " < pyshell #265 > " , line 1 , in < module >
9 t . ssn
10 AttributeError : ' Soldier ' object has no attribute ' ssn '
could also make a mistake and write s.lasname = 'Kershaw'. In Python a new variable
is created and the old variable is not changed. The typo did not generate an error like it
would in other languages.
Chapter 14
Random Numbers
Random numbers are just what their name implies: numbers generated by a program that are independent of one another. The second random number has nothing to do with the first.
While the concept is easy, the interesting question is how does a computational
engine generate random numbers. There has been a field of study dedicated to the com-
putational process of generating purely random numbers. This chapter will review uses of
random numbers and the features of some of the Python functions.
The numpy module provides a package of random number generators. The random.rand
and random.ranf functions create random numbers that are equally distributed between
0 and 1. Code 14.1 shows the generation of a single random number.
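A sketch of the call (the printed value is illustrative and will differ from run to run):

>>> import numpy as np
>>> np.random.rand ()
0.5488...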
This same function can be used to generate a long vector of random numbers as
shown in Line 1 of Code 14.2. This generates 100,000 random numbers. Since they are
evenly distributed between 0 and 1 then the average should be very close to 0.5. This is
shown to be the case in Lines 2 and 3.
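A sketch of these steps (the exact mean is illustrative and varies by run):

>>> v = np.random.rand ( 100000 )
>>> v.mean ()
0.4997...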
Code 14.2 Many random numbers.
14.2 Randomness
That is not a sufficient test to determine if a set of numbers is truly random. It is
possible that a function can generate a set of random numbers but the generator becomes
repetitive as shown in Figure 14.2 where the pattern repeats after x = 1024. The average
of these numbers is still 0.5 and the histogram is flat, but the generator is not really
generating random numbers.
One way of determining if a function is repetitive is to perform an auto-correlation.
This function computes the inner product for all possible shifts of a function. If a function
is not repetitive (and it is zero-sum) then the auto-correlation will have a single simple
spike because there is only one possible shift of a function with itself in which the values
are self-similar. The auto-correlation is shown in Figure 14.3.
The scipy module offers a correlate function in the signal package. This is shown in
Code 14.3. Line 2 creates a vector of zero-sum random numbers and Line 3 makes a new
Figure 14.2: A repeating function.
vector that has this original vector repeating 10 times. Thus, this is a vector of random
numbers with a repeating sequence. Line 4 performs the auto-correlation.
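A minimal sketch of the steps described, with the vector length as an assumption:

import numpy as np
from scipy import signal

v = np.random.rand ( 1024 ) - 0.5    # zero-sum random vector
r = np.tile ( v , 10 )               # the vector repeated 10 times
ac = signal.correlate ( r , r )      # auto-correlation of the sequence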
If the sequence is repeating then there are several shifts of the data that aligns with
the original data. Thus there are several spikes in the auto-correlation as shown in Figure
14.4.
There are other types of random distributions but the only one that is reviewed here is
the Gaussian distribution. These are not evenly distributed between 0 and 1, but are
distributed according to a bell curve function,
$$ f(x) = A \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \qquad (14.1) $$

where A is the amplitude, µ is the average and σ is the standard deviation. The average of a set of numbers is computed by

$$ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad (14.2) $$

where N is the number of samples and the x_i are the samples. The standard deviation is computed by

$$ \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 }. \qquad (14.3) $$
The standard bell curve is shown in Figure 14.5. The amplitude is the height of the function, the average is the horizontal location, and the standard deviation is roughly the half-width at half-maximum.
Figure 14.6 shows the function that computes the Gaussian values in Excel. This is plotted
as shown in Figure 14.7.
Excel requires some inputs from the user to generate a histogram. The procedure begins in Figure 14.8. On the left is the original data. On the right the user manually
enters the bins for the histogram. The next step is to select Data Analysis as shown in
Figure 14.9. This selection will produce the popup menu shown in Figure 14.10. The
user selects Histogram.
The selection Histogram computes values placed on a new sheet as shown in Figure
14.11. These are the bins and frequencies of the histogram. The plot of these values is
shown in Figure 14.12.
Figure 14.6: The Gaussian distribution in Excel.
Figure 14.9: Selecting Data Analysis.
Figure 14.12: The plot of the results.
Figure 14.13 shows the Python command to compute a histogram. This process is paused
to show the help balloon to assist the user in providing the correct information. Code 14.4
shows the command and the returned results. The x values are the bin values and the y
values are the frequencies.
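A sketch of the call, with the sample data and bin count as assumptions; note that the frequencies are returned first and the bin edges second:

>>> data = np.random.normal ( 0 , 1 , 10000 )
>>> y , x = np.histogram ( data , bins =20 )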
The scipy.random package offers the normal function which generates random numbers
based on a Gaussian distribution instead of a flat distribution. The call to the function is
shown in Code 14.5.
Code 14.5 Help on a normal distribution.
5 normal (...)
6 normal ( loc =0.0 , scale =1.0 , size = None )
Code 14.6 shows the call with three arguments. The first is the location or mean,
the second is the scale or the standard deviation, and the third is the number of random
numbers to be generated. Thus, this call produces 2 random numbers that are based on
the distribution of µ = 2.0 and σ = 1.3.
Code 14.7 is the same call except that it generates 10,000,000 numbers in this dis-
tribution. This is such a large sample that the average and standard deviation of this
sample should match the input parameters. This is so as depicted in Lines 2 through 5.
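Sketches of the two calls (the printed values are illustrative):

>>> np.random.normal ( 2.0 , 1.3 , 2 )
array ([ 3.037... , 1.378... ])
>>> r = np.random.normal ( 2.0 , 1.3 , 10000000 )
>>> r.mean ()
2.0001...
>>> r.std ()
1.2998...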
In many cases there is more than one input variable. Consider a case where the inves-
tigation concerns human health. The output is the probability of contracting a specific
disease but the input is a list of factors such as:
Cigarettes,
Drinking, and
Exercise
There is a need for a distribution function that has several inputs. As these are difficult to
draw with more than two inputs a simple case is considered. A Gaussian distribution with
two input parameters is shown in Figure 14.14. The two horizontal axes are the inputs
and the vertical axis is the output.
The distribution in Figure 14.14 has the multivariate form

$$ f(\vec{x}) = A \exp\left( -\frac{1}{2} (\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu}) \right). $$

This equation is actually similar to Equation (14.1). Both equations have an amplitude A. In the exponent both equations have a −1/2. The (x − µ)² in (14.1) is replaced by the vector form (x⃗ − µ⃗)ᵀΣ⁻¹(x⃗ − µ⃗), and the σ⁻¹ is replaced by Σ⁻¹, where Σ is the covariance matrix. The diagonal elements are related to the variances of the individual components. So, Σ_{1,1} is related to the variance of the first variable.
The off-diagonal elements are related to the covariance. So, Σ_{i,j} is the covariance between the i-th and j-th variable. This value is positive if the two variables are linked. So, if x_j goes up when x_i goes up then there is a positive covariance. If x_j goes down when x_i goes up then there is a negative covariance. If the two variables have nothing to do with each other then they are independent and their covariance value is 0. The vector µ⃗ controls
the location of the center of the distribution and Σ controls the shape and orientation of
the distribution.
Scipy offers the multivariate_normal function which generates random vectors based on a multivariate distribution. This is shown in Code 14.8.
Code 14.9 displays a small test. The first two lines generate the location and covari-
ance matrix of the distribution. Line 3 generates 100,000 random vectors based on this
distribution. Line 5 computes the covariance matrix based on the generated data which
is similar to the matrix that created the data (Line 2). Likewise, Line 7 computes the
average of the vectors and this matches the generating vector of Line 1.
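A sketch of these steps (the location and covariance values are assumptions):

import numpy as np

mu = np.array ( [0.0 , 0.0] )                    # location vector
cov = np.array ( [[1.0 , 0.3] , [0.3 , 2.0]] )   # covariance matrix
data = np.random.multivariate_normal ( mu , cov , 100000 )
print ( np.cov ( data.T ) )                      # close to cov
print ( data.mean ( 0 ) )                        # close to mu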
14.5 Examples
This section has several examples that use random number generators.
14.5.1 Dice
Code 14.10 shows a script for simulating rolling a single die. There are six sides each with
an equal chance of being on the up side. So, Line 2 creates the six choices and Line 3
makes a single choice simulating a single roll of the die.
The random.choice function will select one item at random from a list. A second argument is the number of selections that are to be made. Thus, Line 1 in Code
Code 14.10 Random dice rolls.
1 >>> import numpy as np
2 >>> dice = [1 ,2 ,3 ,4 ,5 ,6]
3 >>> np . random . choice ( dice )
4 2
14.11 simulates the rolling of two dice. Two more examples are shown in the following
lines.
Code 14.11 Random dice rolls.
1 >>> np . random . choice ( dice ,2 )
2 array ([1 , 2])
3 >>> np . random . choice ( dice ,2 )
4 array ([1 , 4])
5 >>> np . random . choice ( dice ,2 )
6 array ([3 , 6])
Code 14.12 rolls two dice 1000 times and captures the sum of each pair of dice. The histogram of these rolls is stored and shown in Figure 14.16. As seen it is far more common to roll a 7 than it is to roll a 2.
Code 14.12 Distribution of a large number of rolls.
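One way to carry this out, assuming the dice list from Code 14.10 (the bin edges are an assumption):

>>> rolls = np.random.choice ( dice , (1000 , 2) ).sum ( 1 )
>>> y , x = np.histogram ( rolls , bins = range (2 , 14) )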
14.5.2 Cards
This section shows how to create a deck of cards and to shuffle them. Line 1 in Code
14.13 creates a list of the face values of the cards and Line 2 creates a list of the suits.
Figure 14.16: Histogram of rolling 2 dice.
The for loops started in line 4 create the full deck of cards, some of which are printed to
the console.
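A sketch of such a deck, with the face and suit labels chosen to match the shuffled output in Code 14.14:

faces = [ '2' , '3' , '4' , '5' , '6' , '7' , '8' , '9' , '10' , 'J' , 'Q' , 'K' , 'A' ]
suits = [ 'hearts' , 'clubs' , 'diamonds' , 'spades' ]
cards = []
for f in faces :
    for s in suits :
        cards.append ( f + ' ' + s )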
The random.shuffle function rearranges the items in the list, which in this case is
equivalent to shuffling the deck. The result is shown in Code 14.14.
This section creates a random string from a finite alphabet. The example is to create a
DNA string and so the alphabet is merely four letters, A, C, G and T.
Code 14.15 shows a method by which this can be done. Line 1 establishes the
alphabet. Line 2 creates 100 random numbers which will determine the 100 random
Code 14.14 Shuffled cards.
1 >>> np . random . shuffle ( cards )
2 >>> cards [:10]
3 [ ' 9 diamonds ' , ' 4 spades ' , ' 8 hearts ' , ' 9 spades ' ,
4 ' 6 spades ' , ' 9 clubs ' , ' 7 clubs ' , ' Q diamonds ' ,
5 ' K hearts ' , ' 3 hearts ' ]
characters. Line 3 converts the random numbers to random integers from 0 up to 4. Line
4 extracts from the alphabet the letters according to the positions listed in r. In this case
the first few values in r are [0,2,1,3,1,2...] and so the first few letters in the string
are AGCTCG... Line 5 converts the list to a single string.
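A sketch of these five steps:

abet = 'ACGT'                         # the four letter alphabet
r = np.random.rand ( 100 )            # 100 random numbers
r = ( r * 4 ).astype ( int )          # random integers from 0 up to 4
letters = [ abet [ i ] for i in r ]   # letters chosen by position
dna = ''.join ( letters )             # convert the list to a single string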
Problems
1. Compute the average of sets of random numbers. The number of samples in the sets
should be 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 and 4096. Plot the average
of the random values in each set versus the number of samples.
4. Plot the histogram of 10,000 samples from a normal distribution with µ = 0.5 and
σ = 0.3.
5. Plot the histograms of two normal distributions. The first has 10,000 samples with
µ = 0.5 and σ = 0.4. The second has 9,000 samples with µ = 0.3 and σ = 0.2. What
is the value of x where the two distributions cross over?
6. Create a random DNA string with 1000 letters, but the probability of having an ’A’
is twice as much as the other three letters.
Chapter 15
15.1 Protocol
This section will repeat the same computations as in Chapter 4 with Python scripts.
15.2 A Single File
Code 15.1 displays the LoadExcel function that uses the xlrd module to load directly
from the spreadsheet. There is only one sheet in this workbook and it is named ‘Export’.
This data is collected in line 4. Lines 6 through 10 find the row with the string “Begin
Data” which signifies where the data rows are found.
The actual reading of the data begins in line 12. Each row is collected and only the
pertinent columns of data are stored which is performed in lines 15 and 16. The result is
a list and each item in this list is a list of six items. These are the gene number, name,
channel 1 intensity, channel 1 background, channel 2 intensity and channel 2 background.
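A sketch of what such a function could look like with the xlrd module. The sheet name and the “Begin Data” marker come from the description above, while the exact column positions are assumptions:

import xlrd

def LoadExcel ( fname ) :
    book = xlrd.open_workbook ( fname )
    sheet = book.sheet_by_name ( 'Export' )     # the single sheet
    # find the row that contains 'Begin Data'
    begin = 0
    for i in range ( sheet.nrows ) :
        if 'Begin Data' in map ( str , sheet.row_values ( i ) ) :
            begin = i + 1
            break
    # keep only the pertinent columns of each data row
    ldata = []
    for i in range ( begin , sheet.nrows ) :
        ldata.append ( sheet.row_values ( i ) [:6] )   # assumed layout
    return ldata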
There are 1600 rows of data and so efficiency in processing can be gained by putting
the last four channels into matrices. The function Ldata2Array shown in Code 15.2
creates two matrices intens and backg. The first has two columns and 1600 rows which
are the measured intensities of the two channels. The matrix backg is the same size and
is the measured background intensities of the two channels.
The next step is to subtract the background from the intensity. However, there are
a few spots that have issues either in construction or detection in which the intensity level
is less than the background. These need to be removed. This process is started in the
function MA shown in Code 15.3. Line 3 performs the subtraction. Line 4 creates the
variable mask which contains binary values. These are 1 for the cases in which the subtraction
produces a positive value and 0 for those few cases in which there is a negative value. Line 5 keeps those values that are positive and replaces the negative values with the value of 1.
Code 15.2 The Ldata2Array function.
1 # mapython . py
2 def Ldata2Array ( ldata ) :
3 N = len ( ldata )
4 intens = np . zeros ( (N ,2) )
5 backg = np . zeros ( (N ,2) )
6 for i in range ( N ) :
7 intens [i ,0] = ldata [ i ][2]
8 intens [i ,1] = ldata [ i ][4]
9 backg [i ,0] = ldata [ i ][3]
10 backg [i ,1] = ldata [ i ][5]
11 return intens , backg
The process then replicates that in Chapter 4. The next step is to calculate the
ratio R/G and the average I. The log2 of these values creates the values M and A. This
function returns two vectors which are the M and A values for a single file.
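A sketch of such an MA function; the channel order in the ratio is an assumption:

import numpy as np

def MA ( intens , backg ) :
    # subtract the background; replace non-positive results with 1
    diff = intens - backg
    mask = diff > 0
    diff = diff * mask + (1 - mask)
    # the R/G ratio and the average intensity, then log2 to get M and A
    M = np.log2 ( diff [:,1] / diff [:,0] )    # channel order assumed
    A = np.log2 ( ( diff [:,0] + diff [:,1] ) / 2.0 )
    return M , A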
The gnu module provides a function to save the data for a plotting program and
this is called in the Plot function shown in Code 15.4. A matrix named temp is created
to hold the data and this is sent to the Save function for plotting. The result is the same
as shown in Figure 4.8.
LOESS normalization is performed in the LOESS function shown in Code 15.5.
This follows the process described in Section 4.4 where the first step is to sort the data
according to the values of A. The sort order is obtained in line 5 and the gene numbers
are created in line 4 and sorted in line 6. The values of M are sorted in line 7.
The for loop begins the normalization process. Lines 10 through 15 set up the limits
for local averages with alterations for those cases where the data point is near either the
Code 15.4 The Plot function.
1 # mapython . py
2 def Plot ( M , A , outname ) :
3 N = len ( M )
4 temp = np . zeros ( (N ,2) )
5 temp [: ,0] = A
6 temp [: ,1] = M
7 gnu . Save ( outname , temp )
beginning of the list or the ending of the list. Once the limits are established the average
is computed in line 16 and is subtracted from the value in line 17.
The final two lines prepare these values for comparison with other files by subtracting
the average and dividing by the standard deviation. This follows the process from Section
4.5. Code 15.6 shows the steps for processing a single file.
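A sketch of the LOESS normalization and of the single file steps; the window width and the file name are assumptions:

def LOESS ( M , A , width =100 ) :
    N = len ( M )
    order = A.argsort ()             # sort order according to A
    sortM = M [ order ]
    newM = np.zeros ( N )
    for i in range ( N ) :
        # local window, adjusted near either end of the list
        lo = max ( 0 , i - width //2 )
        hi = min ( N , i + width //2 )
        newM [ order [ i ] ] = sortM [ i ] - sortM [ lo : hi ].mean ()
    # prepare for comparison with other files
    return ( newM - newM.mean () ) / newM.std ()

# processing a single file, along the lines of Code 15.6
ldata = LoadExcel ( 'data/file1.xls' )      # hypothetical file name
intens , backg = Ldata2Array ( ldata )
M , A = MA ( intens , backg )
newM = LOESS ( M , A )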
With the ability to process a single file to the normalized LOESS values it is possible to
compare values from different files. The first step is to collect the names of the files to
be used. It is assumed that the files are all in a subdirectory and that there are no other
Excel files in that subdirectory. The GetNames function shown in Code 15.7 gathers all
of the names from a directory and places into a list named names all of those files that are
Excel files. These names come complete with the directory string. The output is a list of
Excel file names.
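A sketch of such a GetNames function:

import os

def GetNames ( indir ) :
    # gather the Excel file names, complete with the directory string
    names = []
    for f in os.listdir ( indir ) :
        if f.endswith ( '.xls' ) :
            names.append ( indir + '/' + f )
    return names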
The data from all of the files can now be collected. This is performed in the AllFiles
function shown in Code 15.8. The input is the list of names. Inside of the for loop it
loads the data, converts the data to matrices, performs the calculations and places the
M values in a column of the matrix mat. This matrix has 1600 rows and the number of
columns is the same as the number of file names in names. The output is a matrix with
all of the normalized values.
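A sketch of the AllFiles function as described:

def AllFiles ( names ) :
    mat = np.zeros ( (1600 , len ( names )) )   # one column per file
    for i in range ( len ( names ) ) :
        ldata = LoadExcel ( names [ i ] )
        intens , backg = Ldata2Array ( ldata )
        M , A = MA ( intens , backg )
        mat [:, i ] = LOESS ( M , A )
    return mat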
Now, the data from all of the files is collected and the user can ask questions of the
data. One example is to collect the genes that are expressed for males but not females. In
order to ask this question only three files can be used. In the example data set there are
10 files but only 3 are used to pursue this question, so it is necessary to define a variable
to designate which columns in mat will be used. This is the list cols which simply lists
the column numbers. It is also necessary to designate if the expressed value is expected to
be greater than 1 or less than -1. Binary values in the list sels provide this information.
The call for this function is shown in the last line of Code 15.9. The input data is
the output from AllFiles. The second argument indicates that only columns 0, 4 and 8
will be used. The third argument indicates that in the first column the search is for values
of 1 or more, the second column is searched for values of -1 or less, and the last column is
searched for values of 1 or more.
The Select function extracts the needed columns in lines 7 and 8. The loop started on line 9 begins the search for the desired values. In this case, the matrix temp has three columns and 1600 rows. The values are 1 if the gene is expressed and 0 otherwise. The vector tot sums temp horizontally. Line 15 searches for those values that are 2 or higher indicating which rows had at least two files that expressed the gene.
Finally, the Isolate in Code 15.10 finds the genes of interest. There are at least two
samples of each gene in the data file and so values are collected according to gene name.
Line 3 creates a new dictionary and the key will be the gene name. The input hits is
the data from Select and is the gene number of those genes that are expressed. The loop
started in line 4 considers each of these. If the gene has been seen before then line 7 is
used to append the gene number to the list in the dictionary entry. If the gene had not
been seen before then line 9 is used to create the dictionary entry.
The search is for cases where the gene is expressed in at least two files for both
Code 15.9 The Select function.
1 # mapython . py
2 def Select ( mat , cols , sels ) :
3 answ = []
4 N = len ( mat )
5 M = len ( cols )
6 temp = np . zeros ( (N , M ) )
7 for i in range ( M ) :
8 temp [: , i ] = mat [: , cols [ i ]]
9 for i in range ( M ) :
10 if sels [ i ] == 1:
11 temp [: , i ] = temp [: , i ] > 1
12 else :
13 temp [: , i ] = temp [: , i ] < -1
14 tot = temp . sum (1)
15 hits = ( tot >=2) . nonzero () [0]
16 return hits
17 >>> hits = Select ( mat , [0 , 4 , 8] , [1 , 0 , 1] )
samples of the gene. This is performed in lines 14 through 16. The few results are listed
and these match those from Chapter 4.
Code 15.10 The Isolate function.
1 # mapython . py
2 def Isolate ( hits , ldata ) :
3 genes = {}
4 for i in hits :
5 gene = ldata [ i ][1]
6 if gene in genes :
7 genes [ gene ]. append ( i )
8 else :
9 genes [ gene ] = [ i ]
10 return genes
11
Part III
Computational Applications
Chapter 16
DNA as Data
This chapter reviews some of the basic ideas of DNA and then proceeds to consider pro-
grams to read in the standard files. This chapter concludes with a couple of applications.
16.1 DNA
Each cell in any animal or plant contains a vast amount of DNA (deoxyribonucleic acid).
A typical cell contains a nucleus surrounded by cytoplasm as depicted in Figure 16.1.
Figure 16.1: A simple depiction of a cell with a nucleus, cytoplasm, nuclear DNA and mitochondrial
DNA.
Within the nucleus of a human cell reside 22 pairs of chromosomes plus either an XX or an XY pair depending on the gender. These chromosomes contain strands of DNA which are coiled as a double helix as shown in Figure 16.2. This helix though is precisely folded several times so that a large amount of DNA can fit into a tiny cell.
Connecting the two helices are nucleotides of which there are four different variations:
Figure 16.2: A caricature of the double helix nature of DNA.
A: adenine,
T: thymine,
C: cytosine, and
G: guanine.
Each of these are commonly represented by their first letter. Thus, a long strand of DNA
is represented by a long string of letters from a four letter alphabet.
The opposing helix contains the complementary strand. Wherever the first strand has an A the complement has a T. Likewise, the complement of T is A. The C and G are
also complements. Thus, if the DNA sequences in one strand are known then the sequence
in the complementary strand is also known.
Within a single human cell the nucleus contains over 3 billion nucleotides. If this DNA were unfolded and connected into a single strand it would be about 2 meters long.
So, complicated folding is absolutely required.
Not all of the DNA is located in the nucleus. Mitochondrial DNA is located in the
cytoplasm. These are short rings of DNA that are inherited from the biological mother.
Viruses and bacteria are also constructed from DNA.
Segments of DNA contain the information needed to create proteins which are long
strands of amino acids. However, vast regions of the DNA are not used for this purpose.
The process of creating proteins begins with the DNA unfolding to expose segments of the
helix. These segments are replicated creating short strands of mRNA (messenger RNA)
which escape from the nucleus into the cytoplasm. During this process, thymine is converted to uracil and so the representative string replaces T with U.
Once in the cytoplasm a ribosome attaches to the mRNA and then traverses the strand
building a protein as depicted in Figure 16.3. In this process, three nucleotides are trans-
lated into an amino acid, and when completed this chain is the protein. The group of
three nucleotides is called a codon. The translation table from codon to amino acid is
shown in Figure 16.4. So in the image the first codon ACC is used to create T, GAC is used to create D, etc.
Figure 16.3: The ribosome travels along the mRNA using codon information to create a chain of
amino acids.
Of course, the process is not at all as simple as this. There are several complications
some of which require intense study to comprehend. One of the major complications is
that the gene may be encoded as splices in the DNA. To create a single gene, several
locations in the DNA are used. Figure 16.5 shows the case where four different splices
(labeled A, B, C and D) are used to create a long strand of mRNA which is then translated
into a single protein. The coding regions are named exons and the intermediate regions
are called introns. Splicing can be even more complicated. It is possible that a gene is
created from exons A, B and D while a different gene is created from exons A, C and D.
Genes can exist on either strand of the helix. Detecting these genes is a science in
itself. Commonly, the beginning of a gene has ATG as the first codon. This is named
the start codon. However, this combination of nucleotides exists throughout the genome and the presence of this combination is most often not a start codon. This codon also codes
for the amino acid methionine. It is also possible that the three nucleotides can exist in
this pattern by fortune. For example one codon may be TAT and the next GCC. This
combination also has the consecutive nucleotides ATG. Finally, this combination can also
exist in a non-coding region. There are other start codons that are possible: GTG and
TTG. Even rarer are ATT and CTG. There are three codons that are considered to be stop
codons: TAG, TGA, and TAA. However, not all coding regions end with a stop codon.
For a contiguous coding region the number of nucleotides between a start and stop
codon should be a multiple of three since there are three nucleotides in a codon. However,
if the gene is constructed from splices then there are intron regions without any restriction
on the number of nucleotides.
The non-coding regions between the genes are not necessarily random either. There
are many regions in which the DNA sequence repeats. The number of nucleotides that
compose a repeating segment varies, the number of repeats vary, and the pattern of the
repeat can also vary. Since these regions are not used in creating genes, mutations are
not devastating to the host. So, these regions are less conserved through evolution. A
mutation occurs when a nucleotide in a child’s DNA has been changed from the parent’s.
Figure 16.4: Codon to Amino Acid Conversion
Figure 16.5: Spliced segments of the DNA are used to create a single protein.
As stated the length of a non-spliced coding region should be a multiple of three and that
this coding region should begin with a start codon and end with a stop codon. Since
bacteria rarely have spliced genes such a genome can be examined. The goal of this
application is to inspect every gene in a genome and capture those that do not have a
length that is a multiple of three or the correct start and stop codons.
The file used is from the Genbank repository and is identified uniquely by an acces-
sion number. These are detailed in Chapter 18. For this application the data has been
extracted from the Genbank file and stored in two files.
The first file contains the DNA string which has over 490,000 characters. The second file
is a tab delimited file with three columns. This file can be imported by a spreadsheet to
view. The first column is a start location, the second column is the stop location and the
third column is a complement flag. The first row in this file has three values: 883, 2691
and 0. The last value is either 0 or 1 and in this case the 0 indicates that this is not a
complement string. The beginning of the string is at location 883. However, there will be
a discrepancy since the Genbank file starts counting at 1 and Python starts counting at
0. So, after Python has read the string from the file the starting location will actually be
882.
Reading the DNA file is simple as shown in Code 16.1 which shows the LoadDNA
function. It simply reads the text file and returns the contents.
Code 16.2 shows the call to this function in line 2. This is a very long string as
confirmed in line 4. Therefore, the whole string should never be printed to the console.
Users of the IDLE interface will quickly learn that attempting to print such long strings
Code 16.1 The LoadDNA function.
1 # simpledna . py
2 def LoadDNA ( dnaname ) :
3 dna = open ( dnaname ) . read ()
4 return dna
will bring the interface to a crawl. Line 5 shows that the loading of the file can be confirmed
by printing out a much smaller portion.
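A sketch of such a session (the file name is hypothetical):

>>> dna = LoadDNA ( 'data/genome.txt' )
>>> len ( dna )     # over 490,000 characters
>>> dna [:20]       # print only a small portion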
Reading the second file requires a bit more programming as it is more than just a single
string of data. Rather this process is that of reading a tab delimited spreadsheet as shown
in Section 8.5.2. Code 16.3 shows the LoadBounds function which reads in the entire file
as a string in line 4. The outer loop started in line 8 considers each row of data and the
inner loop started in line 9 considers each of the three entries in that row. These entries
are converted to integers and appended as a tuple to the list bounds.
The function is called in line 17. To ensure that the read was successful the length
of the list and the first item in that list is printed. So, now both the DNA string and the
information about the locations of the genes have been loaded.
Line 21 of Code 16.3 shows that the coding region starts at location 883 in the Genbank
data. Since Python starts indexing at 0 instead of 1 the location of the start of the gene
in the DNA string is actually 882. Line 1 in Code 16.4 computes the length of the gene
which is 1809. Line 4 shows that this is divisible by 3 which passes one of the tests for a
gene.
The first codon is printed in line 6 and the last codon is printed in line 8. These
do qualify as a start and stop codon respectively. So, this has the three qualities that are
Code 16.3 The LoadBounds function.
1 # simpledna . py
2 def LoadBounds ( boundsname ) :
3 fp = open ( boundsname )
4 rawb = fp . read ()
5 fp . close ()
6 bounds = []
7 bdata = rawb . split ( ' \ n ' )
8 for i in range ( len ( bdata ) ) :
9 if len ( bdata [ i ] ) > 1:
10 start , stop , cflag = bdata [ i ]. split ( ' \ t ' )
11 start = int ( start )
12 stop = int ( stop )
13 cflag = int ( cflag )
14 bounds . append ( ( start , stop , cflag ) )
15 return bounds
16
1 >>> 2691-882
2 1809
3 >>> 1809 % 3
4 0
5 >>> dna [882:885]
6 ' atg '
7 >>> dna [2688:2691]
8 ' taa '
sought.
Some of the genes are complements and so it is necessary to convert them to the
complementary string before analysis can be executed. The genbank module does have
a Complement function that can perform this conversion. This function is detailed in
Chapter 18. Code 16.5 imports this module in line 1.
The second gene in the data is a complement. Line 3 in Code 16.5 shows that the
last item is a 1 which indicates that this is a complement. The coding portion for this
gene is extracted to a string named cut in line 4. The complement is computed in line 4.
As seen the first codon of comp is a start codon and the last codon is a stop codon.
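A sketch of these steps, using the Complement function from the genbank module (the bounds values themselves come from the data file):

>>> import genbank as gb
>>> start , stop , cflag = bounds [1]    # the second gene; cflag is 1
>>> cut = dna [ start -1 : stop ]
>>> comp = gb.Complement ( cut )
>>> comp [:3]       # a start codon
>>> comp [ -3:]     # a stop codon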
The function CheckForStartsStops in Code 16.6 performs the three checks on
all genes. The inputs are the DNA string and the list of bounds. Line 4 creates the list
named bad which will capture information about any gene that does not pass the tests.
Information about the first gene is obtained in line 6 and the string cut is the coding region
for a single gene. If the complement flag is 1 then line 9 will be used which computes the
complement of the gene. Line 10 determines if the string length is a multiple of three.
This computes the modulus and if m3 is 0 then the length is a multiple of 3.
The start and stop codons are extracted in lines 11 and 12. Line 13 begins a long if
statement. The backslashes at the end of lines 13 and 14 indicate that the line continues
to the next line. This complicated structure determines if the gene does not have a start
codon, stop codon or the length is not a multiple of 3. If any condition fails then line 18
is used and the list bad gets an entry.
Code 16.7 calls CheckForStartsStops and returns a list that contains all genes
that failed the tests. As seen this list has 0 entries and therefore all genes in this bacteria
have a length that is a multiple of 3 and a proper start and stop codon.
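A sketch of the call described as Code 16.7:

>>> bad = CheckForStartsStops ( dna , bounds )
>>> len ( bad )
0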
Code 16.6 The CheckForStartsStops function.
1 # simpledna . py
2 def CheckForStartsStops ( dna , bounds ) :
3 N = len ( bounds )
4 bad = []
5 for i in range ( N ) :
6 start , stop , cflag = bounds [ i ]
7 cut = dna [ start-1: stop ]
8 if cflag :
9 cut = gb . Complement ( cut )
10 m3 = ( stop-( start-1) ) % 3
11 startcodon = cut [:3]
12 stopcodon = cut [-3:]
13 if m3 ==0 and ( startcodon == ' atg ' or startcodon == ' gtg ' \
14 or startcodon == ' ttg ' ) and ( stopcodon == ' tag ' or \
15 stopcodon == ' taa ' or stopcodon == ' tga ' ) :
16 pass
17 else :
18 bad . append ( ( start , stop , cflag ) )
19 return bad
Problems
1. In the file provided, write a Python script to load the DNA and then count the
number of ATG’s that exist in the data.
2. In the file provided, write a Python script that gathers in a list the location of all of
the ATG’s.
3. In the file provided, write a Python script to gather all of the codons that immediately
precede the ATG’s.
5. Write a Python script to load the spreadsheet data and find the shortest gene.
6. Create a dictionary in which the keys are the codons and the values are the associated
amino acid. Write a Python script to convert the first gene from the list of DNA to
a list of amino acids using this dictionary.
Chapter 17
Application in GC Content
Some regions in the DNA are rich in cytosine and guanine. These are called GC rich regions. This chapter will explore methods to search for these regions.
17.1 Theory
The GC content is measured as the number of G’s and C’s over a finite window of length W,

$$ \rho = \frac{N_C + N_G}{N_A + N_C + N_G + N_T}, \qquad (17.1) $$

where N_k is the count of nucleotide k over this window. In most cases the denominator
will also be the window size, but there are cases where a nucleotide is known to exist at
a location but the identification of that nucleotide has been difficult to achieve. In those
cases the denominator may be smaller than the window size.
The computation is performed over a window that slides along the DNA sequence
as shown in Figure 17.1. In this example the window width is 8 and so it includes 8 nucleotides. In the first window there are 3 G’s and 2 C’s, and so the value of the GC content is ρ = 5/8. The step value is 4 and in the next time step the window is moved 4 places to the right and the computation is repeated; it also produces a value of ρ = 5/8. In the third time step ρ = 4/8, and in the last time step ρ = 5/8. In a real application both
the window and the step sizes are much larger.
The concept is easy to implement in Python code as shown in Code 17.1. The function
GCcontent receives a string of DNA named instr. There are also two additional
arguments that control the window size and the step size. Line 5 considers a substring
Figure 17.1: A sliding window with a width of 8 and a step of 4.
of the DNA which is named cut. This is converted to lowercase for further processing.
The next four lines count the number of occurrences of each nucleotide and the ratio is computed in line 10, which is appended to a list named answ and returned to the user.
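A sketch of such a GCcontent function; the default window and step sizes are assumptions:

def GCcontent ( instr , window =256 , step =128 ) :
    # slide a window along the string and compute the GC ratio in each
    answ = []
    for i in range ( 0 , len ( instr ) - window + 1 , step ) :
        cut = instr [ i : i + window ].lower ()
        na = cut.count ( 'a' )
        nc = cut.count ( 'c' )
        ng = cut.count ( 'g' )
        nt = cut.count ( 't' )
        answ.append ( ( nc + ng ) / ( na + nc + ng + nt ) )
    return answ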
A very simple example is shown in Code 17.2. Line 2 creates a string and line 3 calls
GCcontent. In this case the window size is 8 and the step size is 4. The string does have
a GC rich region towards the beginning but ends in a GC poor finale. These attributes
are reflected in the values returned by the function. As the window passes through the
GC rich region the value of ρ becomes much larger than 0.5, and as the window passes
through the GC poor region the value falls much lower than 0.5.
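A hypothetical string with a GC rich beginning and a GC poor ending gives, with a window of 8 and a step of 4:

>>> dna = 'acgcggcgcggatcgaattaatatat'
>>> GCcontent ( dna , 8 , 4 )
[0.875, 0.875, 0.625, 0.25, 0.0]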
17.3 Application
The bacterium Mycobacterium tuberculosis has GC rich genes and therefore is a good genome to use in this process. Two files accompany this experiment. The first is data/nc000962.txt which contains the DNA for the entire genome, and the second is data/nc000962bounds.txt
which contains the start location, stop location and the complement flag. For this study
there are sufficient genes that the complements will not need to be considered.
Functions for reading these two types of files have already been used elsewhere.
See Codes 16.1 and 16.3. Line 2 in Code 17.3 loads the DNA string. This string has
over 4 million characters and so there should be no attempt to print the entire string
to the console. Line 5 loads the bounds data. This indicates that there are 3906 genes.
Each one has a start location, stop location and a binary value indicating if the gene is a
complement.
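A sketch of loading these files, reusing the functions from Chapter 16:

>>> dna = LoadDNA ( 'data/nc000962.txt' )
>>> len ( dna )       # over 4 million characters
>>> bounds = LoadBounds ( 'data/nc000962bounds.txt' )
>>> len ( bounds )
3906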
6. Compute the average and standard deviation over these values.
7. Compare the statistics for these three designated regions.
In this part of the application the goal is to obtain the GC content over large non-coding
regions. These regions are defined as extending from the end of one gene to the beginning of the subsequent gene. There are two caveats. The first is that the 50 bases preceding a gene
will not be considered since they will be considered in the third part of this application.
The second is that the regions must have a minimum length which is arbitrarily set to
128.
Function Noncoding shown in Code 17.4 receives the input DNA string and the
bounds data. Line 5 gets the end of one gene and line 6 gets the beginning of the next
gene. This distance needs to be at least 178 bases since the 50 bases in front of a gene
are to be excluded and the remainder needs to be at least 128 bases. The variable cut is the non-coding DNA between these two genes, and line 9 computes the GC content values over a sliding window and adds them to a growing list.
1 # gccontent.py
2 def Noncoding(indna, bounds):
3     answ = []
4     for k in range(len(bounds)-1):
5         stop = bounds[k][1]       # stop of first gene
6         start = bounds[k+1][0]    # start of next gene
7         if start-stop > 178:
8             cut = indna[stop:start-50]
9             answ.extend(GCcontent(cut))
10    return answ
Gathering the average and standard deviation of these values is easily done with the function StatsOf shown in Code 17.5. The list of values is converted to a vector and then the statistics are returned.
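A minimal sketch of StatsOf, assuming numpy has been imported as np:

1 # gccontent.py (sketch)
2 def StatsOf(answ):
3     vec = np.array(answ)
4     return vec.mean(), vec.std()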
Code 17.6 shows the operation and results. Line 1 gathers the GC content information over the non-coding regions and line 2 returns the average and standard deviation.
As seen the GC content in the non-coding regions is actually quite a bit higher than 0.5.
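The calls are of the following form, assuming the DNA string and bounds data loaded in Code 17.3 are named dna and bounds; no numerical results are reproduced here.

1 >>> gcnc = Noncoding(dna, bounds)
2 >>> StatsOf(gcnc)   # the average is noticeably above 0.5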
The second part of the application is to compute the same statistics over the coding
regions. Since there are plenty of genes the complements will not be considered.
Function Coding shown in Code 17.7 extracts the GC content values from sliding
windows over coding regions. Line 6 ensures that this coding region has a sufficient length
and is not a complement. Code 17.8 shows that there are over 60,000 such values extracted
and that the average is 0.656.
1 # gccontent.py
2 def Coding(indna, bounds):
3     answ = []
4     for k in range(len(bounds)):
5         start, stop, cflag = bounds[k]
6         if cflag == 0 and stop-start > 128:
7             answ.extend(GCcontent(indna[start:stop]))
8     return answ
The third part of the application considers the regions just in front of the coding regions. Again the complements are not considered since there is plenty of data.
Code 17.9 shows the Precoding function, which extracts the GC content values from sufficiently long regions in front of non-complement genes. Line 8 ensures that the region has at least 50 bases and that the gene is not a complement. Code 17.10 runs this test and extracts the average and standard deviation.
1 # gccontent.py
2 def Precoding(indna, bounds):
3     answ = []
4     for k in range(len(bounds)-1):
5         stop = bounds[k][1]       # stop of first gene
6         start = bounds[k+1][0]    # start of next gene
7         cflag = bounds[k+1][2]
8         if start-stop > 50 and cflag == 0:
9             cut = indna[start-50:start]
10            answ.extend(GCcontent(cut))
11    return answ
17.3.4 Comparison
The final step is to compare the distributions of GC content from the different regions. This is accomplished by plotting the Gaussian distributions for the three cases. These are shown in Figure 17.2.
The distributions are relatively close, which means that there is no drastic difference between the regions. The smallest average corresponds to the precoding region and the largest average corresponds to the coding region. In the search for coding regions in a large genome the GC content fluctuation could be an indicator.
It should also be noted that in this genome all averages are above 0.5, which means that the entire genome is GC rich. This is not the case in other genomes. GC content is thus another metric that can be used to compare the contents of differing genomes.
Figure 17.2: Gaussian distributions of the three cases.
Problems
1. Does the size of the sliding window affect the gathered statistics? Repeat the GC content
measures for all three regions but use a sliding window that is half the size of the original.
Answer the question by comparing your results to those printed in Section 17.3.4.
2. Does the step size affect the gathered statistics? Repeat the GC content measures for all
three regions with a step size that is half of the original. Compare your results to those in
Section 17.3.4 to answer the question.
3. The previous chapter used data for AE017199. Compute the GC content over the three
regions for this genome and compare to the data in Section 17.3.4.
4. In the coding regions did the G or C dominate? For each gene compute the ratio G/C to
answer the question.
5. The coding regions consist of codons which are three nucleotides. Is the distribution of G’s
and C’s the same for all codon positions? To answer this question count the G’s and C’s for
each of the three positions in the codons for all of the genes.
6. Do the complement genes have a different distribution of GC content values? Compute the
GC content over the complement genes. Compare the average and standard deviation of
these values to the non-complement genes.
Chapter 18
Large databases of DNA information are being collected by several institutes. In the US the largest repository is Genbank, hosted by the National Institutes of Health (http://www.ncbi.nlm.nih.gov/Genbank/index.html). The concern of this chapter is to develop programs capable of reading files that are stored in three of the most popular formats: FASTA, Genbank, and ASN.1.
18.1 FASTA
The FASTA format is extremely simple, but it contains very little information aside from the sequence. A typical FASTA file is shown in Figure 18.1. The first line contains a small header that may vary in content. In this case the accession number, the name of the species and the chromosome number are given. Some files may have comment lines after the first line that begin with a semicolon. The rest of the file is simply the DNA data.
Code 18.1 shows the commands needed to read in this file. The first version shown
opens the ‘NC 006046.fasta.txt’ (retrieved from [NC0, 2011]), reads the data, and closes
the file. The second version performs all three in a single command. The readlines
function will read all of the data and return a list. Each item in the list is a line of text
ending in a newline character. In the FASTA file there is a newline character at the end
of the header and one at the end of each line of DNA.
Code 18.1 Reading a file.
1 # version 1
2 >>> fp = open('data/nc_006046.fasta.txt')
3 >>> a = fp.readlines()
4 >>> fp.close()
5 # version 2
6 >>> a = open('data/nc_006046.fasta.txt').readlines()
Code 18.2 shows the first few elements in the list. Lines 1-3 show the header information. The rest of the items in list a are the lines of DNA characters.
1 >>> a[0]
2 '>gi|50428312|ref|NC_006046.1| Debaryomyces hansenii CBS767 chromosome D,
3 complete sequence\n'
4 >>> a[1]
5 'CCTCTCCTCTCGCGCCGCCAGTGTGCTGGTTAGTATTTCCCCAAACTTTCTTCGAAT
6 GATACAACAATCA\n'
7 >>> a[2]
8 'CACATGACGTCTACATAGGAGCCCCGGAAGCTGCATGCATTGGCGGCTGATGCGTCA
9 GTGCCAGTGCTCA\n'
As can be seen, each line ends with the newline character \n. So, the only tasks remaining are to combine all of the DNA lines into a long string and to remove the newline characters. Combining strings in a list is performed by the join function (see Code 6.38). The join function combines all but the first line of data, and the empty quotes indicate that there are no characters in between each line of DNA. Code 18.3 joins the strings and removes the newline characters.
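The commands are along these lines:

1 >>> dna = ''.join(a[1:])
2 >>> dna = dna.replace('\n', '')
3 >>> len(dna)
4 1602771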
In this case the DNA string is 1,602,771 characters long. Basically, it takes only three lines of Python code to read a FASTA file and extract the DNA. In actuality it could take only one line, as shown in Code 18.4. However, such code does not increase the speed of the program and is much more difficult to read, so it should be avoided.
Code 18.4 Performing all in a single command.
1 >>> dna = (''.join(open('data/nc_006046.fasta').
2 readlines()[1:])).replace('\n', '')
18.2 Genbank
Genbank files are text-based files that contain considerably more information than FASTA files. They contain information about the source of the data, the researchers that created the file, the publication where it was presented, the DNA, the proteins, repeat regions, and more. However, some of these items are optional and not every file contains every possible type of data.
Genbank files are text files and can be viewed with text editors, word processors, or even the IDLE editor. It is worth the time to load a file and examine its contents. Figure 18.2 shows the first few lines of a Genbank file (accession NC 006046). The first four lines display the locus identification, the definition of the file, the accession number and the version. As can be seen, the capitalized keywords are followed by the data and each entry ends with a newline.
As there are many items in this file, this chapter will not develop code to extract all of them. Instead, code will be developed to extract the most important items, which will demonstrate how the rest of the items can be extracted. While it is possible to develop code to completely automate the entire reading process, a different approach is adopted here. It is quite possible that the user only wants a small part of the file (just the DNA information, for example), and so functions will be built to extract the individual components. These functions can be called individually, or the user can easily build a driver program to call the desired functions.
The ReadFile function is shown in Code 18.5. Line 3 opens a file with the given file name and line 4 reads the data. Line 6 returns the contents of the file as a single long string. Line 8 shows an example call to the function.
Code 18.5 The ReadFile function.
1 # genbank.py
2 def ReadFile(filename):
3     fp = open(filename)
4     data = fp.read()
5     fp.close()
6     return data
7
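The example call in line 8 is of the form below; the file name here is hypothetical.

8 >>> data = ReadFile('data/nc_006046.gb.txt')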
The DNA information is the last entry in the file although it consumes more than half of
the file. In this example the DNA information starts around line 15,394 of this file which
contains 42,110 lines of text. The first four lines at the beginning of the DNA section and
the final four lines are shown in Figure 18.3.
The word ‘ORIGIN’ begins the DNA section and each line contains six sections of
10 bases. The last line may be incomplete and the final line of the file is two slashes. In
order to extract the DNA several steps are necessary. First, this information needs to be taken from the file. Second, the line numbers need to be removed. Third, the groups of 10 bases need to be combined into a long string.
There are many functions in the genbank module and so they are not reprinted in
this chapter. Only the calls to the functions are shown. However, readers should feel free
to examine the codes at their leisure.
The function ParseDNA extracts the DNA string from the file and removes the first column and blank spaces. It returns a long string of just the DNA characters, as seen in Code 18.6. Usually these strings are very long; this example is over 3 million characters. They should not be printed to the console in their entirety, though it is possible to print just a portion.
Code 18.6 Calling the ParseDNA function.
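The call is of the following form; the printed slice is illustrative.

1 >>> dna = ParseDNA(data)
2 >>> len(dna)    # over 3 million characters
3 >>> dna[:10]    # base 1 in the Genbank file is dna[0]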
18.2.3 Keywords
Consider the data in this file starting at line 60, shown in Figure 18.4. It indicates that there is a gene which begins at location 2657 and ends at location 3115. This particular gene is on the opposing strand of the double helix, and so the data in this file is the complement of the gene. This is actually an mRNA, and other annotations are provided. This is not the complete list of information that is available. Some files will list genes and their protein translations, for example. This optional information will be explored in a later section.
This section is concerned with the ability to identify the location of the gene information in the file. Obviously the information begins with the keyword gene, and so it should be identified. In this file the keyword mRNA is used, but in other files there are different keywords depending on the type of data. Some files indicate repeating regions, gaps, etc. Thus, it is necessary to find any type of keyword and then extract the information following it. Words used as keywords may also be used elsewhere in the file. For example, 'gene' is commonly found in other locations. The keywords in the file are preceded by five space characters and followed by several space characters depending on the length of the keyword. When the word 'gene' is used elsewhere in the file it is not preceded and followed by multiple spaces. The default keyword should be ' CDS ' or ' gene ', including the spaces before and after the characters.
The function FindKeywordLocs finds all of the locations of the keyword in the
data stream. The function can receive a second argument if the user wishes to change
the keyword. It returns a list of integers that are the locations of the keyword in the long
string named data. As seen in Code 18.7 there are 3169 such locations indicating that
this file has 3169 genes. Line 6 prints out the first 100 characters from the first location.
As seen it starts with spaces and CDS.
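A sketch consistent with this description, assuming a default keyword of ' CDS ' with five leading spaces, is:

1 # genbank.py (sketch)
2 def FindKeywordLocs(data, keyword='     CDS '):
3     locs = []
4     loc = data.find(keyword)
5     while loc > -1:
6         locs.append(loc)
7         loc = data.find(keyword, loc+1)
8     return locs

Called as klocs = FindKeywordLocs(data), it returns the 3169 locations mentioned above.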
18.2.4 Gene Locations
A gene is a coding region in the DNA. It has at least one starting location and an ending
location. The data may be in its complement form or the coding DNA may be composed
of splices. The code developed in this section will extract the locations of the coding
sequences and an indication if it is a complement. There is a small difference between
Python and Genbank indexing. The first DNA base in the Genbank file is at location 1
as shown in Code 18.6. Python, however, uses the index 0 for the first location. So, the
locations of the coding splices will differ from the Python strings by a value of 1.
The line that follows the keyword ‘ CDS ’ has a few different forms as shown in
Figure 18.5. The first example is merely a start and end location of the coding sequence.
The second example is a complemented string. The third demonstrates a splice in which
the coding sequence is found in two sections. The fourth is a complemented splice. The
fifth example shows multiple splice locations. The final example shows a ‘>’ symbol which
indicates that the exact location is not known. There is no limit on the number of splices
that a coding sequence can have. Thus, a function that is to extract the locations of the
coding region(s) for a single gene needs to be able to handle all of these situations.
For the purpose of extracting gene locations the complement flag only needs to be
noted as the actual complement operation is performed later. The beginning and ending of
a splice location are two numbers separated by two periods. When there is a complement
or a join the first and last splice will have a parenthesis.
In regions where there are splices the entry will start with the word 'join' and then a parenthesis. Inside the parenthesis each splice is denoted by two numbers separated by two periods, and the splices are separated by commas. The number of splices is the number of two-period combinations.

Figure 18.5: Indications of complements and joins.
Several functions are needed to dissect the information. These are not shown here
but only reviewed. The first is EasyStartEnd which is called if the particular gene has
no splices. The Splices function is called if the gene has splices in it. Both functions
read the information just after the keyword and extract the locations of the splices. It is
necessary that both functions return the same format for their information. Thus, a gene
without splices must return the start and stop location as a single splice.
The output of both functions is a list for a single gene. This list has two items. The second is a binary flag for a complement: if the gene is a complement then this flag is True. The first item is a list of splices. Each splice is a tuple of start and stop locations. A gene with no splices is still encoded as a list with a single tuple inside.
The function that the user calls is GeneLocs. It receives the output from FindKeywordLocs, and its output is a list of lists. The call is shown in line 1 of Code 18.8. As seen, the number of items in the list is the number of genes. The first item is shown, and it is not a complement. It has a single start and stop location as well. Had it been spliced then there would have been other tuples within the inner list. This file is from a bacterium and has no spliced genes.
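The call in Code 18.8 is of the form:

1 >>> glocs = GeneLocs(data, klocs)
2 >>> len(glocs)   # one entry per gene
3 >>> glocs[0]     # [[(start, stop)], False] for an unspliced non-complement gene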
18.2.5 Coding DNA
Another keyword that is used is ‘complement’. If the flag compf is True then the comple-
ment of the DNA string needs to be computed. This is accomplished by swapping ‘T’ and
‘A’ as well as swapping ‘C’ and ‘G’. Finally, the string needs to be reversed. Code 18.9
shows the Complement function which swaps the letters and reverses the string.
1 # genbank.py
2 def Complement(st):
3     table = st.maketrans('acgt', 'tgca')
4     st1 = st.translate(table)
5     st1 = st1[::-1]
6     return st1
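For example:

1 >>> Complement('aatgcc')
2 'ggcatt'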
The Translation function retrieves the amino acid sequence for a given gene. Just after
the keyword are several entries of which one is designated as translation. Following this
is the amino acid sequence.
This function is shown in Code 18.11. It receives the data string and a single keyword location. Line 3 searches for the word translation, and the following lines find the beginning and end of the amino acid string. Newline characters are removed in lines 6 and 7: Unix systems use a single character for a newline but MS-Windows systems use two, so two lines of code are needed to remove them. Blank spaces are removed in line 8, and the return is a string with the amino acid sequence. The word translation has 11 characters and is preceded by a slash. There is also an equals sign and a quote that follow. Thus, there are 14 characters from the slash to the beginning of the data. Lines 4 and 5 reflect this distance with the numerals in the code.
1 # genbank.py
2 def Translation(data, loc):
3     trans = data.find('/translation', loc)
4     quot = data.find('"', trans + 15)
5     prot = data[trans+14:quot]
6     prot = prot.replace('\n', '')
7     prot = prot.replace('\r', '')
8     prot = prot.replace(' ', '')
9     return prot
10
The information after a keyword can have several entries, but not all files have the same types of data. These can be read by writing functions similar to Translation with modifications to lines 3, 4 and 5.
18.3 ASN.1
The ASN.1 format is another format that contains several different types of information about the data. In this file, information is encapsulated within curly braces. The first part of a file is shown in Code 18.12. The sequence data starts with a '{' and ends with a '}' (which is not shown in the code). Inside this set of braces are other items. Shown in the code is the id data, and inside of it are other and general. There are several different
In this particular file the DNA data starts as shown in Code 18.13. The actual DNA
string starts just after ncbi2na, but the data is compressed to reduce file size.
For DNA there are only four different letters (A, C, G, T), and thus it is inefficient to store each letter as a single byte of information. Since there are only four items, it takes only two bits to encode each one. Table 18.1 shows the encoding used in ASN.1 files. Thus, a string such as ATTG would be encoded as 00111110. This binary string is converted to a standard hexadecimal string according to Table 18.2. The binary string 00111110 would then be converted to 3E.

Code 18.12 The opening lines of an ASN.1 file.
Table 18.1: Binary representation of nucleotides.
DNA Code
A 00
C 01
G 10
T 11
Decoding the ASN.1 format is just the reverse of this process. The lookup table is created as a Python dictionary in the function DecoderDict shown in Code 18.14. The codes in this section are stored in the file asn1.py.
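A sketch of DecoderDict that maps each hexadecimal character to its pair of bases, following Tables 18.1 and 18.2:

1 # asn1.py (sketch)
2 def DecoderDict():
3     abet = 'acgt'
4     hexd = '0123456789ABCDEF'
5     answ = {}
6     for i in range(16):
7         answ[hexd[i]] = abet[i//4] + abet[i%4]
8     return answ

With this table the hexadecimal pair 3E decodes to attg, matching the ATTG example above.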
Code 18.15 shows the reader function DNAFromASN1 and the call to it. In line 7 the function finds 'ncbi2na' and then the single quotes that follow it. These quotes surround the compressed DNA string. The string is extracted in line 10 and the newlines are removed in line 11. Line 15 considers each letter in the string and uses the dictionary to look up the conversion.
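The line numbers above refer to the original listing; a condensed sketch of the same logic is:

1 # asn1.py (sketch)
2 def DNAFromASN1(fname):
3     data = open(fname).read()
4     loc = data.find('ncbi2na')
5     q1 = data.find("'", loc)
6     q2 = data.find("'", q1+1)
7     hexstr = data[q1+1:q2].replace('\n', '')
8     dd = DecoderDict()
9     dna = ''
10    for c in hexstr:           # two bases per hex character
11        dna += dd.get(c, '')
12    return dna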
The ASN.1 format also contains the locations of coding regions. One example is shown in Code 18.16. This is also very easy to extract: by simply finding keywords such as 'location', 'from', and 'to', the beginning and end of a coding region can be extracted.
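A sketch of that idea is below; the function name CodingLocs is hypothetical, and a robust reader would need stricter matching than these simple find calls.

1 # asn1.py (sketch)
2 def CodingLocs(data):
3     answ = []
4     loc = data.find('location')
5     while loc > -1:
6         f = data.find('from', loc)
7         t = data.find('to', f)
8         start = int(data[f+4:data.find(',', f)])
9         stop = int(data[t+2:data.find(',', t)])
10        answ.append((start, stop))
11        loc = data.find('location', loc+1)
12    return answ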
18.4 Summary
DNA information is stored in several formats. Two of the most popular are FASTA and
Genbank. The FASTA files are very easy to read and this takes only a few lines of code.
The Genbank files are considerably more involved and store significantly more information
beyond the DNA sequence. They can store identifying information, publication and author
information, proteins, identified repeats and much more. Thus, reading these files requires
a bit more programming. These programs, however, are not complicated.
Code 18.16 DNA locations within an ASN.1 file.
1 ...
2 comment "tRNA Asp (GTC) cove score=60.37",
3 location
4 int {
5 from 177641,
6 to 177712,
7 strand plus,
8 id
9 gi 294657026},
10 ...
Problems
1. Write a Python script that can extract all of the sequences from the file Synechocystis.fasta.txt. The output of the script should be a list, and each item in the list is a string (without header information) for a single gene.
2. Write a Python function that can extract the protein id information from a Genbank file.
Chapter 19
Data generated from experiments may contain several dimensions and be quite complicated. However, the dimensionality of the data may far exceed the complexity of the data. A reduction in dimensionality often allows simpler algorithms to effectively analyze the data. The most common method of data reduction in bioinformatics is principal component analysis.
Principal component analysis (PCA) is an often used tool that reduces the dimensionality
of a problem. Consider the following three vectors,
\vec{x}_1 = \{2, 1, 2\}
\vec{x}_2 = \{3, 4, 3\} \qquad (19.1)
\vec{x}_3 = \{5, 6, 5\}
Each vector is in three dimensions, R3 , and therefore a three-dimensional graph would be
needed to plot the data. However, the first and third elements are the same in each vector.
The third element does not have any new information, in that if the first element is known
then the third element is exactly known. Even though the data is in three dimensions the
information contained in the data is in, at most, two dimensions.
Of course, this can be expanded to larger problems. Quite often a single biological experiment can produce a lot of data, but due to time and costs, only a small number of experiments can be run. So there are few data vectors, each with a lot of elements. The dimensionality of the data is large, but the dimensionality of the information is not. So, PCA is a very useful tool that reduces the dimensionality of the data without damaging the dimensionality of the information.
Conceptually, PCA is not a difficult task, as it merely rotates and shifts the coordinates to provide an optimal view of the data. Consider the two-dimensional data shown in Figure 19.1(a). In this example there are five vectors, each with a dimension of two. The PCA algorithm will shift the data so that the average is located at the center of the coordinate system and then rotate the coordinate system to minimize the covariance between data in different coordinates. This is explained in more detail subsequently. Figure 19.1(b) shows the old coordinate system (the lines at an angle) and the new coordinate system. Figure 19.1(c) shows the data after the transformation.
Figure 19.1: (a) The original data in R2. (b) Rotating the coordinate system. (c) The same data in a new coordinate system.
After the transformation the data is centered in the coordinate system and the covariance is minimized. In this case, that minimization found a rotation in which one of the axes is no longer important. All of the data has the same y value and therefore only the x axis is important. The two-dimensional data has been reduced to one dimension without loss of information, as the points still have the same relative positions to each other.
The dimensionality can be reduced when one coordinate is very much like another or
a linear combination of others. This type of redundancy becomes evident in the covariance
matrix which has the ability to indicate which dimensions are dependent on each other.
PCA minimizes the covariance within a data set, and this information is contained within
a covariance matrix. This matrix contains information about the relationships of the
different elements in the input vectors which is also information about the proper view of
the data.
Consider the data in Figure 19.2, which consists of four data vectors each with five elements. The standard deviation (σ) and variance (σ²) of each column are shown. The variance indicates the spread of the data about the mean value.
Figure 19.2: A small data set.
The variance, however, only provides information about the elements individually, as the variance in the first column is not influenced by the data in the other columns. The purpose of the covariance is to relate one column to another. Basically, if the data in two columns are positively correlated (when one goes up in value so does the other) then the covariance is positive. If the data in the two columns are negatively correlated then the covariance is negative. If the data in the two columns are independent then the covariance should be zero.
The covariance is defined as,

c_{i,j} = (\vec{x}_i - \vec{\mu}) \cdot (\vec{x}_j - \vec{\mu}), \qquad (19.2)

where \vec{\mu} is the mean of all of the data vectors and the elements c_{i,j} define the covariance matrix C. The covariance value c_{1,3} links the data in column 1 with the data in column 3 as shown in Figure 19.3.
Consider a case of 1000 random-valued vectors of length 5. Since the data is random there are no links between the different elements. Thus, the covariance values should be close to 0. Code 19.1 shows the creation of this data in line 1. Line 2 uses the cov function to compute the covariance matrix. The diagonal elements relate to the variances of the individual elements, whereas the off-diagonal elements relate to the covariances. As seen, the off-diagonal elements are much closer to 0 than the diagonal elements. Such data is considered to be independent, as activity in one element is not related to activity in the other elements.
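One way to write those two lines, using the cov function from NumPy:

1 >>> data = np.random.rand(1000, 5)
2 >>> cv = np.cov(data, rowvar=False)   # 5 x 5; variances on the diagonal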
A second example is shown in Code 19.2. In this case, the third column is somewhat related to the first column through the code in line 1. This is slightly different from the data in Equation (19.1) in that the two columns are not exactly the same, but they are related. The covariance matrix is computed, and as seen the off-diagonal elements C_{1,3} and C_{3,1} are much larger than the other off-diagonal elements, indicating that there is a strong relation between the first and third elements of a vector. In fact, these values rival the magnitude of the diagonal elements, which indicates that this relationship is quite strong. The fact that the elements are positive indicates that the first and third elements rise and fall in value in unison.
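A version of line 1 that couples the columns is sketched here; the exact coupling used in Code 19.2 may differ.

1 >>> data = np.random.rand(1000, 5)
2 >>> data[:,2] = data[:,2]/2 + data[:,0]/2   # third column partly follows the first
3 >>> cv = np.cov(data, rowvar=False)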
19.2.2 An Example
The covariance matrix of actual data provides insight into the inherent first-order relationships. Consider the case of a covariance matrix of the codon frequencies of a genome.
When creating a gene the DNA is considered in groups of three which are named codons.
Since there are four letters in the DNA string, there are 64 different combinations of three
letters. Thus, there are 64 different codons. A codon frequency vector is the frequency of
each codon in a single gene.
In this example, all of the genes of sufficient length from the genome of Ureaplasma parvum serovar (accession AF222894) are converted to codon frequency vectors. Genes need to be of sufficient length in order for the codon frequency vector to have meaning. After this culling there were 560 codon frequency vectors, and from these the covariance matrix was computed. This created a 64 × 64 matrix, which is too large to display as numerical values. Instead the values are converted to pixel intensities and displayed in Figure 19.4. This is a 64 × 64 image in which brighter pixels indicate larger values.
Figure 19.4: Pictorial representation of the covariance matrix with white pixels representing the largest values.
Regions that have bright pixels indicate a positive covariance value. Of course, there are positive values along the diagonal since those represent the variances of the individual codons. The darker values indicate negative covariances, where the popularity of some codons opposes that of others. Each column (or row) in this image is associated with one of the 64 codons. Columns with gray values indicate that the frequency of the associated codon is independent of the other codons. Columns with many bright or dark regions indicate that the associated codon has a frequency relationship with the other codons.
19.3 Eigenvectors
The PCA computation will compute the eigenvectors and eigenvalues of the covariance
matrix and so this section reviews the theory of eigenvectors. The standard eigenvector-
eigenvalue equation is,
A\vec{v}_i = \mu_i \vec{v}_i, \qquad (19.3)
where A is a square, symmetric matrix, \vec{v}_i is a set of eigenvectors and \mu_i is a set of eigenvalues, with i = 1, ..., N for an N × N matrix A. On the left-hand side there is a matrix times a vector, and the result is a vector. On the right-hand side is a scalar times a vector, which also produces a vector, and of course the computations from both sides must be equal. Thus, if the eigenvectors and eigenvalues are known then the right-hand side is an easy way of finding the solution to the left-hand side. This equation produces a set of eigenvectors and eigenvalues; it holds for all N vectors and their associated values.
The numpy package provides an eigenvector solution engine. Code 19.3 creates a matrix A that is square and symmetric (which emulates the type of matrices that will be used in the PCA analysis). Line 9 calls the eig function to compute both the eigenvalues and eigenvectors. Since A is 3 × 3 there are three values and three vectors. The eigenvectors are returned as columns in a matrix. Lines 18 and 19 show that Equation (19.3) holds for the first eigenvalue-eigenvector pair, and similar tests would reveal that it also holds for the other two pairs.
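In outline (the line numbers here differ from the original listing):

1 >>> A = np.random.rand(3, 3)
2 >>> A = A + A.T                     # force symmetry
3 >>> evals, evecs = np.linalg.eig(A)
4 >>> np.dot(A, evecs[:,0])           # left-hand side of Equation (19.3)
5 >>> evals[0] * evecs[:,0]           # right-hand side; the two agree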
If the matrix A is real-valued, square and symmetric then the eigenvectors are orthonormal. This means that each vector is perpendicular to all of the other vectors (ortho) and has a length of 1 (normal). This, in fact, is the definition of a coordinate system. Code 19.4 shows a couple of examples. Line 1 computes the dot product of an eigenvector with itself, which is 1, indicating that the length is 1. Line 3 computes the dot product of two different eigenvectors, and since that value is 0 the two vectors are known to be orthogonal to each other.
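That is:

1 >>> np.dot(evecs[:,0], evecs[:,0])   # 1.0: unit length
2 >>> np.dot(evecs[:,0], evecs[:,1])   # ~0: perpendicular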
19.4 PCA
The logic of PCA (principal component analysis) is to diagonalize the covariance matrix. In doing so, the elements of the data become independent. If there are first-order relationships within the data then this new representation will often display these relationships more clearly than the original representation. Diagonalization of the covariance matrix is achieved by mapping the data through a new coordinate system.
The protocol for PCA is to center the data, compute the covariance matrix, compute its eigenvectors and eigenvalues, keep the eigenvectors associated with the largest eigenvalues, and project the data onto those eigenvectors.
Consider the data set which consists of 1000 vectors in R3 . The distribution of data
is along a diagonal line passing through (0,0,0) and (1,1,1) with a Gaussian distribution
about this line centered at (0.5, 0.5, 0.5) with standard deviations of (0.25, 0.05, 0.25) in
the respective dimensions. Two views of the data are shown in Figure 19.5.
Figure 19.5: Two views of the data: (a) y vs x; (b) z vs x.
The first eigenvector is (-0.501, -0.508, -0.700), which defines a line that follows the long axis of the data. This is shown in Figure 19.6. This axis, which has the minimal covariance, is quite similar to the example shown in Figure 19.1.
Removing this component from the data is equivalent to viewing the data along the barrel of that axis, which is shown in Figure 19.7. Now the second and third axes can be determined. Both are perpendicular to the first and to each other. The second axis will be along the longest distribution of this data and the third axis must be perpendicular to it. Each axis attempts to accomplish the feat shown in Figure 19.1.
PCA uses eigenvectors to find the axes along the data distributions and in doing so tends to diagonalize the covariance matrix. It should be noted that these axes depend solely on first-order information. Higher-order information is not detected, which is discussed in Section 19.6.
19.4.1 Selection
The computation will produce N eigenvectors, where N is the original number of dimensions. So, at this stage there is no reduction in the number of dimensions. The eigenvalues indicate which eigenvectors are the most important, and the eigenvectors are usually returned in order of eigenvalue magnitude. A typical plot of eigenvalues is shown in Figure 19.8, where the y axis is the magnitude of the eigenvalues. Those eigenvalues that are small are related to eigenvectors that are less important, and it is these eigenvectors that can be discarded. The choice of how many eigenvectors to keep is up to the user and is based on how sharply the curve bends and how much error the user can allow.
Some computational systems like Matlab return the eigenvalues in order of magnitude. This is not necessarily so in Python. The computation naturally tends to produce the eigenvectors and eigenvalues in that order, but in some cases it does not. So, it is important to look at the values of the eigenvalues before selecting which eigenvectors to keep.
19.4.2 Projection
The new coordinate system is defined as the eigenvectors that are kept. Once the new
coordinate system is defined it is necessary to map the data to the new system. Since
dot products are also projections they are used to perform the mapping. For a single
data vector the location in the new coordinate system is the dot products with all of the
eigenvectors,
z_i = \vec{v}_i \cdot \vec{x}, \quad \forall i. \qquad (19.4)
Here the i-th eigenvector is \vec{v}_i and \vec{x} represents one of the data vectors. The output is a vector \vec{z}, which is the location of the data point in the new space. This equation can be applied to the data used in creating the covariance matrix as well as to other data. So, once the coordinate system is defined it is quite possible to place non-training data in the new space.
19.4.3 Python Codes
All of the parts are in place and so the next step is to create a cohesive program. The PCA function shown in Code 19.5 receives the data matrix and the number of dimensions to keep. The data matrix, mat, contains the original data in its rows. The covariance matrix is computed in line 4 and the eigenvectors are computed in line 5. The coefficients, cffs, are the locations of the data points in the new space. The input D is the number of dimensions that the user wishes to keep. The eigenvectors that are associated with the D largest eigenvalues are kept in a matrix named vecs. These are used to compute the locations of the data points in line 14. The function returns the cffs matrix, in which each row is the new location of a data point, and vecs, the eigenvectors that were used.
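Since the prose refers to particular lines of the original listing, only a condensed sketch is given here. It assumes real-valued data in the rows of mat and reuses Project (Code 19.6) for the mapping.

# dimredux.py (sketch)
def PCA(mat, D):
    mat = np.array(mat)
    cv = np.cov(mat, rowvar=False)      # covariance of the columns
    evals, evecs = np.linalg.eig(cv)
    ag = abs(evals).argsort()[::-1]     # order by eigenvalue magnitude
    vecs = evecs[:, ag[:D]].T           # keep the D largest; rows are eigenvectors
    cffs = Project(vecs, mat)           # locations of the data in the new space
    return cffs, vecs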
The PCA function determines the new locations of the data that was used in training. However, data not used in creating the PCA space can also be projected into this new space. This projection is similar to the projection of the training data shown in line 14 of Code 19.5. However, the computation of the eigenvectors is not required, as the eigenvectors from PCA will be used.
Code 19.6 shows the Project function, which maps vectors into the new space. The inputs are the eigenvectors returned from PCA and the new data vectors, which are stored in datavecs. This variable can be a tuple, list or matrix in which the data is contained in the rows. The output is a new matrix named cffs which contains the locations of the data vectors that were in datavecs.
Code 19.6 The Project function.
1 # dimredux.py
2 def Project(evecs, datavecs):
3     ND = len(datavecs)
4     NE = len(evecs)
5     cffs = np.zeros((ND, NE))
6     for i in range(ND):
7         a = datavecs[i] * evecs
8         cffs[i] = a.sum(1)
9     return cffs
The projection of the points into a new space should not rearrange the points. The only change is that the viewer is looking at the data from a different angle. Thus, the distances between pairs of points should not change. This idea makes a good test to determine whether the projection has changed the relationships among the data points. To demonstrate this point the function AllDistances is used. It is shown in Code 19.7 and measures the Euclidean distance between all pairs of vectors. In the case where there are N vectors the number of pairs is (N^2 - N)/2.
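A sketch, assuming the vectors are the rows of data:

# dimredux.py (sketch)
def AllDistances(data):
    data = np.array(data)
    N = len(data)
    answ = []
    for i in range(N):
        for j in range(i+1, N):
            answ.append(np.sqrt(((data[i]-data[j])**2).sum()))
    return np.array(answ)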
Given a set of data consisting of 20 vectors, each with 10 elements and an average value of 89, the data is mapped into PCA space as shown in line 1 of Code 19.8. Line 2 computes the distances between all pairs of points in the original data and line 3 computes the same for all pairs of points in the PCA space. Since the PCA projection is merely a shift and rotation, none of the distances should change. Lines 4 and 5 show that the maximum difference is a very small number that is below the precision of the computation. This shows that none of the distances between any pair of data points changed in the projection.
Code 19.8 The distance test.
1 >>> cffs, vecs = dmr.PCA(data, 10)
2 >>> a = dmr.AllDistances(data)
3 >>> b = dmr.AllDistances(cffs)
4 >>> abs(a-b).max()
5 3.7400651962116171e-06
Consider the image shown in Figure 19.9, which is from the Brodatz image set that has been used as a library for texture recognition engines. Each row in this image is treated as a vector and used as an input to the PCA process. Line 2 of Code 19.9 loads this image as a matrix. Lines 3 through 5 complete the PCA process using only the first two eigenvectors.
The original image is 640 × 640 thus producing 640 vectors in a 640 dimensional
space. The matrix ndata is the projection of that data to a two dimensional space. These
points are plotted in Figure 19.10. Each point represents one of the rows in the original
image. The top row is associated with the point at (-584, -66).
The original image has the quality that consecutive rows have considerable similarity. This is evident in the PCA plot, as consecutive points are nearest neighbors. The line connecting the points shows the progression from the top row of the image to the bottom. The clump of points to the far left is associated with the bright knothole in the image. This feature of similarity leads to an example demonstrating that most of the information is contained within the first few dimensions of the PCA space.
Code 19.9 The first two dimensions in PCA space.
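The steps are of the following form; the file name is hypothetical.

1 # pca.py (sketch)
2 >>> mgdata = sm.imread('data/wood.png', flatten=True)
3 >>> ndata, vecs = dmr.PCA(mgdata, 2)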
In this example, the rows of the original information will be shuffled into a random
order. These shuffled vectors are then projected into PCA space. The next step will
find the nearest neighbors in the PCA space and use that information to reconstruct the
original image.
The image rows are shuffled by the ScrambleImage function shown in Code 19.10. Line 5 creates a vector of random values and line 6 obtains the indexes that sort them, thus creating a random order in which the rows will be arranged. The variable seedrow is the new location of the first row of the image. This will be used to start the reassembly process. The scrambled image is shown in Figure 19.11.
1 # pca.py
2 def ScrambleImage(fname):
3     mgdata = sm.imread(fname, flatten=True)
4     V, H = mgdata.shape
5     r = np.random.rand(V)
6     ag = r.argsort()
7     sdata = mgdata[ag]
8     seedrow = list(ag).index(0)
9     return sdata, seedrow
Each row is considered as a vector, and the PCA process is used to remap these vectors into a new data space. Line 1 in Code 19.11 scrambles the rows and line 2 projects the scrambled data into a PCA space. The data points in this PCA space are in the same locations as in Figure 19.10, but lines can no longer be drawn between the points.
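That is (the file name is hypothetical; Project is from Code 19.6):

1 >>> sdata, seedrow = ScrambleImage('data/wood.png')
2 >>> ndata = dmr.Project(vecs, sdata)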
Code 19.12 shows the function Unscramble which performs the reconstruction of
the image. The inputs are the scrambled data, sdata, the location of the first row of the
image in the scrambled data, seedrow, and the projected data, ndata. Currently, all 640
dimensions are contained in ndata but these will be restricted in subsequent examples.
The variable udata will become the unscrambled image and the first row is placed in line
5. The list unused maintains a list of rows that have not been placed in udata. So, the
first row is removed in line 7. The variable k will track which row is selected to be placed
into udata.
Line 11 computes the Euclidean distance from a specified row to all of the other unused rows. Thus, in the first iteration k = 0 and this computes the distance to all other rows. However, this uses the projected data. Basically, it is finding the closest point in the PCA space shown in Figure 19.10, though the plot in that figure displays only 2 of the 640 dimensions. Line 12 finds the smallest distance and thus the vector that is closest to row k. The corresponding row of data is then placed in the next available row of udata, and the vector in PCA space is removed from further consideration in line 15.
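The line numbers above refer to the original listing; a condensed sketch of Unscramble is:

# pca.py (sketch)
def Unscramble(sdata, seedrow, ndata):
    V = len(sdata)
    udata = np.zeros(sdata.shape)
    udata[0] = sdata[seedrow]              # place the first row
    unused = list(range(V))
    unused.remove(seedrow)
    k = seedrow
    for i in range(1, V):
        dists = np.sqrt(((ndata[unused] - ndata[k])**2).sum(1))
        j = unused[int(dists.argmin())]    # nearest unused row in PCA space
        udata[i] = sdata[j]
        unused.remove(j)
        k = j
    return udata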
In the first example, all 640 dimensions of the projected space are used. Thus, there
should be absolutely no loss of information. The call to the function is shown in line 1 of
Code 19.13. The output udata is an exact replicate of the original image.
However, not all of the dimensions in the PCA space are required. Consider the plot
of the first 20 eigenvalues shown in Figure 19.8. When data is organized the eigenvalues
fall rapidly thus indicating the importance of each eigenvector. Commonly, the number of
eigenvectors to use is the location of the elbow in this curve.
Line 2 in Code 19.13 reconstructs the image using only 7 of the 640 eigenvectors.
The result is nearly perfect reconstruction with only a few rogue lines at the bottom of
the image. The result is shown in Figure 19.12. This used the data points projected into
an R7 space and then computed the Euclidean distances between the projected points in
that space. The few rows at the bottom were probably rows that were skipped during
reconstruction.
Line 3 in Code 19.13 reconstructs the image using only 2 of the 640 eigenvectors. The result is shown in Figure 19.13. As seen there are a few more errors in the reconstruction, but most of the reconstruction is intact. This is not a surprise since more than two eigenvalues had significant magnitude in Figure 19.8. However, even with this extreme reduction in dimensionality most of the image could be reconstructed. This indicates that even in the reduction from 640 dimensions to 2 a significant amount of information was preserved. In some applications of PCA this loss of information is not significant to the analysis being performed.
Data for this example starts with the image in Figure 19.14. This image is 480 × 640 and
each pixel is represented by 3 values (RGB). The data set is thus 307,200 vectors of length
3. The task in this example is to isolate the blue pixels. At first this sounds like a simple
task which it is for humans. However, since the blue pixels have a wide range of intensities
performing this task with RGB data is not as simple.
It is possible to contrive an equation that will attempt this isolation,

m = \begin{cases} 1 & b/(g+1) > 1.5 \text{ and } b/(r+1) > 1.5 \\ 0 & \text{otherwise} \end{cases} \qquad (19.5)
The pixels isolated by this process are shown in Figure 19.15. The LoadRGBchannels
function in Code 19.14 loads the image and returns three matrices representing the red,
green, and blue components. The IsoBlue function performs the attempt at isolating the
blue pixels from Equation (19.5).
1 # pca.py
2 def LoadRGBchannels(fname):
3     data = sm.imread(fname)
4     r = data[:,:,0]
5     g = data[:,:,1]
6     b = data[:,:,2]
7     return r, g, b
8
9 def IsoBlue(r, g, b):
10    ag = b/(g+1.0) > 1.5
11    ab = b/(r+1.0) > 1.5
12    isoblue = ag * ab
13    return isoblue
The data is in R3, and this is shown in Figure 19.16(a), where the green points are those isolated in Figure 19.15 and the red points are the other pixels. Figure 19.16(b) shows the first two axes of this plot. As seen there is a separation of the isolated pixels from the others, and thus finding a discrimination surface is possible. It should also be noted that the green points in the plot are those in Figure 19.15, which are not solely the blue pixels.
Figure 19.16: Displaying the original data in (a) R3 and (b) R2. The green points are those denoted in Figure 19.15.
The first two axes of the PCA projection of this data are shown in Figure 19.17. As seen, the plane that divides the two sets of data is almost horizontal. Recall, however, that the green points are not actually the set of blue pixels but an estimate.
The horizontal plane is around x2 = 0.45, and so the next step is to gather those points in the new space for which x2 ≥ 0.45 (where x2 is the data along the second axis). Figure 19.18 shows the qualified points, and clearly this is a better isolation of the blue pixels.
The mapping to PCA space did not drastically change the data. It did, however,
represent the data in a manner such that a simple threshold (only one axis) could isolate
the desired data.
Figure 19.18: Points isolated from a simple threshold after mapping the data to PCA space.
Consider a system that contains a state vector that is altered in time through some sort of process, such as the changes in protein populations within a cell. Each element of the state vector describes the population of a single protein at a particular time. As time progresses the populations change, which is described as changes in the state vector.
Eigenvectors are a useful tool for describing the progression of a state vector in an easy-to-read format. In this case the state vector is \vec{v}[t] and the machine that changes the state vector is a simple matrix M. In reality the machine that changes the state vector can be far more complicated than a linear transformation described by a single matrix. The progress of the system is then expressed as,

\vec{v}[t+1] = \mathbf{M}\vec{v}[t].
Code 19.15 runs the system for 20 iterations, storing each state vector as a row in data. The matrix M is forced to have a zero sum in line 2 so that it does not induce energy into the system.
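A sketch of such a system; the dimension and scaling here are assumptions.

# pca.py (sketch)
M = np.random.rand(5, 5) - 0.5
M = M - M.mean()              # zero sum so no energy is induced
v = np.random.rand(5)
data = np.zeros((20, 5))
for t in range(20):
    v = np.dot(M, v)
    data[t] = v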
This system contains 20 vectors and it is not easy to display all of the information. The plot in Figure 19.19 shows just a few of the data vectors. The first element increases in value as time progresses. Some of the others increase and some decrease. Certainly, if the system contained hundreds of vectors and the relationships were complicated then it would be difficult to use such a plot to understand the system.
In this case the first two eigenvalues are used in computing the PCA space. The resultant data is plotted in Figure 19.20. The 20 data points represent the state of the system at the 20 time intervals. The first point, cffs[0], is close to (0, 0), and in this case the system is seen to create an outward spiral.
The outward spiral indicates that the values in the system are increasing in magnitude. If this were to continue then the values of the system would approach infinity. This
is an unstable system. An inward spiral would indicate that the system is tending towards
a steady state in which the state vector stops changing.
The most interesting cases are those where the spiral neither expands outward nor moves inward. The system draws overlapping circles (or other types of enclosed geometries). This indicates that the system has obtained a stable oscillation. If the path exactly repeats itself then the oscillations are exactly repeated. If the path stays within a finite orbit then it describes a limited cycling of the system.
Code 19.16 generates a system in which values are not allowed to exceed a magnitude of 1; it is plotted in Figure 19.21. It starts in the middle and quickly begins looping to the left. This system was run for 1000 time steps and it settles into an oscillation. There is a regular cycle that repeats about every 20 time steps. The hard corners appear because the system forces values to be no greater than 1, which is a nonlinear operation. The corners occur when one of the elements of the state vector drastically exceeds 1 and the nonlinear restriction is employed.
In a sensitivity analysis the cffs are just five data points in R2 space. The plot in Figure 19.22 shows a set of +'s which represent the five dimensions of the first system. The *'s represent the same five dimensions in the second system. The two data points that moved apart the most are located around x = 5, y = −8. By printing the cffs it can be seen which variable this is. The conclusion to draw is that the change in the system affected the second variable more than the others. Likewise, the change in the system barely affected the first and fourth variables.
Figure 19.22: Sensitivity analysis of the data.
Consider a case in which the data consists of several images of a single face at different poses. In this case, the face is a computer-generated face and so there are no factors such as deviations in expression, hair, glasses, etc. Figure 19.23 shows the data mapped to a PCA space. As seen, the pose of the face gradually changes from left to right. A possible conclusion is that PCA was capable of mapping the pose of the face.
This would be an incorrect conclusion. PCA can capture only first-order data. In other words, it can compare pixel (i, j) of the images to each other but not pixel (i, j) with pixel (k, l). The reason that the faces sorted as they did in this example is more a function of the location of the bright pixels in the images. The idea of “pose” is not captured; it is only because the face was illuminated from a single source that there is a correlation between the pose and the regions of illumination.
19.7 Summary
The principal components are a new set of coordinates in which the data can be represented. These components are orthonormal vectors and are basically a rotation of the original coordinate system. However, the rotation minimizes the covariance of the data and thus some of the coordinates may become unimportant. In this situation these coordinates can be discarded, and thus PCA space uses fewer dimensions to represent the data than the original coordinate system.
Principal components can be computed using an eigenvector engine or singular value decomposition. The NumPy package offers both and the interface is quite easy.
Figure 19.23: PCA map of face pose images.
Eigenvectors are also used to explore the progression of a system. Limit cycle plots using eigenvectors indicate if the system is shrinking, expanding, or caught in some sort of oscillation.
Problems
1. Given a set of N vectors, suppose the eigenvalues of this set turn out to be 1, 0, 0, 0, .... What does this imply?

2. Given a set of N vectors, suppose the eigenvalues of this set turn out to be 1, 1, 1, 1, .... What does this imply?
3. Given a set of purely random vectors, describe what you expect the eigenvalues to
be. Confirm your prediction.
4. Given a set of N random vectors of length D. Compute the covariance matrix. Com-
pute the eigenvectors. Compute the covariance matrix of the eigenvectors. Explain
the results.
5. Repeat the work to obtain Figure 19.21, but add ±5% noise to each iteration. Ex-
plain the new system plot.
Chapter 20
Codons are sets of three nucleotides that are used by the cell to determine which amino acid to attach to a chain as a protein is created. There are 64 different codons but only 20 different amino acids, which means that many amino acids have multiple associated codons. It is therefore possible that some genomes favor one codon over another in the DNA when producing a gene. If this is true then it is possible to classify genomes according to their codon frequencies. This chapter will explore this concept and show that for bacteria this classification is achievable.
20.1 Codon Frequencies
Figure 16.4 shows the conversion from codons to amino acids. Each codon is a set of three nucleotides, and the DNA for a gene should have a length that is divisible by three.
To compute the codon frequencies, the number of occurrences of each codon is obtained and these counts are divided by the total number of codons. So for a single codon, the frequency is,

f_i = \frac{c_i}{N}, \qquad (20.1)

where c_i is the number of times that codon i was seen and N is the total number of codons.
The first step in counting the codons is to create a list of all of the possible codons. Once set, the order should not be changed. The function CodonTable shown in Code 20.1 creates a list of strings which are the 64 codons. The function is called in line 13, and it returns a list of 64 strings, of which the first 4 are printed to the console. This is the complete list of codons.
Code 20.1 The CodonTable function.
1 # codonfreq.py
2 def CodonTable():
3     abet = 'acgt'
4     answ = []
5     for i in range(4):
6         for j in range(4):
7             for k in range(4):
8                 codon = abet[i] + abet[j] + abet[k]
9                 answ.append(codon)
10    return answ
11
The next step is to count the number of codons in a string. This is performed in the function CountCodons shown in Code 20.2. The inputs are a DNA string for a gene and the codon list created by CodonTable. Line 3 gets the length of the input string and line 4 creates a vector with 64 elements, all initially set to 0. This will hold the counts of the codons. Line 5 starts the loop which goes from the beginning to the end of the string, stepping every three bases. Thus the index i is always at the beginning of a codon. Line 6 extracts a single codon and line 8 finds the position of this codon in the codons list. The variable ndx is an integer that corresponds to the location of the codon in codons. The first codon in the list is 'aaa', and so if cut were also 'aaa' then ndx would be 0. Line 9 then increments the value in the vector for that position. In this way, the vector accumulates the number of times each codon appears. Line 12 calls this function, and the variable cts is a vector of 64 elements that are the counts of each codon in the string dna.
Line 7 may seem unnecessary at first. However, there are letters in a DNA string other than A, C, G, or T. These letters indicate that a nucleotide does exist at the position but it is not known what it is. So, line 7 makes sure that the codon consists of only the four letters before it is counted.
It is also possible to create counts as a list instead of a vector. However, to compute the frequencies the counts will all be divided by a single value. Therefore, a vector is a better choice for containing the data. The most important point is that the order of codons cannot be changed in later processing or the counts will no longer correspond to the correct codons.
Code 20.2 The CountCodons function.
1 # codonfreq.py
2 def CountCodons(dna, codons):
3     N = len(dna)
4     counts = np.zeros(64)
5     for i in range(0, N, 3):
6         cut = dna[i:i+3]
7         if cut in codons:
8             ndx = codons.index(cut)
9             counts[ndx] += 1
10    return counts
11
Computing the codon frequencies is easily performed by dividing the counts by the total number of counts. Code 20.3 shows the division of the vector by the sum of the vector in line 1. This produces the codon frequencies, and one property of a frequency vector is that its sum is 1.0, which is shown to be true.
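That is:

1 >>> freqs = cts / cts.sum()
2 >>> freqs.sum()
3 1.0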
Since this set of commands will be called multiple times it is prudent to create a
driver function. Code 20.4 shows the function CodonFreqs which does just that. It
creates the codon table, counts the codons and then computes the frequencies.
A genome has several genes, and the codon frequencies can be computed for each gene that has a sufficient length. Short genes are not used because the frequency vector is meaningless if there are only a few codons. For example, if the gene has fewer than 64 codons then it is impossible for every codon to appear. So for this section the minimum gene length is 3 × 64 = 192 bases.
Code 20.4 The CodonFreqs function.
1  # codonfreq.py
2  def CodonFreqs(dna):
3      codons = CodonTable()
4      cts = CountCodons(dna, codons)
5      freqs = cts / cts.sum()
6      return freqs
The frequency vectors for an entire genome are obtained by the GenomeCodonFreqs function shown in Code 20.5. The input is the Genbank file name. Line 3 reads the data, line 4 obtains the entire DNA string, line 5 gets the keyword locations and line 6 obtains the locations of all of the genes. Now the genome is read and ready to be analyzed. In the for loop the variable g is one of the elements in the list glocs. Line 9 gets the coding DNA for a single gene. If this length is at least 192 bases then the codon frequencies are computed and stored in a list named frqs.
Code 20.5 The GenomeCodonFreqs function.
1  # codonfreq.py
2  def GenomeCodonFreqs(fname):
3      data = gb.ReadFile(fname)
4      dna = gb.ParseDNA(data)
5      klocs = gb.FindKeywordLocs(data)
6      glocs = gb.GeneLocs(data, klocs)
7      frqs = []
8      for g in glocs:
9          cdna = gb.GetCodingDNA(dna, g)
10         if len(cdna) >= 192:
11             f = CodonFreqs(cdna)
12             frqs.append(f)
13     return frqs
14
The number of frequency vectors will be at most the number of genes. In this case, there are 1110 genes and 1019 of them are of sufficient length; just 91 genes were too short to be used.
20.2 Genome Comparison
This section compares the codon frequency distributions of two genomes.
A single genome has many genes and so there is a distribution of values for each codon frequency. The function Candlesticks creates a file that is used to display the statistics of the codon frequencies for an entire genome. The call to this function is shown in Code 20.6. This function receives the list of frequency vectors and a name used to write the data to a file. The third argument is 0, which is the amount of horizontal shift used in the plot. This is used when plotting more than one genome on the same plot, as seen in the next section. The file can be read by GnuPlot or spreadsheets, which can then create the plots.
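The listing for Code 20.6 is not reproduced here; a plausible call under the argument order just described (the file names are hypothetical) is:

>>> frqs = GenomeCodonFreqs('genome1.gb')   # hypothetical Genbank file
>>> Candlesticks(frqs, 'genome1.txt', 0)    # third argument: horizontal shift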
The results are shown in Figure 20.1, which contains 64 different bars. Each bar has a box and whiskers. The extent of the box is the average plus and minus one standard deviation, so almost 70% of the frequency values fit within the range of the box. The extent of the whiskers shows the highest and lowest frequency values. The short bars correspond to the codons that are very infrequent in this genome.
20.2.2 Two Genomes
Now that the procedure has been established, comparing two genomes is straightforward.
Code 20.7 shows the process of comparing two genomes. Line 2 gathers the frequency
vectors for the first genome and line 3 creates the files suitable for plotting. Lines 5 and
6 repeat the process for a second genome. The last argument in line 6 is 0.3 which shifts
the plots of the second genome 0.3 units to the right. In this manner the two plots do not
overlap but are side-by-side. The result is shown in Figure 20.2. Only the first 20 of the
64 codon frequencies are shown for clarity. Otherwise, the plot becomes too dense to see
the details.
Code 20.7 Creating plots for two genomes.
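The body of Code 20.7 is not reproduced here; a minimal sketch of the process just described, with hypothetical file names, is:

frqs1 = GenomeCodonFreqs('genome1.gb')
Candlesticks(frqs1, 'genome1.txt', 0)      # first genome, no shift
frqs2 = GenomeCodonFreqs('genome2.gb')
Candlesticks(frqs2, 'genome2.txt', 0.3)    # second genome, shifted right by 0.3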
Figure 20.2: The statistics for the first 20 codons for two genomes.
The third codon shows that the two boxes have a very small overlap. This indicates that the frequency of this codon is very different for the two genomes. Two other codons in this view also have very little or no overlap. This plot shows less than a third of the total number of codons.
Thus, given a codon frequency vector randomly selected from the two genomes, it is possible to determine which genome it came from by examining the frequencies of a few decision codons.
The plot in Figure 20.2 shows only a part of the comparison of just two genomes. Comparing multiple genomes requires a different analysis technique. For this task, PCA will be used as proposed by Kanaya et al. [Kanaya et al., 2001]. The protocol for this experiment is to compute the codon frequency vectors for the qualifying genes of every genome, collect all of these vectors into a single data matrix, and then project the matrix onto its first two principal components.
The result is shown in Figure 20.3. Each genome is assigned a different color. The data was sufficiently organized in the PCA representation that each genome has its own isolated territory. This indicates that codon frequencies are sufficient for classifying bacterial genomes.
Some genomes do overlap in this view. However, this is only the first two PCA axes, and it is always possible that groups which appear to overlap in this view are actually separated, as could be seen in other views.
In this particular case, some of the clouds do overlap and no particular view will contradict this point. This means that those genomes are quite similar with respect to their codon frequencies. This too is important information.
Projects
1. This chapter applies PCA to the codon frequencies of bacteria. In this project repeat the process for another set of genomes. For example, a project may compare the codon frequencies for mammalian genomes.
2. This chapter reviewed the process for the coding regions of the bacterial genomes. Genomes have evolved over time and so the non-coding regions are also related to those of their ancestors. Determine if genomes can be separated by the PCA process applied to codon frequencies in the non-coding regions of the bacterial genomes.
Chapter 21
Sequence Alignment
DNA sequences are complicated structures that have been difficult to decode. A strand of DNA contains coding regions which produce genes and non-coding regions which may or may not have functionality. As systems evolve, genes are passed on, sometimes with small alterations or relocations. Since the non-coding regions are less important in many respects, they are often passed on with more alterations. These similarities allow us to infer the functionality of a gene by relating it to other genes with known function.
The main computational technique for accomplishing this comparison is to align sequences. The purpose of alignment is to demonstrate the similarity of two (or more) sequences. At first this sounds like an easy job: each sequence has only four bases and it should not be too hard to determine if the sequences are similar.
As in most real-world problems, it is not that easy. Two sequences can differ because of base substitutions, and they can also differ by having extra or missing bases. Computationally, the latter is a more difficult problem to solve since smaller chunks of the sequences will need to align differently. Another problem is that in a DNA strand the beginning and ending of the coding regions may not be known. Thus, between two strands the important parts may be similar while the unimportant parts are dissimilar, which is perfectly acceptable and should not deteriorate the score of the alignment of the coding regions. Another problem is that parts of the coding regions can be located in different regions of a strand. For example, a gene may be constructed from two different subsections of the strand, and there is no guarantee that these two sections will be located in the same regions of the two strands. Still, a computer program will need to find similarities among these strings.
This chapter will consider simple alignment algorithms and review the widely used dynamic programming approach. Other, more complicated, approaches will be discussed but not replicated.
21.1 Simple Alignment
This section begins the presentation of alignment techniques with a simple alignment. Its main use is to define terms and to show the deficiency in believing that simple alignments will be of much use.
21.1.1 An Alphabet
The algorithms contained here can be applied to strings from any source. Before they are
presented it is necessary to provide a few definitions. A string is an array of characters from
an alphabet. For the case of DNA the alphabet has only four characters (ACGT). Protein
sequences are made from an alphabet of 20 characters (ARNDCQEGHILKMFPSTWYV).
Certainly, strings from English text (26 letter alphabet) can be considered or from any
other language. Usually, the alphabet is represented with Σ, and for DNA the alphabet
is,
Σ = {A, C, G, T}.    (21.1)
The first step that needs to be considered is how to assign a score to an alignment. When
two letters align they should contribute positively to the score and mismatches should
contribute negatively.
A perfect match occurs when aligned letters from two strings are the same. Even in this simple case, questions about measuring the quality of the match need to be considered. In the following case two simple sequences are perfectly aligned. Should the measure of alignment treat all of the letters equally? Is an alignment of the sequence AATT with itself more important than the alignment of ACGT with itself?
These questions can be further complicated by considering the function of the DNA. Multiple codons code for the same amino acid, so should the alignment of CGA with CGG (which code for the same amino acid) be scored differently than CGA with GGA?
Insertions and deletions (indels) are cases in which a base is added or removed from a
sequence. When identified these are denoted by a dash as in the following case.
ACGT
AC-T
The indels can arise from biological causes, in which a base is actually removed from or inserted into an offspring's sequence. Other times indels are caused by difficulties arising from the sequencing process: it is possible that a base was not called correctly or that the signal was too weak, noisy, or imperfect to call the base. In any case, the alignment process needs to consider the possibility of indels. In the previous case the computer program would receive the two strings ACGT and ACT and would have to figure out that the deletion of a G has occurred.
This is a serious matter. If the sequences are very long (perhaps thousands of
bases) then there are thousands of locations where the indel can occur. Furthermore, the
sequences may have several indels and at one location multiple indels may need to be
considered. For sequences of significant length it is not possible to consider all possible
indels in a brute force computing fashion.
21.1.3.1 Rearrangements
Genes are encoded within DNA strands but a single gene may be coded in more than one
region of the strand. Thus, coding regions can contain non-coding regions within their
boundaries. Consider an example which has a strand consisting of xNxMx where x is a
non-coding region and the N and M are coding regions. It is possible that the distance
between N and M can change in another sequence. It is also possible that the new
sequence could be of the form xMxNx. Thus, during the alignment it may be necessary
to identify non-coding regions and lower their importance.
Given two sequences and the task of global alignment it is still necessary to be
concerned with the beginning and ending of the sequences. The sequencing technology
tends to have problems calling the very beginning and very end of sequences. Thus, the
actual sequence may be longer than necessary. For the case of xNx the leading and trailing
x part of the sequence may be any length and thus during the global alignment it may
still be necessary to exclude leading and trailing portions of a sequence.
Another complicating factor is sequence length. Often the alignment algorithms are based
on the number of matches. Consider a case in which the sequences are 100 elements long
and 90 of them align. Consider a second case in which the sequences are 1000 elements
long and 800 of them align. In the second case the score can be higher since many more
elements aligned, but the percentage of alignment is greater in the first case. So, some
algorithms consider the sequence length when producing an alignment score.
Aligning two strings sounds like a simple process, but it has long been mostly abandoned in the majority of bioinformatics applications. This section will explore simple alignment algorithms and the reasons that more complicated engines are needed.
21.1.4.1 Direct Alignment
This is an extremely simple concept. Given two sequences a score is computed by adding
a positive number for each match and a negative number is added for a mismatch. A
simple example with a total score of 4 is:
RNDKPKFSTARN
RNQKPKWWTATN
++-+++--++-+
An alignment between two different letters in this case counts as -1. An alignment of a letter with a gap is also a mismatch but perhaps should be counted as a bigger penalty, for example -2. In this fashion, alignment with gaps is discouraged more than mismatched letters alone. Code 21.1 displays the function SimpleScore which performs this comparison. In lines 3 and 4 the strings are converted to arrays in which each element is a single letter. Line 5 counts the number of matching characters and line 6 subtracts the number of mismatching characters. Line 7 counts the gap characters in both strings and line 8 subtracts this count, which adds the extra penalty for aligning a letter with a gap.
Code 21.1 The SimpleScore function.
1  # simplealign.py  (assumes: import numpy as np)
2  def SimpleScore(s1, s2):
3      a1 = np.array(list(s1))
4      a2 = np.array(list(s2))
5      score = (a1 == a2).astype(int).sum()
6      score = score - (a1 != a2).astype(int).sum()
7      ngaps = s1.count('-') + s2.count('-')
8      score = score - ngaps
9      return score
10
In reality the mismatched characters in aligning amino acid sequences are not counted
equally. Through evolution some amino acid changes are more frequent than others. This
Table 21.1: The BLOSUM50 matrix. (Several entries of the extracted table were corrupted; the values below are restored using the symmetry of the matrix.)
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  5 -2 -1 -2 -1 -1 -1  0 -2 -1 -2 -1 -1 -3 -1  1  0 -3 -2  0
R -2  7 -1 -2 -4  1  0 -3  0 -4 -3  3 -2 -3 -3 -1 -1 -3 -1 -3
N -1 -1  7  2 -2  0  0  0  1 -3 -4  0 -2 -4 -2  1  0 -4 -2 -3
D -2 -2  2  8 -4  0  2 -1 -1 -4 -4 -1 -4 -5 -1  0 -1 -5 -3 -4
C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
Q -1  1  0  0 -3  7  2 -2  1 -3 -2  2  0 -4 -1  0 -1 -1 -1 -3
E -1  0  0  2 -3  2  6 -3  0 -4 -3  1 -2 -3 -1 -1 -1 -3 -2 -3
G  0 -3  0 -1 -3 -2 -3  8 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4
H -2  0  1 -1 -3  1  0 -2 10 -4 -3  0 -1 -1 -2 -1 -2 -3  2 -4
I -1 -4 -3 -4 -2 -3 -4 -4 -4  5  2 -3  2  0 -3 -3 -1 -3 -1  4
L -2 -3 -4 -4 -2 -2 -3 -4 -3  2  5 -3  3  1 -4 -3 -1 -2 -1  1
K -1  3  0 -1 -3  2  1 -2  0 -3 -3  6 -2 -4 -1  0 -1 -3 -2 -3
M -1 -2 -2 -4 -2  0 -2 -3 -1  2  3 -2  7  0 -3 -2 -1 -1  0  1
F -3 -3 -4 -5 -2 -4 -3 -4 -1  0  1 -4  0  8 -4 -3 -2  1  4 -1
P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3
S  1 -1  1  0 -1  0 -1  0 -1 -3 -3  0 -2 -3 -1  5  2 -4 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1  1 -4 -4 -3 15  2 -3
Y -2 -1 -2 -3 -3 -1 -2 -3  2 -1 -1 -2  0  4 -3 -2 -2  2  8 -1
V  0 -3 -3 -4 -1 -3 -3 -4 -4  4  1 -3  1 -1 -3 -2  0 -3 -1  5
is also complicated by the fact that some amino acids are more commonly seen than others. Analysis of these changes has led to the creation of substitution matrices. There are several versions depending on the mathematical methods employed and the evolutionary time step. The most popular matrices are either PAM or BLOSUM, and the number that follows the name indicates the evolutionary time step. The matrices have some differences but are similar enough that only one matrix will be used here.
Since there are 20 amino acids the substitution matrix is 20×20. This matrix contains the log-odds of a substitution. The matrices may be presented in different arrangements and so it is necessary to first define the alphabet that is associated with the use of a matrix. In this case the alphabet is: 'ARNDCQEGHILKMFPSTWYV'.
The BLOSUM50 matrix is shown in Table 21.1. Each row and column is associated with a letter. So the log-odds of an 'A' changing to an 'R' is -2.
The odds of an event occurring are based on the probability of the event normalized by the probability of the constituents existing. The range of the values for the odds can be large, and that is one of the reasons that log-odds are used. Negative values indicate that the odds were less than 1 and positive values indicate that the odds were greater than 1. Thus, positive values in the table are events that are likely to occur, with larger values indicating a larger chance. As seen, all of the events in which an item remains unchanged (the values down the diagonal of the matrix) are positive and are the largest values.
The score for an alignment of two amino acids comes from this table. For example,
a ‘D’ aligned with a ‘K’ has a score of -1. An example alignment is
R N D K P K F S T A R N
R N Q K P K W W T A T N
7 7 0 6 10 6 1 -4 5 5 -1 7
The blosum module contains both the BLOSUM50 matrix and the associated alphabet. Code 21.2 shows a small part of the matrix and the entire alphabet.
Code 21.2 Accessing the BLOSUM50 matrix and its associated alphabet.
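The original listing is not reproduced here. A plausible interactive session, assuming the module exposes the matrix as BLOSUM50 and the alphabet as PBET (names that appear later in this chapter), is:

>>> import blosum
>>> blosum.PBET
'ARNDCQEGHILKMFPSTWYV'
>>> blosum.BLOSUM50[:3,:3]    # upper left corner of the matrix
array([[ 5, -2, -1],
       [-2,  7, -1],
       [-1, -1,  7]])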
The next task is to obtain the correct value from the substitution matrix for a given alignment. Consider the case in which the task is to compute the alignment score for 'RNDKPKFSTARN' with 'RNQKPKWWTATN'. The first two letters are 'R' and 'R' and the task is to get the correct value from the BLOSUM matrix for this alignment. Finding the location of the target letter (in this case 'R') in the alphabet is shown in Code 21.3. Line 1 returns the location of the target letter; in this case the same location applies to both strings. Line 3 then retrieves the value from the matrix for the alignment of an 'R' with an 'R'.
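A sketch of those two steps, under the same naming assumptions, is:

>>> n1 = blosum.PBET.index('R')   # location of 'R' in the alphabet
>>> n1
1
>>> blosum.BLOSUM50[n1, n1]       # score for aligning 'R' with 'R'
7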
Table 21.2: Possible alignments.
Now consider the third letter in each string. The task is to align a ‘D’ with a
‘Q’. The process is shown in Code 21.4. The substitution matrix is symmetric and so,
blosum.BLOSUM50[3,5] = blosum.BLOSUM50[5,3].
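The equivalent sketch for this pair is:

>>> n1 = blosum.PBET.index('D')   # 3
>>> n2 = blosum.PBET.index('Q')   # 5
>>> blosum.BLOSUM50[n1, n2]       # equal to blosum.BLOSUM50[n2, n1]
0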
Now, the process is reduced to repeating these steps for each character position in the strings. This is accomplished with the BlosumScore function shown in Code 21.5. The inputs are the substitution matrix with its alphabet, the two sequences, and a gap penalty. Since the strings could be of different lengths it is necessary to find the length of the shortest string, which is done in line 4. Then the loop begins, and situations with gaps are considered first. The alignment score for letters starts in line 11: the indexes of the two letters are retrieved and they are used to get the value from the BLOSUM matrix. The score is accumulated in the variable sc and returned at the end of the function. An example is shown after the listing.
Commonly, the two sequences are not aligned; rather, the alignment needs to be determined. The most simplistic and costliest method is to consider all possible alignments. Consider the alignment of two small sequences abc and bcd. Table 21.2 shows the five possible shifts that align the two sequences. The periods are used as placeholders and do not represent any data.
There are actually three types of shifts shown. The first two examples shift the second sequence towards the right, the third example has neither shifted, and the last two examples have the first sequence shifted to the right. It is cumbersome to create a program that considers these three different types of shift. An easier approach is to create two new strings which have the original data and empty elements represented by dots. In this case the new string t1 contains the old string Seq1 followed by N2 empty elements (where N2 is the length of Seq2). The string t2 is created from N1 empty elements followed by the string Seq2.
Code 21.5 The BlosumScore function.
1  # simplealign.py
2  def BlosumScore(mat, abet, s1, s2, gap=-8):
3      sc = 0
4      n = min([len(s1), len(s2)])
5      for i in range(n):
6          if (s1[i] == '-' or s2[i] == '-') and s1[i] != s2[i]:
7              sc += gap
8          elif s1[i] == '.' or s2[i] == '.':
9              pass
10         else:
11             n1 = abet.index(s1[i])
12             n2 = abet.index(s2[i])
13             sc += mat[n1, n2]
14     return sc
15
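A minimal example call (the import style is an assumption) uses the two strings from the earlier example; the per-position scores shown previously sum to 49:

>>> from blosum import BLOSUM50, PBET
>>> BlosumScore(BLOSUM50, PBET, 'RNDKPKFSTARN', 'RNQKPKWWTATN')
49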
t1 = 'abc...'
t2 = '...bcd'
Finding all possible shifts for t1 and t2 is quite easy. By sequentially removing the first character of t2 and the last character of t1, all possible shifts are considered. Table 21.3 shows the iterations and the strings used for all possible shifts. In this case iteration 2 would provide the best alignment.
The number of iterations is N1+N2-1 and the result of the computation is a vector with the alignment scores for all possible alignments. Code 21.6 shows the function BruteForceSlide which creates the new strings and considers all possible shifts; a sketch is given below. The scoring of each shift is performed by BlosumScore, but certainly other scoring functions can be used instead. Since BlosumScore is capable of handling strings of different lengths, it is not necessary to explicitly trim the strings for every shift. This function uses the multiplication sign with a string in line 4 to create a string of repeating characters (see Section 6.4.1.3).
This function returns a vector holding the alignment scores for every possible alignment between the two sequences. The best alignment is the one with the largest score, and its location indicates the shift of one of the sequences that is necessary to align them. Consider the case shown in Code 21.7. Lines 1 and 2 create two strings which are similar except that one string also has a set of preceding 'A's. Thus, to obtain the best alignment the string s1 needs to be shifted to the right by 5 spaces. The function
Table 21.3: Shifts for each iteration.
Iteration Strings
0 abc...
...bcd
1 abc..
..bcd
2 abc.
.bcd
3 abc
bcd
4 ab
cd
5 a
d
BruteForceSlide is called in line 3 and the set of values is returned as a vector v. In this vector there is a single value that is much higher than all of the others. The location of this maximum value is obtained by v.argmax(). This location depends on the shift necessary to align the two strings and on the lengths of the strings. This value is computed in line 4. Since the value in line 5 is positive, line 6 is used to create the aligned strings. This command adds 5 periods in front of s1 so that it will align with s2, as shown in lines 7 and 9.
A second example starts in line 11, which creates the same strings except that s1 now has the additional characters in front. The same process is followed except in this case the value in line 15 is negative. Thus, the periods are inserted in front of s2 in order to get the two strings to properly align.
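Code 21.7 itself is not reproduced here. A hedged reconstruction of the first experiment follows; the sequence and the offset arithmetic are assumptions chosen to match the description:

common = 'MKWVTFISLLFLFSSAYS'          # made-up amino acid string
s1 = common
s2 = 'AAAAA' + common                  # the second string has five preceding 'A's
v = BruteForceSlide(blosum.BLOSUM50, blosum.PBET, s1, s2)
shift = v.argmax() - len(s1)           # assumed bookkeeping; here +5
if shift > 0:
    print('.' * shift + s1)            # push s1 to the right
    print(s2)
else:
    print(s1)
    print('.' * (-shift) + s2)         # push s2 to the right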
The previous system is slow, simple, and effective as long as all of the nucleotides are known and evolution has not changed the DNA string lengths. However, neither of these is guaranteed to be true and often they are not. Therefore a more powerful method is required.
Consider the alignment of the two strings 'TGCGTAG' and 'TGGTAG'. These two strings are very similar and would align perfectly if an additional nucleotide were inserted into the second string at the third position. A gap is then inserted to space the strings so that they align properly, as in:
A = TGCGTAG
B = TG-GTAG
The difficulty in comparing two sequences then is knowing where to put the gaps. Certainly, it is possible to attempt a gap at every location. In this case sequence A would be compared to 'TGGTA-G', 'TGGT-AG', 'TGG-TAG', 'TG-GTAG', and 'T-GGTAG'. This is not a difficult task even for long strings. However, it is quite probable that gaps will be needed in more than one place. So, to perform a thorough study the strings 'T-GGTA-G', 'T-GGT-AG', etc. would also have to be considered. Furthermore, it may be necessary to consider more than one gap in a single location, so strings such as 'TGGTA--G' and 'T-GGT--AG' would have to be considered. There are also possibilities of more than two gaps, and gaps may also be necessary in the first string in the comparison. Aside from all of the possible locations for the gaps, each comparison requires several shifts of the strings just to find a best alignment. Obviously, the number of possible alignments is exponential with respect to the string lengths. Since many sequencing machines can provide information for strings with over 300 nucleotides, an exhaustive search is computationally prohibitive.
There are multiple methods to adapt to this problem. Programs such as BLAST start with small aligned segments and work towards larger alignments, employing estimations. This method is very fast but may not find the best alignments. It is used to compare a DNA or protein string to a large library. Since the amount of data is vast there are many alignments that are returned that can be studied. Often this information is sufficient to understand the purpose of the query string even if some of the best alignments were not returned by the program.
The method of dynamic programming does a much better job at inserting gaps for
the best alignment but it is computationally more expensive.
The dynamic programming approach attempts to find an optimal alignment by considering three options for each location in the sequence and selecting the best option before considering the next location. Each iteration considers the alignment of two bases (one from each string) or the insertion of a gap in either string. The best of the three is chosen and the system then moves on to the next elements in the sequence.
To handle this efficiently the computer program maintains a scoring matrix and an
arrow matrix. This program will also use a substitution matrix such as BLOSUM or PAM.
21.4.1 The Scoring Matrix
The scoring matrix, S, maintains the current alignment score for the particular alignment.
Since it is possible to insert a gap at the beginning of a sequence the size of the scoring
matrix is (N1 + 1) × (N2 + 1) where N1 is the length of the first sequence. Consider
two sequences, S1 and S2 (Code 21.8) that contain some similarities. The lengths of the
sequences are N1 = 15 and N2 = 14 and thus the scoring matrix is 16 × 15.
Alignment with a gap is usually penalized more than any mismatch of amino acids, so
for this example gap = -8 but certainly other values can be used to adjust the performance
of the system. The alignment of the first character with a gap is a penalty of -8 and the
alignment with the first two characters with two gaps is -16, and so on. The scoring matrix
is configured so that the first row and first column considers runs of gaps aligning with the
beginning of one of the sequences. Thus, the first step in construction the scoring matrix
is to establish the first row and first column as shown in Figure 21.1.
Figure 21.1: The first column and row are filled in.
The next step is to fill in each cell in a raster scan. The first undefined cell considers
the alignment of I with Q or either one with a gap. There are three choices and the
selection is made by choosing the option that provides the maximum value,
    S_{m,n} = max( S_{m-1,n} + gap,  S_{m,n-1} + gap,  S_{m-1,n-1} + B(a,b) ),    (21.2)
where B(a, b) indicates the entry from the substitution matrix for residues a and b.
Normally, the first entry in the matrix is denoted as S1,1 , but in order to be congruent
with Python the first cell in the matrix is S0,0 and the first cell that needs to be computed
is S1,1 . To be clear it should be noted that this cell aligns the first characters in the two
strings S1[0] and S2[0], thus the indexing of the strings is slightly different than the
matrix locations. In the example the cell S1,1 considers the alignment of the first two
letters in each sequence. With m=1, n=1, a=’I’, and b=’Q’ the first cell has the following
computation,

    S_{1,1} = max( -8 - 8,  -8 - 8,  0 - 3 ) = -3,    (21.3)
and the obvious choice is the third option. The results are shown in Figure 21.2.
It is necessary to keep track of the choices made for each cell. Once the entire scoring
matrix is filled out it will be necessary to use it to extract the optimal alignment. Thus,
the algorithm requires the use of a second matrix named the arrow matrix. The arrow
matrix is used to find which cell was influential in determining the value of the subsequent
cell. In the previous example, the third choice was selected which indicates that S0,0 was
the cell that influenced the value of S1,1 . The arrow matrix will place one of three integers
(0,1,2) in the respective cells and so R1,1 would contain a 2.
21.4.3 The Initial Program
The dynprog module contains several functions that are used in dynamic programming.
The first function is ScoringMatrix which creates the scoring matrix and the arrow
matrix in a straightforward manner. This function is shown in Code 21.9. It receives a
substitution matrix and its associated alphabet along with the two strings to be aligned,
and it returns the scoring matrix and the arrow matrix.
Code 21.9 The ScoringMatrix function.
1  # dynprog.py  (assumes: import numpy as np)
2  def ScoringMatrix(mat, abet, seq1, seq2, gap=-8):
3      l1, l2 = len(seq1), len(seq2)
4      scormat = np.zeros((l1+1, l2+1), int)
5      arrow = np.zeros((l1+1, l2+1), int)
6      scormat[0] = np.arange(l2+1) * gap
7      scormat[:,0] = np.arange(l1+1) * gap
8      arrow[0] = np.ones(l2+1)
9      for i in range(1, l1+1):
10         for j in range(1, l2+1):
11             f = np.zeros(3)
12             f[0] = scormat[i-1, j] + gap
13             f[1] = scormat[i, j-1] + gap
14             n1 = abet.index(seq1[i-1])
15             n2 = abet.index(seq2[j-1])
16             f[2] = scormat[i-1, j-1] + mat[n1, n2]
17             scormat[i, j] = f.max()
18             arrow[i, j] = f.argmax()
19     return scormat, arrow
These two matrices are one row and one column bigger than the lengths of the two input strings. The new lengths are determined in line 3 and the matrices are created in lines 4 and 5. The first row and column are populated in lines 6 through 8. The dynamic programming computation starts in line 11, which creates a three element vector f that holds the three possible computations for a single cell in the scoring matrix. The three possibilities are computed in lines 12 through 16. The highest score is then selected to populate a cell in both the scoring matrix and the arrow matrix in lines 17 and 18.
The program is called in Code 21.10, and with it the hard part of the dynamic programming algorithm has been accomplished. This function does perform the required steps, but it also has a double nested loop, which in an interpreted language is slow. A faster version will be shown in Section 21.4.5, but before that is reviewed the process of extracting the best alignment is pursued.
Code 21.10 Using the ScoringMatrix function.
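The listing is not reproduced; a plausible call, assuming S1 and S2 are the sequences from Code 21.8, is:

>>> scormat, arrow = ScoringMatrix(blosum.BLOSUM50, blosum.PBET, S1, S2)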
The final step is to extract the aligned sequences from the arrow matrix. The process starts at the lower right corner of the arrow matrix and works towards the upper left corner. Basically, the aligned sequences are created from back to front. Code 21.11 displays the arrow matrix for the current example. In the lower right corner the entry is a 0, which indicates that this cell was created from the first choice in Equation (21.2). It aligned the last character of the first string with a gap, and thus the current alignment is,
Q1 = 'A'
Q2 = '-'
The value of 0 also indicates that the next cell to be considered is the one above the current position, since a letter from S2 was not used. This next location in the arrow matrix contains a 2, which indicates that two letters are aligned:
Q1 = 'DA'
Q2 = 'D-'
A value of 2 indicates that the backtrace moves up and to the left. Code 21.11 shows the arrow matrix, and in bold are the locations used in the backtrace. Each time a 0 is encountered a letter from S1 is aligned with a gap and the backtrace moves up one location. Each time a 1 is encountered the letter from S2 is aligned with a gap and the backtrace moves to the left. Each time a 2 is encountered a letter is used from both sequences and the backtrace moves up and to the left.
Code 21.12 shows the Backtrace function. The aligned strings are built up in st1 and st2. The backtrace starts at the lower right corner and works its way up to the upper left in the while loop starting in line 8. For each cell it appends a letter or a gap to each sequence depending on the value in the arrow matrix. There are four choices, with lines 9, 13, and 17 representing the three choices in Equation (21.2); the choice offered in line 22 occurs when the trace reaches the top row or first column. Within each choice there is an adjustment to st1 and st2 and then a change to the current location v, h. The strings are constructed in reverse order, so the last two lines of code reverse them into the correct order.
Code 21.11 The arrow matrix.
1  >>> arrow
2  array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
3         [0, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
4         [0, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1],
5         [0, 0, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1],
6         [0, 0, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1],
7         [0, 0, 0, 0, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1],
8         [0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1],
9         [0, 0, 0, 0, 0, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1],
10        [0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1],
11        [0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 2, 1],
12        [0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 2, 1, 1, 1],
13        [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 1, 1, 1],
14        [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 2, 1, 1],
15        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 1],
16        [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 0, 2],
17        [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 0]])
The ScoringMatrix function works, but it is slow. The reason is that Python is an interpreted language and ScoringMatrix has a double nested loop. For a single alignment of sequences with about 300 bases the previous programs can find a solution in a reasonable amount of time. If the project requires many dynamic programming applications then speed becomes a serious issue. The goal then is to perform the same operations without using nested Python loops.
Consider again the function ScoringMatrix. Actually, this program is a triple nested loop. There are the two for loops in lines 9 and 10, but the third loop is a bit covert. The vector f has three elements and each of these is considered when computing lines 17 and 18. This loop is contained within the NumPy functions max and argmax. The goal of the following functions is to perform the same operations but use NumPy functions to perform some of the loops, leaving only one loop written in Python.
This process is performed with three functions. The first is FastSubValues, shown in Code 21.13. The computation for each cell in the scoring matrix will require a value from the substitution matrix, and these values need to be extracted using a single Python for loop. To accomplish this, random slicing techniques are employed (see Code 11.10). The FastSubValues function builds a matrix which contains the BLOSUM or PAM value that will be used at each location in the scoring matrix.
A partial result is shown starting in line 17. The first row and column in the scoring matrix will not use substitution values and so those are 0. The first nonzero element
Code 21.12 The Backtrace function.
1  # dynprog.py
2  def Backtrace(arrow, seq1, seq2):
3      st1, st2 = '', ''
4      v, h = arrow.shape
5      ok = 1
6      v -= 1
7      h -= 1
8      while ok:
9          if arrow[v, h] == 0:
10             st1 += seq1[v-1]
11             st2 += '-'
12             v -= 1
13         elif arrow[v, h] == 1:
14             st1 += '-'
15             st2 += seq2[h-1]
16             h -= 1
17         elif arrow[v, h] == 2:
18             st1 += seq1[v-1]
19             st2 += seq2[h-1]
20             v -= 1
21             h -= 1
22         if v == 0 and h == 0:
23             ok = 0
24     # reverse the strings
25     st1 = st1[::-1]
26     st2 = st2[::-1]
27     return st1, st2
28
Code 21.13 The FastSubValues function.
1  # dynprog.py
2  def FastSubValues(mat, abet, seq1, seq2):
3      l1, l2 = len(seq1), len(seq2)
4      subvals = np.zeros((l1+1, l2+1), int)
5      si1 = np.zeros(l1, int)
6      si2 = np.zeros(l2, int)
7      for i in range(l1):
8          si1[i] = abet.index(seq1[i])
9      for i in range(l2):
10         si2[i] = abet.index(seq2[i])
11     for i in range(1, l1+1):
12         subvals[i, 1:] = mat[[si1[i-1]]*l2, si2]
13     return subvals
14
is subvals[1,1], which has a value of -3. The first two letters in the two sequences to be aligned are 'I' and 'Q'. The alignment of these two letters is computed in scormat[1,1], and in order to make this computation the algorithm needs the substitution value from BLOSUM for 'I' and 'Q'. This value is -3 and is thus located in subvals[1,1]. The rest of the matrix is populated with the substitution values that are needed for each cell's computation. While there are three for loops in FastSubValues, the speed is still acceptable since none of these loops are nested.
One of the issues in creating the scoring matrix is that it is not possible to compute a whole row or a whole column at once: a cell requires knowledge of the cell above it and the cell to its left. However, it is possible to compute all of the values along a diagonal as shown in Figure 21.3. The elements on a contiguous line can be computed concurrently. Four such lines are shown, but they would continue until the lower right of the matrix is reached. The Python for loop then moves from one line to the next. The computations for all of the values on a line must be performed without a Python for loop.
Figure 21.3: The lines indicate which elements are computed in a single Python command.
The next step is to obtain the indexes of all of the elements along a line. The first line has only a single entry and the index for that element is [1,1]. The second line has two entries with indexes [1,2] and [2,1]. The third line has three entries, and so on. The pattern is quite easy except when the bottom or right of the scoring matrix is reached. This also depends on the shape of the matrix. In this case there are more columns than rows and so the last row will be reached before the last column.
The function CreateIlist receives the lengths of the two protein strings and then
returns a list of indexes. Again this function uses a single Python for loop.
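The original listing of CreateIlist is not reproduced here. The following is a hedged sketch that matches the behavior described, using one Python loop over the diagonal lines; the exact return format is an assumption:

import numpy as np

def CreateIlist(l1, l2):
    ilist = []
    for d in range(1, l1 + l2):
        # cells (i,j) with i + j = d + 1 lie on the d-th diagonal line
        lo = max(1, d - l2 + 1)     # clip when the right edge is passed
        hi = min(d, l1)             # clip when the bottom edge is passed
        ivals = np.arange(lo, hi + 1)
        jvals = d - ivals + 1
        ilist.append((ivals, jvals))
    return ilist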
Code 21.15 shows the purpose of CreateIlist by example, considering strings of length 5 and 4. The first item in the returned list holds the index of the element on the first line of Figure 21.3, the next item holds the indexes for the second line, and so on. At the 5th diagonal line the edge of the matrix is reached and so the pattern is modified.
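Under the sketch above, that example looks like:

>>> ilist = CreateIlist(5, 4)
>>> ilist[0]                    # the first line: cell [1,1]
(array([1]), array([1]))
>>> ilist[1]                    # the second line: [1,2] and [2,1]
(array([1, 2]), array([2, 1]))
>>> ilist[4]                    # the 5th line: the edge is reached
(array([2, 3, 4, 5]), array([4, 3, 2, 1]))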
The final function is FastNW, which is displayed in Code 21.16. This is similar in function to ScoringMatrix; however, there is only a single Python for loop. The variable f is now a matrix which contains the three dynamic programming choices for all elements along a diagonal. Its size is 3 × LI, where LI is the number of cells along the diagonal. The variable maxpos in line 21 holds the dynamic programming choice (see Equation (21.2)) for each element along the diagonal. The index i is not an integer, but one of the items from the list ilist. So, lines 22 and 23 populate all of the elements in the scoring matrix and arrow matrix along that diagonal.
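Code 21.16 is not reproduced here. A hedged sketch that is consistent with the description and with the CreateIlist sketch above follows; the line numbering and details of the original differ:

def FastNW(subvals, seq1, seq2, gap=-8):
    l1, l2 = len(seq1), len(seq2)
    scormat = np.zeros((l1+1, l2+1), int)
    arrow = np.zeros((l1+1, l2+1), int)
    scormat[0] = np.arange(l2+1) * gap
    scormat[:,0] = np.arange(l1+1) * gap
    arrow[0] = np.ones(l2+1)
    for ivals, jvals in CreateIlist(l1, l2):
        f = np.zeros((3, len(ivals)))            # three choices per cell
        f[0] = scormat[ivals-1, jvals] + gap     # gap in seq2
        f[1] = scormat[ivals, jvals-1] + gap     # gap in seq1
        f[2] = scormat[ivals-1, jvals-1] + subvals[ivals, jvals]
        maxpos = f.argmax(0)                     # best choice for each cell
        scormat[ivals, jvals] = f.max(0)
        arrow[ivals, jvals] = maxpos
    return scormat, arrow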
Code 21.17 shows the two commands needed to create the scoring and arrow ma-
trices. These results are the same as those from ScoringMatrix but the computational
speed is significantly improved.
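Under the sketches above, those two commands would be:

>>> subvals = FastSubValues(blosum.BLOSUM50, blosum.PBET, S1, S2)
>>> scormat, arrow = FastNW(subvals, S1, S2)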
There are two common cases in the application of alignments. The first is that the beginning and the end of the genes are known, and so the two strings to be aligned have a definite beginning and ending. This is a global alignment since the process aligns the entirety of both strings. It is named Needleman-Wunsch alignment after its creators, which is the reason that the fast function is called FastNW.
The second case is that the user has two strings of DNA, and inside of these are regions of interest whose beginnings and endings are not known. Thus, the user is interested in finding a portion of one string that aligns with a portion of the other. This is called local alignment since the alignment result generally uses only a part of each string. It is also called Smith-Waterman alignment, and so the function that creates the scoring and arrow matrices is named FastSW.
The Smith-Waterman algorithm is a local alignment process that attempts to find the best-aligning substrings within the two strings. It accomplishes this through only a couple of modifications. The first is to adjust the selection equation so that no negative numbers are accepted,

    S_{m,n} = max( S_{m-1,n} + gap,  S_{m,n-1} + gap,  S_{m-1,n-1} + B(a,b),  0 ).    (21.4)
The programs as presented use a standard gap penalty. The cost of each gap is the same, independent of its location and independent of any consecutive run of gaps. In some views
Code 21.18 Results from the FastSW function.
1  >>> sq1 = 'KMTIFFMILK'
2  >>> sq2 = 'NQTIFF'
3  >>> subvals = dpg.FastSubValues(B50, PBET, sq1, sq2)
4  >>> scmat, arrow = dpg.FastSW(subvals, sq1, sq2)
5  >>> scmat
6  array([[ 0,  0,  0,  0,  0,  0,  0],
7         [ 0,  0,  2,  0,  0,  0,  0],
8         [ 0,  0,  0,  1,  2,  0,  0],
9         [ 0,  0,  0,  5,  0,  0,  0],
10        [ 0,  0,  0,  0, 10,  2,  0],
11        [ 0,  0,  0,  0,  2, 18, 10],
12        [ 0,  0,  0,  0,  0, 10, 26],
13        [ 0,  0,  0,  0,  2,  2, 18],
14        [ 0,  0,  0,  0,  5,  2, 10],
15        [ 0,  0,  0,  0,  2,  6,  3],
16        [ 0,  0,  2,  0,  0,  0,  2]])
17 >>> scmat.max()
18 26
19 >>> divmod(scmat.argmax(), 7)
20 (6, 6)
a consecutive run of gaps should be more costly than isolated gaps. These approaches use an affine gap penalty, which adds extra penalties for consecutive gaps. This does complicate the program somewhat, as it is now necessary to keep track of runs of gaps when computing Equation (21.4). For the purposes of this text, this option will not be explored.
Dynamic programming can provide a good alignment, but is it the very best? Consider Code 21.20, in which two random sequences are generated that are each 100 elements in length. A substring of 10 elements is copied from the first string and replaces 10 elements in the second string. Thus, the two strings are random except for 10 elements that are exactly the same, and the Smith-Waterman algorithm should align these 10 elements only.
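Code 21.20 is not reproduced; a hedged reconstruction (the insertion positions are arbitrary choices) is:

r = np.random.randint(0, 20, 100)
s1 = ''.join(PBET[k] for k in r)      # 100 random amino acids
r = np.random.randint(0, 20, 100)
s2 = ''.join(PBET[k] for k in r)
s2 = s2[:30] + s1[10:20] + s2[40:]    # plant a 10-element copy of s1 in s2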
Code 21.20 shows an example that returns sequences much longer than the expected length of 10: the last elements match but the random letters in front of the matching sequence do not.
Code 21.21 shows a worse example. The two sequences were generated randomly as in Code 21.20. Code 21.21 shows what was generated and then the process of aligning the sequences with Smith-Waterman. The returned sequences are again much longer than 10 elements. The first ten elements match but the rest do not. This implies two things. First, the largest value in the scoring matrix was not at the end of the 10 element alignment but at some other place, 47 elements away from the end of the aligning strings. Second, there were no 0 values in the scoring matrix from this peak back to the beginning of the aligning elements.
Code 21.21 A Smith-Waterman alignment of two random sequences.
1  >>> s1
2  'KKPGHWMVRCKQGQKRVGLNRYMDNYSSPKNHMVRDHFHLWKWMPSENCPAECWADKLWYIMKSCPADQPFTALKQVIAQTEEQVNYNNVGAHMAADSCT'
3  >>> s2
4  'GGFMEGCCTPMYARTCVCDHCIGRVSERINKQGQKRVGLNLVRHGILIWHNFLVGNQVWPWLMECFQAAGSTNKVYIREVPQIRKAIDYSLQYTINIVYL'
5  >>> subvals = dpg.FastSubValues(B50, PBET, s1, s2)
6  >>> scmat, arrow = dpg.FastSW(subvals, s1, s2)
7  >>> t1, t2 = dpg.SWBacktrace(scmat, arrow, s1, s2)
8  >>> t1
9  'KQGQKRVGLNRYMDNYSSPKNHMVRDHFHLWKWMPSENC-PAECWADKLWYIMKSCP'
10 >>> t2
11 'KQGQKRVGLNLVRHGILIWHNFLVGNQ-V-WPWL-ME-CFQAAGSTNKV-YI-REVP'
Figure 21.4 shows an image of the scoring matrix in which the darker pixels represent larger values. The slanted streaks are jets that appear in the scoring matrix when alignment occurs. The main jet is quite obvious and starts at scmat[10,31] because the first two aligning elements are s1[10] and s2[31]. The alignment should be only 10 elements long, with the jet ending at scmat[20,41], but the jet does not end there. Recall that the desire is to have the largest value in the scoring matrix at the location where the two alignments end. This is a large number and it influences the subsequent elements in the scoring matrix via Equation (21.4). The third option in this equation will have two dissimilar characters, and the value returned from the BLOSUM matrix may be negative, but not negative enough to return a 0 for this option. Alignments after that may return positive values from the BLOSUM matrix and thus increase the values in the cells of the scoring matrix after the alignment has ceased to exist. This is not a trivial problem, as can be seen in Figure 21.4. The Smith-Waterman process returned a large number of characters after the alignment, and this alignment was not terminated by the fading of a jet; it was terminated because one sequence had reached its end.
While this method did return the true alignment, it also returned alignments of random characters; thus, this is not the best alignment. It should also be noticed that viewing the scoring matrix in terms of its jets reveals other possible alignments: there are secondary jets that indicate other partial alignments between these two sequences.
21.8 Summary
Sequences can have bases added or deleted, either through biology or through errors in sequencing. The locations of these insertions or deletions are unknown and their numbers are also unknown. A brute force computation that considers all possible combinations of alignments with insertions and deletions is computationally too expensive. Thus, the field has adopted dynamic programming as a method of finding a good alignment with gaps.
Creating a dynamic programming alignment can be accomplished by following the algorithm's equations. However, this creates a double nested loop, which can run slowly in Python. Thus, a modified approach is used to push one of the loops down into the compiled code, leaving only one loop in the Python script. This makes the algorithm run faster by at least an order of magnitude.
Problems
1. Create a random sequence and copy it. In the copy remove a couple of letters at
different places. Use NW to align these two sequences.
2. Repeat the above problem but change the gap penalty. Does the alignment change
if the gap penalty is -16? Does it change if it is -2?
3. Align two amino acid sequences (of at least 100 characters) using the BLOSUM50 matrix and again using the M matrix defined in Problem 9. Are the alignments significantly different?
4. Modify the BlosumScore algorithm to align DNA strings such that the 3rd element in each codon is weighted half as much as the other two.
5. Create a string with the form XAXBX where X is a set of random letters and A and
B are specific strings designed by the user. Each X can be a different length. Create
a second string with the form YAYBY where Y is a different set of random letters
and each Y can have a different length. Align the sequences using Smith-Waterman.
The scoring matrix will have two major maxima for the alignments of the A and B regions. Modify the program to extract both alignments.
6. Is it possible to repeat Problem 5 where the second string is of the form YABY?
7. Create a program which aligns two halves of strings. For example, the first string,
str1, can be cut into two parts str1a and str1b where str1a is the first K characters
and str1b are the rest of the string. The second string is similarly cut into two
parts str2a and str2b at the same K. Align str1a with str2a (using Needleman-
Wunsch) and str1b with str2b. For each alignment compute the alignment score
using BlosumScore. Is it the same value as the alignment of str1 with str2?
9. Align two proteins using a BLOSUM matrix. Replace the substitution matrix with M, where

    M_{i,j} = 5 if i = j,  and  M_{i,j} = -1 if i ≠ j.

Repeat the alignment using this substitution matrix. Does it make much of a difference?
Chapter 22
Simulated Annealing
This chapter is a precursor to machine learning techniques and explores the process of
learning through simulated annealing.
In some experiments the input variables are known and the output results are known. The part that is missing is the mathematical model that can compute the outputs from the inputs. In some cases, the model can come from a learning algorithm, which may provide an engine that computes outputs from inputs but may not provide a concise understanding of how that is accomplished.
The user can provide a mathematical model and allow the machine learning algorithm to determine the coefficients in that model. If the model is incorrect then the machine learning algorithm will fail to provide meaningful results.
Consider the case of an experiment with one input x and one output y. Three experiments are run and the results are shown in Table 22.1. This data is clearly not linear, so if the user chose the model y = ax + b then a machine learning algorithm would not be able to find correct values of a and b.
Table 22.1: Results of three experiments.
  x     y
  1    0.4
  2    1.5
  3    3.3
22.2 Simulated Annealing
The cost function is unique to each problem, and so it has to be written anew for each application.
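The cost function for this first task (Code 22.1) is not reproduced here. Since the goal, as seen in Code 22.3, is a vector w with x · w = N, a plausible sketch is:

# sketch of simann1.py's cost function; the original Code 22.1 is not shown
def CostFunction(x, w, N):
    # distance of the dot product from the target value N
    return abs(np.dot(x, w) - N)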
The second function is RunAnn, which is shown in Code 22.2. The inputs are the vector x and the target value N. Two optional inputs are the temperature temp and the annealing factor scaltmp. These control the magnitude of the allowed changes and how fast this range shrinks.
Line 4 creates the initial random vector w and sets up the initial variables. Line 9 creates a new version of w called guess. It is based on random variations of the current
Code 22.2 The RunAnn function.
1  # simann1.py
2  def RunAnn(x, N, temp=1.0, scaltmp=0.99):
3      L = len(x)                # number of elements in x
4      w = 2*np.random.rand(L) - 1
5      ok = True                 # flag to stop iterations
6      costs, i = [], 0          # store costs from some iterations
7      cost = 999999             # start with some bogus large number
8      while ok:
9          guess = w + temp*(2*np.random.rand(L) - 1)
10         gcost = CostFunction(x, guess, N)
11         if gcost < cost:
12             w = guess + 0
13             cost = gcost + 0
14         if i % 10 == 0:
15             costs.append(cost)
16         i += 1
17         temp *= scaltmp
18         if cost < 0.01 or temp < 0.001:
19             ok = False
20     return w, np.array(costs)
w. The cost is computed in line 10 and, if that cost is better, then w becomes guess in line 12 and the new cost is remembered in line 13.
The cost of every 10th iteration is kept in a list named costs in line 15. The temperature controls the magnitude of the random variations in line 9 and it shrinks a little in line 17. If the cost is low enough then the ok flag is set to False and the iterations cease. This function returns the final version of w and the costs from every 10th iteration. These are plotted in Figure 22.1. Typical behavior is that the first few iterations make great improvements and then it takes many more iterations to make small improvements.
The process is run in Code 22.3. Line 2 runs the whole algorithm, and lines 3 and 4 show that the result did provide a vector that makes x · w = 2 true.

Code 22.3 Running the annealing process.
1  # simann1.py
2  >>> w, c = RunAnn(x, N, 3)
3  >>> np.dot(w, x)
4  1.9998043019989318
22.3 A Perpendicular Problem
Consider a different case: how is it possible to tell if two vectors are perpendicular? One of the properties of perpendicular vectors is that their dot product is 0,

    x · y = 0.

So, given a set of vectors x_i, the goal is to find a vector w that is perpendicular to all of them,

    x_i · w = 0    ∀ i.
The cost function is the sum of how far away from 0 each of the dot products is.
The simulation is shown in Code 22.5, which is quite similar to Code 22.2. The cost function is so simple that it is computed in a single line, in line 5, rather than in a separate function. The rest of the algorithm is similar to the previous case.
Code 22.6 shows the call to run the simulation and the results. Line 3 computes the dot product of w with all of the x vectors. If the simulation were perfect then all of the values shown would be 0. However, line 18 of Code 22.5 allows the iteration to stop before perfection is reached.
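Code 22.6 is not reproduced here; a plausible session, using N − 1 = 9 vectors in 10 dimensions as in Section 22.5, is:

>>> vecs = 2*np.random.rand(9, 10) - 1   # nine 10-dimensional vectors
>>> w, c = RunAnn(vecs)
>>> np.dot(vecs, w)                      # every value should be near 0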
22.4 Speed
The speed at which the annealing occurs is important. This is controlled by scaltmp. If it is too fast (lower values of scaltmp) then a solution will not be found. If it is too slow (values very, very close to 1) then the computations will take a long time. The command in Code 22.7 will not produce a good answer because the decay is too fast.
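Code 22.7 is not shown; the command would be something like:

>>> w, c = RunAnn(vecs, scaltmp=0.9)   # hypothetical value; decays too quickly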
Code 22.5 The modified RunAnn function.
1  # simann2.py
2  def RunAnn(vecs, temp=1.0, scaltmp=0.99):
3      D = len(vecs[0])
4      target = 2*np.random.rand(D) - 1
5      cost = abs(np.dot(vecs, target)).sum()   # sum of inner prods
6      ok = 1
7      costs100, i = [], 0
8      while ok:
9          guess = target + temp*(2*np.random.rand(D) - 1)
10         gcost = (abs(np.dot(vecs, guess))).max()
11         if gcost < cost:
12             target = guess + 0
13             cost = gcost + 0
14         if i % 100 == 0:
15             costs100.append(cost)
16         i += 1
17         temp *= scaltmp
18         if cost < 0.001 or temp < 0.01:
19             ok = 0
20     return target, np.array(costs100)
22.5 Meaningful Answers
The computer algorithm will always produce an answer, but it may not be a good answer.
The problem may be the decay speed or an incorrect model. It is always smart to test the
answer.
Previously, it was stated that if the number of dimensions is N then N − 1 random vectors are used. Consider a case in which the number of vectors is N + 1. According to the theory, this should not work: there should not be a vector w that is perpendicular to all of the x vectors.
Code 22.8 shows the test in which 12 vectors of length 10 are created. These are used as inputs to find the vector that is perpendicular to all 12. Of course this should fail. However, the worst dot product of w with an x vector is close to 0. This indicates that the test which should have failed was actually successful. How is this possible?
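Code 22.8 is not reproduced; a hypothetical version of the test is:

>>> vecs = 2*np.random.rand(12, 10) - 1   # 12 vectors of length 10
>>> w, c = RunAnn(vecs)
>>> abs(np.dot(vecs, w)).max()            # worst dot product; close to 0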
Nothing went wrong. There is a vector whose dot product with all 12 vectors is 0: the vector of all zeros. If all elements in w were 0 then x_i · w = 0 would be true for all vectors. However, this is not really a useful vector, and even though the math held, the answer is not a valid one.
The point is that the algorithm did provide an answer, but it is up to the user to determine if this answer achieves their goals.
The simulated annealing algorithm continually attempts to find a better solution. The solution space is often considered to be an energy surface: more cost is the same idea as more energy. So, the idea is to lower the energy by lowering the cost. A two-dimensional energy surface is shown in Figure 22.2 and the goal would be to find a solution at the lowest point in the surface.
Simulated annealing starts with a single, random solution, which is equivalent to placing a ball at a random place on the surface. The process then tries a new solution by slightly altering the old solution, which is the same as moving the ball a small distance in one direction. If the height of the ball is lowered then the proposed solution is better and it replaces the current solution. The process continues until the solution can not get much better or some other stopping criterion is met.
Figure 22.2: An energy surface.
The energy surface that is shown is not too difficult, and probably any starting location would lead to the same solution. The energy surfaces in real problems, though, are not mapped out and may have many different wells. So, it is very possible for a solution to move towards the nearest well even though it is not the deepest well. The term for this is getting stuck in a local minimum. Without abandoning simulated annealing, one manner of handling local minima is to run the program several times, each with a different starting point, and keep the best answer. Even so, the best answer found is not guaranteed to be the best possible answer.
In the previous sections, simulated annealing relied on the ability to slightly alter the values of the vector elements; a value of 1.4 could become 1.5. Some data, however, is not stored as numerical values but as text. DNA, for instance, is stored as a string. It is not possible to slightly change the letter 'A' to something else, and so the simulated annealing process must be altered.
Instead of changing single elements, the textual version swaps letters. For example, if the current solution string is ABCDEFGH, then swapping the first and fifth letters produces EBCDAFGH as a possible solution.
The RandomSwap function shown in Code 22.9 performs this swap of two letters in a sequence of any length. The input a is a list of characters and its length N is computed in line 3. Line 4 creates a list of the integers from 0 to N − 1. This list is shuffled in line 5 and the first two integers indicate which two letters get swapped, which occurs in the last three lines.
Consider a very simple example of rearranging the letters of the input string to match a given pattern. The purpose of this example is simply to show how the algorithm works. A real application would be more complicated, but the ideas and steps would be about the same.
Code 22.9 The RandomSwap function.
1 # simann3.py
2 def RandomSwap( a ):
3     N = len( a )
4     r = list( range( N ) )
5     np.random.shuffle( r )
6     q = a[ r[0] ]
7     a[ r[0] ] = a[ r[1] ]
8     a[ r[1] ] = q
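As a quick check, RandomSwap can be exercised directly. Note that the input must be a mutable list rather than a string, since strings cannot be modified in place (this usage sketch assumes numpy has been imported as np):

import numpy as np
a = list( 'ABCDEFGH' )
RandomSwap( a )
print( ''.join( a ) )   # e.g. 'EBCDAFGH': two letters swapped at random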
The first necessity is a cost function. A very simple one is shown in Code 22.10. The input is the query string and the cost is the number of letters that differ from the target string. Obviously, this is a trivial task, but it should be evident that this cost function can be replaced by a more complicated one designed in accordance with the user's application. The function in Code 22.10 returns the number of mismatched letters between the query and the target, and a perfect match produces a cost of 0.
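The listing for Code 22.10 is not reproduced here. A minimal sketch consistent with its description, assuming the target is the sorted alphabet, is:

# Sketch of a mismatch-counting cost function as described for Code 22.10.
target = list( 'abcdefghijklmnopqrstuvwxyz' )
def CostFunction( query ):
    # the cost is the number of positions where query differs from target
    return sum( 1 for q, t in zip( query, target ) if q != t )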
Now the simulated annealing process is ready to be employed. The driver function, AlphaAnn, is shown in Code 22.11. Lines 3 through 5 create a string with a random arrangement of the 26 letters. The annealing process begins in Line 9 where newguess is the proposed query. Two letters are swapped in Line 10 and the cost of this proposed string is computed in Line 11. If the cost is better then the newguess becomes the guess. The iterations continue until the cost falls below 0.1, which in this case only occurs when there is a perfect match.
Code 22.12 shows the call to AlphaAnn and the results. The output string has become the target string.
The function has two default argument values, and with that configuration it will not produce the correct answer. So, the call in line 2 increases the initial temperature. It is also possible to slow the temperature decay by increasing scaltmp to a value such as 0.9999.
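Code 22.12 itself is not reproduced here, but a call consistent with this description might look like the following sketch; temp=10 and scaltmp=0.9999 are assumed values chosen so that the annealing runs long enough:

# Sketch of a call similar to Code 22.12.
guess = AlphaAnn( temp=10, scaltmp=0.9999 )
print( ''.join( guess ) )   # ideally 'abcdefghijklmnopqrstuvwxyz'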
Code 22.11 The AlphaAnn function.
1 # simann3.py
2 def AlphaAnn( temp=1.0, scaltmp=0.99 ):
3     abet = 'abcdefghijklmnopqrstuvwxyz'
4     guess = list( abet )
5     np.random.shuffle( guess )
6     ok = True
7     cost = 99999
8     while ok:
9         newguess = copy.copy( guess )
10        RandomSwap( newguess )
11        gcost = CostFunction( newguess )
12        if gcost < cost:
13            cost = gcost
14            guess = copy.copy( newguess )
15        temp *= scaltmp
16        if cost < 0.1 or temp < 0.01:
17            ok = False
18    return guess
Simulated annealing may need to be run several times with different parameters to find
the best solution. This is a common practice.
This section presents a more realistic problem using the same ideas used in AlphaAnn. In this new case there are several similar protein strings and the task is to find the consensus string. The consensus string is the one string that best aligns with the set of input strings. In this task there are several protein strings, $\{x_i,\ i = 0, \ldots, N-1\}$, and a single query string q. The idea is to find the q that best aligns with all of the x's.
This example uses the BLOSUM50 matrix and the BlosumScore function from the blosum.py module to score the comparison between pairs of amino acids. Thus, the score for a single acid in q is the sum of the scores of that amino acid compared to the amino acids in the same position in all of the x strings.
There is no guarantee that two strings will have the same length, and so Line 4 finds the length of the shortest string. Line 5 begins the consideration of each pair of letters. Lines 6 and 7 find the row and column number associated with the two letters and Line 8 retrieves the value from the BLOSUM50 matrix. The scores are summed in sc and divided by the length of the shortest string to compute the final score. An example is shown in Code 22.13.
Code 22.13 An alignment score.
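The listing for Code 22.13 is not reproduced here. A minimal sketch of a BlosumScore-style computation matching the description above is shown below; the argument order and names are assumptions, with mat being the BLOSUM50 matrix and abet its amino acid alphabet:

# Sketch of a BlosumScore-style function.
def BlosumScore( mat, abet, s1, s2 ):
    N = min( len( s1 ), len( s2 ) )   # length of the shortest string
    sc = 0
    for i in range( N ):
        r = abet.index( s1[i] )       # row for the first letter
        c = abet.index( s2[i] )       # column for the second letter
        sc += mat[r,c]
    return sc / N                     # average score per position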
It will be necessary to compare a q string to several x strings, and the cost function will be computed from all of them. Unfortunately, the cost function is not straightforward. The BlosumScore function computes a score which is better if the number is larger. The cost function prefers the opposite, where a lower number is better. In this case the score is subtracted from a large number, based on the length of the string, so that a better alignment produces a lower value, and this is used for the cost.
Code 22.14 shows the function CostFunction which receives four arguments. The seqs is a list of strings which are the x strings. The query is the q string. The mat and abet are the substitution matrix and associated alphabet. The score of each alignment is computed and the negative of this is accumulated in a variable named cost. This negative value is added to a large number on the last line. The largest value in BLOSUM50 is 15, and so the maximum score that can be achieved is 15 × L, where L is the length of the query string.
This maximum only occurs if both strings are filled with 'W's. Code 22.15 shows two examples. In the first there are two x strings and the query is not well matched to either. The cost is computed to be 301.9. The second example starts in Line 5, in which q is changed to be the second string. As seen, the cost this time is only 285.4. Note that the cost is not close to 0; in this problem the minimum cost is sought, but it will not be close to 0.
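A sketch of a CostFunction consistent with the description of Code 22.14 follows. With a query of length 20 the base value is 15 × 20 = 300, which is consistent with the costs of 301.9 and 285.4 seen above. The BlosumScore sketch from the previous section is assumed:

# Sketch of the alignment cost function described for Code 22.14.
def CostFunction( seqs, query, mat, abet ):
    cost = 0
    for s in seqs:
        cost -= BlosumScore( mat, abet, s, query )  # accumulate negatives
    return 15 * len( query ) + cost   # lower cost means better alignment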
This test uses four x strings and the goal is to find the q string that best aligns with all of them. Initially, q will be a random string from the amino acid alphabet and simulated annealing will be used to find the best q. The TestData function shown in Code 22.16 creates the four x strings. These are similar to each other but not perfectly matched.
This case is different from the one in Section 22.7.2. In that case there were 26 letters and each one could be used just once. In this case the letters can be used multiple times. So, it is not necessary to swap letters; the annealing process can simply change a letter to another one in the alphabet. This process is performed by the RandomLetter function shown in Code 22.17. Line 3 gets a random number between 0 and 20 (the length of the alphabet). This variable r is then used in Line 5 to get a single random letter from the alphabet. Lines 6 and 7 find a random location in the query string and Line 9 replaces the letter at that location with the random letter from Line 5. The returned query is the modified string.
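The listing for Code 22.17 is not reproduced here; a sketch consistent with the line-by-line description is:

# Sketch of a RandomLetter-style mutation; abet is assumed to be the
# 20-letter amino acid alphabet and query a list of single characters.
def RandomLetter( abet, query ):
    r = int( np.random.rand() * len( abet ) )    # random alphabet index
    letter = abet[r]                             # a single random letter
    k = int( np.random.rand() * len( query ) )   # random query location
    query[k] = letter                            # replace that letter
    return query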
The function RunAnn in Code 22.18 performs the simulated annealing. The inputs
Code 22.16 The TestData function.
1 # simann4.py
2 def TestData():
3     seqs = []
4     seqs.append( 'ARNDCQEGHILKMFPSTWYV' )
5     seqs.append( 'ARNDCQEHHILKMFPSTWYV' )
6     seqs.append( 'ARNDCQEAHILKMFPSTWYV' )
7     seqs.append( 'ARNDCQEAHILKMFPSTWYV' )
8     return seqs
are the set of sequences, the substitution matrix, the associated alphabet, and the optional arguments of temperature and decay constant. Lines 3 through 7 create the random string q which is the string to be modified. The iterations begin in Line 10. The new guess is created and its cost is computed in Line 11. If this cost is lower then the newguess becomes the guess and the cost takes on the lower value. The process continues until one of the conditions in Line 17 is met. The output is a string that best aligns with all of the strings in the original set.
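The listing for Code 22.18 is not reproduced here. A sketch of a RunAnn-style driver, built from the RandomLetter and CostFunction sketches above and assuming the copy module and numpy are imported, is:

# Sketch of a RunAnn-style driver; the book's version may differ.
def RunAnn( seqs, mat, abet, temp=1.0, scaltmp=0.99 ):
    L = min( len( s ) for s in seqs )   # assumed query length
    q = [ abet[ int( np.random.rand()*len( abet ) ) ] for i in range( L ) ]
    cost = CostFunction( seqs, q, mat, abet )
    while temp > 0.01:
        newguess = RandomLetter( abet, copy.copy( q ) )
        newcost = CostFunction( seqs, newguess, mat, abet )
        if newcost < cost:               # keep only improvements
            q, cost = newguess, newcost
        temp *= scaltmp                  # cool the system
    return q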
The example is executed in Code 22.19. Line 1 generates the best guess, which is returned as a list of single characters. Line 2 converts this to a string and it is printed on Line 3. Since the input data in seqs consisted of highly similar strings, it is expected that the guess string will be similar to each one of the input strings. The first input string is printed in Line 5 and as seen there are strong similarities.
The intent of the algorithm is to provide the sequence that best aligns with a set of strings. Clearly the algorithm has put forth a viable candidate, but it cannot be stated that this is the very best possible string. Often this statement is not possible to make, and the user must understand that they have computed a very good answer that may not be the best.
One method of gaining confidence in the answer is to run the simulation several times. Since there is a random start to the process, it is possible that different answers will be produced. For this simulation, Line 1 in Code 22.19 was repeated 10 times. In all 10 trials the answer was exactly the same as Line 3 in Code 22.19. While this does not prove that this is the best string, it does add confidence that this is one of the best possible strings.
Problems
1. Given two data points at (0,0) and (1,0), use simulated annealing to find a point that is a distance of 1.0 from both of the data points. Repeat several times with different seeds. How many solutions are there?
2. Given two data points at (0,0) and (1,0), use simulated annealing to find a point that is equidistant from the two data points, although the length of that distance has no restriction. Run several times with different seeds. Does the algorithm repeatedly produce the same result?
4. The previous problem should have a very good answer. If the number of random
vectors increases to 4 then it is highly likely that a perfect answer is not possible.
Use simulated annealing to find the best answer.
5. Repeat the previous problem for several cases in which the number of input vectors
is 3, 4, 5, 6, ..., 10. Plot the cost of each trial’s final answer versus the number of
vectors.
Chapter 23
Genetic Algorithms
Cases arise in which plenty of data is generated but the function that could simulate the data is not known. For example, protein sequences are known and the protein folding structures are also known, but missing is the knowledge of the function that converts a protein sequence into a folded structure. This is a very difficult problem with no easy solution. However, it illustrates the idea that plenty of data can be available without knowing the exact function that associates them.
An approach to problems of this nature is to optimize a function through the use of an artificial genetic algorithm (GA). The idea is that the GA contains several genes, each one encoding a solution to the problem. Some solutions are better than others. New genes are generated from the old ones, with the better solutions being more likely to be used to create new solutions. The new generation of solutions should be better than the previous one, and the process continues until a solution is reached or optimization has ceased.
Genetic algorithms are quite easy to employ and provide good solutions to tough problems. The downside to GAs is that they can be quite expensive to run on a computer. Before delving into GAs it is first worthwhile to explore a simpler optimization scheme that naturally leads into them.
Both simulated annealing and GA procedures attempt to find a minimum in an energy surface, but in different ways. Figure 23.1 shows a simple view of an energy surface, which can also be considered an error surface. The ball indicates the position of a solution and the error that accompanies this solution is the y-value. The purpose of optimization is to find the solution at the bottom of the deepest well.
In the case of simulated annealing there is a single solution that is randomly located (since the initial target vector is random). Variations of this vector move this solution to a
Figure 23.1: A simple view of an energy surface.
different location. Of course, large variations equate to large displacements of the solution. As the temperature is cooled the solution cannot move around as much and eventually gets trapped in a well, and further optimization moves the solution down towards the bottom of the well. Of course, there is no guarantee that the solution will fall into the correct well. The term "caught in a local minimum" describes a solution that is stuck in a well that is not the deepest.
The GA approach is different in that there are several solutions. This is similar to placing several balls on the energy surface. The GA has a two-step optimization process. First, new solutions are created from old solutions, which is equivalent to replacing the balls on the surface with a new set in which the newer balls are likely to be closer to the bottoms of the wells. The second step moves the balls slightly through a mutation step. The optimization occurs mostly by creating a new set of solutions that is better than the old set of solutions.
23.3 A Numerical GA
This section considers the case of applying a genetic algorithm to numerical data.
23.3.1 Initializing the Genes
The GA genes can be generated in several ways. Usually, the initial genes provide very poor solutions, but that will change as the algorithm progresses. Two common methods of generating the genes are described below.
The first is to use random vectors, which are just vectors with random values. However, the range of the random values should match the range of the data values. If the values in the data vectors range between -0.01 and +0.01 then the random values in the initial GA vectors should also be in that range.
The second choice is to select random vectors from the data set. The advantage is that these initial vectors will be in the same range as the data vectors. The disadvantage is that the selected starting vectors may be similar by coincidence, and data vectors that are very dissimilar to the chosen ones are not represented. This is not a devastating disadvantage though.
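The two initialization options can be sketched as follows; vecs is assumed to be the data matrix with one vector per row and NG the number of GA genes:

# Sketches of the two gene-initialization methods described above.
def InitGenesRandom( vecs, NG ):
    lo, hi = vecs.min(), vecs.max()
    # random vectors drawn from the same range as the data
    return lo + ( hi - lo ) * np.random.ranf( ( NG, vecs.shape[1] ) )

def InitGenesFromData( vecs, NG ):
    ndx = np.random.permutation( len( vecs ) )[:NG]
    return vecs[ ndx ] + 0   # copies of randomly chosen data vectors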
A GA has several genes, and the performance of these needs to be evaluated by either a scoring function or a cost function. A scoring function produces a larger value for better performance, whereas a cost function produces a lower value for better performance. The advantage of a cost function is that perfection is a cost of 0, whereas there is no single value that is the perfect score for all applications.
The cost function depends on the application. For example, if the purpose of the
GA is to find a sequence that best aligns with several sequences then the cost function
would measure the mismatches between the GA gene sequence and the other sequences in
the input. So, this function is written anew for each deployment of the GA.
The example of finding a vector that is perpendicular to others is repeated in this chapter, except that a GA is used to find the solution instead of simulated annealing. The first step is to create the cost function. This function should return a cost of 0 if the input vector is perpendicular to all vectors in a set. The dot product can be used to measure if two vectors are perpendicular: if $\vec{a} \perp \vec{b}$ then $\vec{a} \cdot \vec{b} = 0$.
This is actually a very easy cost function to create. Consider two matrices, both of which are created from vectors stored as rows. The matrix-matrix multiplication of the first matrix and the transpose of the second matrix computes the dot product of all possible pairs of vectors. Thus, the cost function for this application requires only two lines of Python script as shown in Code 23.1. Line 3 computes all of the dot products and line 4 computes the sum of their absolute values. The output is a vector where each value is associated with one of the GA genes. Thus, if there are 8 GA genes then the output will have 8 values. If any of the values is 0 then the associated GA gene has provided the perfect solution.
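The two working lines described for Code 23.1 might look like the following sketch; vecs holds the data vectors as rows and genes holds the GA genes as rows:

# Sketch of the dot-product cost function described for Code 23.1.
def CostFunction( vecs, genes ):
    dots = np.dot( genes, vecs.T )   # all pairwise dot products
    return abs( dots ).sum( 1 )      # one summed cost per GA gene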
Creating the next generation of solutions is a bit involved. The number of offspring is usually equal to the number of parents, and the offspring are generated in pairs. Thus, for each iteration two parents are chosen along with a splice point. The splice point is a random location in the vectors; the first child is created from the first part of one parent and the second part of the other parent. The second child is created from the remaining parts, as shown in Figure 23.2. The parents are selected based upon their cost functions, such that parents with a lower cost have a better chance of being selected to help generate the pair of children.
The creation of the children vectors is performed by the CrossOver function, which is not shown due to its length. The inputs are the GA genes and the costs of each of them.
Code 23.2 shows the use of this function. Line 1 generates the data, which in this case are five vectors in $\mathbb{R}^6$. Thus, it should be possible to find one vector that is perpendicular to these five. Line 2 creates the GA genes. These are candidates, and if one of them is perpendicular to all of the vectors in data then a solution is found. Of course, the genes are generated with random values and so none of them should provide a good solution. This example only creates four such genes, but usually in an application there are many more.
The costs are computed in line 3 and shown in line 5. Of course, none of these are near 0. Line 6 uses the CrossOver function to create the next generation of GA genes. The variable kids is a list of vectors which are converted to a matrix in line 7. The reason
Code 23.2 Employing the CrossOver function.
that kids is returned as a list is that this function needs to be useful for cases in which the GA manipulates non-numeric data, such as finding the best aligning string.
The costs of the kids are computed in line 8 and shown in line 10. It is expected that some of the kids are better than any of the parents. This is seen to be true, as two of the kids have a lower cost than any of the parents. This process can be repeated as shown in Code 23.3, and as seen the cost is even lower. However, the cost may not go to zero and thus another step is needed.
23.3.4 Mutation
In the previous case there were 4 GA genes and the children were created by mixing and matching parts of the parents. The children, however, cannot obtain any value that does not come from a parent. Line 2 in Code 23.4 shows the first elements of the four GA genes. Line 4 shows the first elements of the four children genes. As seen, the values in the children came directly from the parents. It is not possible for a child gene to have a value other than those from the parents.
The Mutation function will change some of the values in the GA genes so that
Code 23.4 The first elements.
1 >>> genes[:,0]
2 array([ 0.64494945, 0.13895447, 0.86637429, 0.05408412])
3 >>> kids[:,0]
4 array([ 0.86637429, 0.86637429, 0.86637429, 0.64494945])
values other than those from the parents can be obtained. Usually, only 5% of the values are changed. Thus, for the case of 4 vectors with a dimension of 6, only one of the elements will be changed. This function finds the maximum and minimum of the element values (in the previous case the max is 0.866 and the min is 0.054) and then expands that range by a small amount. The newly generated value is a random number from within this range. The reason that the range is expanded a small bit is that the perfect answer may be a value that is lower or higher than all of the values in the parent genes. For example, if the perfect value for the first element in the answer vector is 0.9 then the mutation process needs to be able to generate a random number that is larger than any of the current values in the GA genes.
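A Mutate-style function matching this description might be sketched as follows; it modifies the genes in place, consistent with the call Mutate(folks, 0.05) seen in Code 23.5:

# Sketch of a mutation function for numerical genes.
def Mutate( genes, rate=0.05 ):
    lo, hi = genes.min(), genes.max()
    span = hi - lo
    lo, hi = lo - 0.05*span, hi + 0.05*span      # expand the range a bit
    mask = np.random.ranf( genes.shape ) < rate  # about 5% of elements
    genes[ mask ] = lo + ( hi - lo ) * np.random.ranf( mask.sum() )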
The percentage of elements that are changed in this process can be altered. Usually, for numerical data a change of 5% of the total number of elements is acceptable. If this value is too high then the GA does not benefit enough from the crossover, and if the value is too low then finding the correct solution can be a very long process.
All of the components are in place and so it is possible to run the GA. Code 23.5 shows the function DriveGA, which performs a typical run. The inputs are a set of vectors, and the goal is to find a single vector that is perpendicular to all of them. Thus, the number of vectors needs to be one less than the number of dimensions. The other inputs are the number of GA genes, the dimension of those genes and a tolerance.
The random GA genes are created in line 3 and their cost is computed in line 4. The children and their costs are computed in lines 7 through 9. A mutation is enforced and the new costs are determined. The best cost and the location of that cost are collected in lines 13 and 14, and if one of the GA genes has a cost that is below the tolerance then the program terminates and returns that best GA gene.
A single run is shown in Code 23.6. The input contains two vectors for which the answer is known. These are vectors pointing in the x and y directions in three dimensions, and so the answer should point in the z direction. As seen, this is true within the specified tolerance.
This answer can be confirmed; however, the cost function is expecting a matrix for the genes input. So, line 5 converts the vector best into a single-row matrix. This is a suitable input for the CostFunction function, and as seen in line 7 the cost of this gene is
Code 23.5 The DriveGA function.
1 # ga.py
2 def DriveGA( vecs, NG=10, DM=10, tol=0.1 ):
3     folks = np.random.ranf(( NG, DM ))
4     fcost = CostFunction( vecs, folks )
5     ok = 1
6     while ok:
7         kids = CrossOver( folks, fcost )
8         kids = np.array( kids )
9         kcost = CostFunction( vecs, kids )
10        folks = kids + 0
11        Mutate( folks, 0.05 )
12        fcost = CostFunction( vecs, folks )
13        best = fcost.min()
14        besti = fcost.argmin()
15        if best < tol:
16            ok = 0
17    return folks[ besti ]
below the tolerance.
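A run similar to Code 23.6 can be sketched as follows; the two input vectors point in the x and y directions in three dimensions:

# Sketch of a confirmation run similar to Code 23.6.
vecs = np.array( [[1.0, 0, 0], [0, 1.0, 0]] )
best = DriveGA( vecs, NG=10, DM=3, tol=0.1 )
gene = best.reshape( (1,3) )         # single-row matrix for CostFunction
print( CostFunction( vecs, gene ) )  # should be below the tolerance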
23.4 Non-Numerical GA
In the previous example the genes in the GA were numerical vectors. There are cases, especially in bioinformatics, in which the information being manipulated is based on letters instead of numbers. The GA is a flexible approach that can be adapted to suit particular applications. To demonstrate this by example, the small problem of sorting data will be used.
In this problem the goal of the GA is to sort the letters of the alphabet. This uses a trivial cost function that matches a sequence from a gene to a target sequence. More complicated applications will require a more complicated cost function, but the rest of this section should be usable without significant alteration.
Before the GA can be applied to text data it is important to review some of the methods by which strings are manipulated in Python. First, the lowercase alphabet can be retrieved by typing it directly or by using ascii_lowercase from the string module. Line 2 in Code 23.7 retrieves this string and converts it to a list of individual letters. This conversion is necessary since it is not possible to change the contents of a string directly.
The GA will need to start with several random genes. In the numerical case each gene was a vector of random values. In the text case each will need to be a string of randomly arranged letters. Each GA gene will need to have all 26 letters, but arranged in a different order. Line 3 is an empty list that will eventually contain these randomly arranged alphabets. Line 4 creates a duplicate alphabet named ape. This is a list of single characters and not a string. This list can be rearranged using the shuffle function from the np.random module. The contents of ape are rearranged and the list is appended in line 6. Note that the copy function is used to create a duplicate of the list. If folks.append(ape) were used, then every entry in folks would refer to the same list in memory instead of an independent copy; each time that ape changed, all of the lists inside folks would also change. The use of copy.copy creates a wholly different arrangement of the letters and appends it to folks.
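The steps described for Code 23.7 can be sketched as follows; the comments point out why the copy is essential:

# Sketch of the steps described for Code 23.7.
import copy, string
import numpy as np
abet = list( string.ascii_lowercase )  # alphabet as a list of letters
folks = []                             # will hold the GA genes
ape = copy.copy( abet )                # duplicate alphabet
np.random.shuffle( ape )               # rearrange the letters
folks.append( copy.copy( ape ) )       # append an independent copy
np.random.shuffle( ape )               # later shuffles leave folks[0] alone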
The Jumble function shown in Code 23.8 creates the random strings. The inputs are the alphabet, which in this case is all lowercase letters, and the number of genes desired, ngenes. The function is adaptable to other applications; for example, if random DNA strings are desired then abet is a list of the four DNA letters. For GA applications ngenes should be an even number since the child genes are created in pairs.
Code 23.9 calls the Jumble function and demonstrates that the GA genes are
different from each other. Line 5 converts the list of characters back to a string for
easy viewing. The join function is reviewed in Code 6.38.
Every application of the GA requires a unique cost function. In this application the goal is to create a string that is sorted in alphabetical order. This is a very simple application, but the goal is to demonstrate how the functions are used rather than to generate new, previously unknown answers. The CostFunction shown in Code 23.10 is this simplistic cost function. Basically, it compares every list of characters in genes to the target and counts the number of mismatches. Thus, a perfect cost is 0 and the absolute worst cost is 26. As seen in the last lines, random strings have a high cost, but this is expected.
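A sketch of such a cost function for this application follows; genes is a list of letter lists and target is assumed to be the alphabet in sorted order:

# Sketch of the mismatch-counting cost for the sorting GA.
def SortCost( genes, target ):
    costs = np.zeros( len( genes ) )
    for i in range( len( genes ) ):
        # count positions where the gene differs from the target
        costs[i] = sum( 1 for a, b in zip( genes[i], target ) if a != b )
    return costs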
The CrossOver function is capable of creating children genes for either numerical or textual data. Therefore, a new CrossOver function is not required. However, the CrossOver function will produce some children that are undesirable for this particular application. In this project it is required that each string contain each character exactly once. The CrossOver function will produce children genes that violate this rule. Therefore, it is necessary to create a new function that ensures that all of the children have the requisite alphabet.
Code 23.11 shows the Legalize function, which ensures that a gene has each letter. The inputs are the valid letters and a single GA gene. So, if there are 10 GA genes this function will need to be called 10 times. Lines 6 and 7 count the number of times each letter occurs in the gene. If the gene were legal then this count would be 1 for all letters. However, if a letter is duplicated then another letter is missing. So, lines 8 and 9 get the indexes of the missing letters and the duplicate letters. For example, if the letter 'a' occurs twice and the letter 'c' is missing then mssg would be a list with the single entry 2 because valid[2] is the letter 'c'. Likewise, the list of duplicates, dups, would have a single entry 0, indicating that it is the first letter in valid that is duplicated. If the gene has more letters that are duplicated and missing then the lists mssg and dups would be longer.
Code 23.11 The Legalize function.
1 # gasort.py
2 def Legalize( valid, gene ):
3     LV, LG = len( valid ), len( gene )
4     cnts = np.zeros( LV, int )
5     lgene = list( gene )
6     for i in range( LV ):
7         cnts[i] = lgene.count( valid[i] )
8     mssg = np.nonzero( cnts==0 )[0]
9     dups = np.nonzero( cnts==2 )[0]
10    np.random.shuffle( dups )
11    for i in range( len( mssg ) ):
12        k1 = lgene.index( valid[ dups[i] ] )
13        k2 = lgene.index( valid[ dups[i] ], k1+1 )
14        if np.random.rand() > 0.5:
15            me = k1
16        else:
17            me = k2
18        gene[ me ] = valid[ mssg[i] ]
The for loop starting in line 11 contains the process of replacing the duplicates with the missing letters. The duplicate list is rearranged in line 10. The variables k1 and k2 are the indexes of the two occurrences of a duplicate. Lines 14 through 17 determine which of the two duplicates is to be replaced, and line 18 performs the replacement.
Two tests are shown in this code. Line 20 creates a test string in which the letter 'c' is missing and the letter 'a' is duplicated. The first test shows the result in line 23, which replaces one 'a' with a 'c'. However, the selection is a random process; the second test shows that either 'a' can be replaced.
Code 23.12 shows the use of the Legalize function. The children genes are created in line 1 and each is sent to the Legalize function to ensure that all letters exist in each gene. The cost of the children can then be computed, and as expected the costs are slightly lower.
23.4.4 Mutation
The Mutation function also has to be changed. In the numerical case the mutation altered the numerical values. In this case, the mutation is to swap the positions of two letters.
A simple mutation function is shown in Code 23.13. Random locations are selected and the letters at those locations are swapped.
23.4.5 Running the GA for Text Data
All of the components are in place and so it is possible to run this example test. The DriveSortGA function shown in Code 23.14 performs the complete task. This follows the same protocol as the numerical case, with the inclusion of the Legalize function.
The final lines show the call to the function and the ensuing results. As seen, this GA has performed the simple task of sorting letters alphabetically. In a real application the cost function would be replaced to accommodate the user's task, but the steps shown in DriveSortGA would be the same.
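The listing for Code 23.14 is not reproduced here. A sketch of the protocol it follows, assembled from the book's Jumble, CrossOver, and Legalize functions plus the SortCost and RandomSwap sketches above, is:

# Sketch of a DriveSortGA-style driver; the book's version may differ.
def DriveSortGA( abet, ngenes=10 ):
    target = sorted( abet )
    folks = Jumble( abet, ngenes )
    fcost = SortCost( folks, target )
    while fcost.min() > 0.1:
        kids = CrossOver( folks, fcost )
        for k in kids:
            Legalize( abet, k )   # restore a full alphabet in each gene
            RandomSwap( k )       # mutation: swap two letters
        folks = kids
        fcost = SortCost( folks, target )
    return ''.join( folks[ fcost.argmin() ] )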
23.5 Summary
Machine learning encompasses a field in which the program attempts to train on the data at hand. There are several algorithms in this field, and one that is widely used in bioinformatics is the genetic algorithm (GA). The GA contains a set of data genes (which can be vectors, strings, etc.) and through several iterations attempts to modify the genes to provide an optimal solution. This requires the user to define the metric for measuring the optimal solution. The unique quality of a GA is that new genes are constructed by mating old genes: new genes are generated by copying parts of the older genes. GAs tend to use many iterations and can be quite costly to run. However, they can provide solutions that are more difficult to obtain using simpler methods.
Problems
1. Create a GA that starts with random DNA strings of length N. Create a cost function such that the GA will compute the complement of a DNA target string.
2. Given 8 random vectors of length 9, create a GA program that will find the vector that is perpendicular to the original 8 and that also has a length of 1.
3. Consider the parity problem in which the training data is (000:0), (001:1), (010:1), (011:0), (100:1), (101:0), (110:1), and (111:0), where $(x_1x_2x_3 : y)$ is a three-dimensional input and its associated one-dimensional output. Create a GA that determines the coefficients $a, b, c, d, e, f, g, h$ in the function $y' = ax_1x_2x_3 + bx_1x_2 + cx_1x_3 + dx_2x_3 + ex_1 + fx_2 + gx_3 + h$.
4. Given the same data as in the previous problem, create a GA that determines the coefficients for the functions $z = \Gamma(ax_1 + bx_2 + cx_3)$ and $y' = dx_1 + ex_2 + fx_3 + gz$, where $\Gamma(w) = 1$ if $w > 0.5$ and is 0 otherwise.
5. Create a GA that creates a consensus sequence in which the cost is halved if the GA gene is one of the original training sequences. Use the data from Code 22.16.
Chapter 24
Aligning two sequences is relatively straightforward. Aligning multiple sequences adds a new complication, and there are two types of approaches. The greedy approach attempts to find the best pairs of sequences that align and to build on those alignments. The non-greedy approach attempts to find the best group of alignments. The advantages of the greedy approach are that the programming is not too complicated and the system runs fast. The advantage of the non-greedy approach is that the performance is usually better.
Figure 24.1 is a standard depiction of multiple sequence alignment. There are four se-
quences labeled A, B, C and D. Each one has an associated arrow. Any arrow pointing
to the left means that the complement of the sequence is being used. The position of the
arrows shows the shift necessary to make them align.
There are two issues that need to be raised. The first is that some alignments have
disagreements and the second is the issue of using complements.
Consider a case in which sequences A and B are aligned as shown and that this alignment has good agreement. In the overlapping regions the letters in A match the letters in B. Now, consider the case of aligning B and C. Again, in the overlapping regions the two sequences are in agreement. However, there is no guarantee that the segments of A and C that overlap each other, without also overlapping B, are in agreement. Since there are repeating and similar sequences throughout a genome, this type of problem is possible.
The second issue is that of complements. In the rest of this chapter complements will not be used, because they would unnecessarily complicate the discussion of aligning multiple sequences. However, in many applications it is necessary to consider the complement. In these cases the sequencing machine can provide a sequence but does not indicate on which DNA strand it resides. Therefore, it is necessary to consider aligning either a sequence or its complement. Once one of these is used, the other needs to be removed from further consideration.
Two types of algorithms will be considered here. The first is a greedy approach and the second, in Section 24.3, is a non-greedy approach.
In the greedy approach the algorithm considers all alignment pairs and begins building the assembly from the best pairs. This approach is faster but less accurate than the non-greedy algorithm. The best alignments are joined together to create a contig, which is a contiguous string of DNA.
It is possible that multiple contigs will need to be created during the construction of the assembly. Consider the alignment shown in Figure 24.2. Sequences A and B strongly align and create a contig. The next best alignment is C with D. These create a different contig. The third best alignment is B with C. This can be used to join the two contigs to create a single assembly as shown.
The greedy approach starts with a comparison of all pairs of sequences. With four sequences the following alignments would be computed: (s1, s2), (s1, s3), (s1, s4), (s2, s3), (s2, s4), and (s3, s4). This information can be contained in a triangular matrix, M,

$$M = \begin{pmatrix} 0 & s_1,s_2 & s_1,s_3 & s_1,s_4 \\ 0 & 0 & s_2,s_3 & s_2,s_4 \\ 0 & 0 & 0 & s_3,s_4 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \tag{24.1}$$
Each element of M keeps the score of the alignment of two sequences. Assuming that a
large score indicates a better match we can find the best of all possible pairings by finding
the largest value in the matrix.
24.2.1 Data
The data used in the examples must have the property of overlapping subsequences. For now these overlaps will be perfect and the sequences will not have gaps. The function ChopSeq shown in Code 24.1 receives an input sequence, the desired number of subsequences, and the length of these subsequences, all of which will have the same length. Most of the segments will be selected at random, and so there is no guarantee that the first or last part of the input sequence would otherwise be included. So, lines 4 and 5 put the first and last parts of the input sequence into the list segs. The variable laststart is the last location in the sequence where a segment can begin. Any location after that would produce a shorter segment because it reaches the end of the input sequence. The for loop then extracts segments at random locations. There is no guarantee that every element in the input sequence will be included in the segments.
Code 24.1 The ChopSeq function.
1 # aligngreedy.py
2 def ChopSeq( inseq, nsegs, length ):
3     segs = []
4     segs.append( inseq[:length] )
5     segs.append( inseq[-length:] )
6     laststart = len( inseq ) - length
7     for i in range( nsegs-2 ):
8         r = int( np.random.rand() * laststart )
9         segs.append( inseq[ r:r+length ] )
10    return segs
Code 24.2 shows the use of this function. The sequence is created in line 2 and the segments are extracted in line 3. This creates 10 sequences, each of length 8. The rest of the code shows that the first two sequences are the beginning and ending of the initial sequence, and the rest are from random locations.
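A usage sketch consistent with the description of Code 24.2 is:

# Sketch of the Code 24.2 usage; the alphabet serves as a toy sequence.
seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
segs = ChopSeq( seq, 10, 8 )   # 10 segments, each 8 letters long
print( segs[0], segs[1] )      # the beginning and the end of seq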
The number of sequences and their lengths depend on the sampling that one desires. Usually, the minimum is 3-fold sequencing, which means that each element in the input sequence should appear on average in three segments. Of course, with random selection some elements will appear in more. In this case the input sequence is 26 elements long. If 3-fold sequencing is desired then the output should have a total of 26 × 3 = 78 elements. The desire is that each segment have a length of 8, so 10 sequences will be required since 78/8 = 9.75. Better performance is achieved if the value of n in n-fold sequencing is increased. If the desire is to have 6-fold sequencing then 20 segments of length 8 will be needed.
The final comment on data is that each segment will need an identifier. In real
applications this could be the name of the gene in the sequence or some name that identifies
which experiment produced the data. In this case, the data will be faked and therefore
Code 24.2 Using the ChopSeq function.
the names of the sequences will simply be ’s0’, ’s1’, ’s2’, etc.
In the greedy approach pairs of alignments are considered. Consider a single pair with two sequences designated as sa and sb. The matrix M is used to determine which sequences are to be aligned. The maximum value in M corresponds to two sequences, and these are then considered to be sa and sb.
There are four choices, which are listed below.
- If neither sequence is in a contig, then create a new contig from the aligned pair.
- If one sequence is already in a contig, then add the other sequence to that contig.
- If the two sequences are in different contigs, then join the two contigs.
- If both sequences are already in the same contig, then do nothing.
Initially, there are no contigs and so only the first choice is possible. Then, as other pairs of alignments are considered, the other choices come into play.
The process repeats until all elements in M that are above a user-specified threshold have been considered. There is no guarantee that all of the contigs will be joined together. It is possible that at least one element in the input string does not appear in any segment. In that case two contigs will not overlap, and so the final assembly includes multiple contigs.
This example follows all of the steps necessary to make an assembly using an amino acid string. There are several functions here which are not explored in detail, but rather are just discussed and then used. Readers interested in how the functions work are invited to explore them in aligngreedy.py.
This example is divided into sections. First there is the collection of the data, second
is the computation of pairwise alignments, third is the creation of initial contigs, fourth
is the process of adding sequences to existing contigs, fifth is the joining of contigs and
finally there is a driver function that can be called to create an assembly.
24.2.3.1 Data
For this example a protein from a bacterium is used. Code 24.3 shows the necessary steps. The file is read in line 3 and one of the proteins is extracted in line 5. This particular protein has 185 amino acids.
The next step is to chop this sequence into subsequences such that overlaps are common. Line 2 in Code 24.4 creates 8 substrings that are 50 characters long. Thus, each string is about one-fourth of the original string. This is an uncommonly long segment for such a string, but it facilitates the discussion of the example. Line 1 sets an initial random seed, which is used only if the reader wishes to replicate the results in the following codes. If this line is changed or eliminated then the cut locations used in creating the substrings will be different and the results will not mirror the following examples.
The greedy approach relies on the pairwise alignments of the sequences. Thus, all possible pairs are aligned and scored. For each alignment there are two values that are kept. The first is the score of the alignment and the second is the shift required to make this alignment. These values are returned as two matrices.
The FastMat function is used in line 1 of Code 24.5 to compute the alignment of all possible pairs. Since there are 8 sequences the returned matrix is 8 × 8. The matrix M contains the scores of the alignments using the BruteForceSlide function. It is not necessary to align a sequence with itself, and so the diagonal elements are 0. The alignment score for sequence A with sequence B is the same as for sequence B with sequence A, and therefore only half of the matrix is populated. As seen, some of the scores are quite high (above 90) and many are very low. The sequence pairs that have overlap create a high score and those that have no overlap create low scores. The user must decide what is a valid alignment, which is the same as setting a threshold of acceptance. If the threshold is too high then sequences with some overlap will be discarded. If the threshold is too low then the program will align sequences with bad matches. The threshold value depends on the sequence length, the scoring algorithm and the substitution matrix that is used. Commonly, a threshold of less than half of the maximum is sufficient. In this case the threshold is set at γ = 50. It should be noted that the selection of the threshold is not critical; the same results can be obtained with a higher threshold.
It will be necessary to align pairs of sequences as the contigs are constructed. Thus, it is prudent to store the shifts required to achieve the alignment scores. These are stored in the matrix L, of which a few of the values are shown here. These will be used later.
24.2.3.3 Initial Contigs
The assembly will consist of one or more contigs. In Python the assembly is a list of contigs. Each contig is itself a list which contains information about each string in the contig. Each of these entries is a list of two items: the string name and the shifted string.
Line 1 in Code 24.6 creates an empty list that will soon be populated. In the greedy approach the best alignments are considered first. These alignments have the largest values in the matrix M. Line 2 uses the divmod function to find the location of the largest value in the matrix (see Code 11.23). In this example, the largest value is at M[2,4], which indicates that the sequences segs[2] and segs[4] are the two that align the best in this data set. The value of M[2,4] is 331, which indeed is the largest value in the matrix.
1 >>> smb = []
2 >>> v, h = divmod( M.argmax(), 8 )
3 >>> v, h
4 (2, 4)
The function ShiftedSeqs returns two sequences after alignment. Basically, it puts the correct number of periods in front of one of the sequences to align them. This correct number is based on the lengths of the sequences and the shift value stored in the L matrix. Code 24.7 shows this first alignment. As this is the highest scoring alignment, it is expected that the overlap is significant. As seen in line 6, only one period was required to create the alignment.
This is the first pair of sequences considered, and therefore the only possible action is to create a new contig. This is accomplished by the NewContig function. Line 1 in Code 24.8 shows the use of this function. It receives the assembly smb, the two aligned sequences, and their names. This creates a list with two items, each of which is a list containing the name and the aligned sequence.
The function ShowContigs is used to display the assembly. Currently, the assembly consists of a single contig. If there is more than one contig then a newline will
Code 24.8 Using the NewContig function.
separate them in the display. This function shows the first 50 bases in the alignment. The function can receive a second argument that will start the display at a new location. Thus, ang.ShowContigs(smb, 90) will show 50 bases starting at location 90.
This completes the processing of the best pairwise alignment. The next step is to consider the second best pairwise alignment. To find this alignment the largest value in M is replaced with a 0. This is shown in line 1 of Code 24.9. Now the largest element in M is indicative of the second best pairwise alignment. This value is 312, which is still well above the threshold of γ = 50. The location of this second best alignment is (1, 3). This indicates that this alignment uses two strings that are not yet in the assembly.
1 >>> M[v,h] = 0
2 >>> M.max()
3 312
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (1, 3)
For each pairwise alignment there are four choices, as depicted in the bulleted list in Section 24.2.2. Since neither of the sequences is in any other contig, the choice is to create a new contig. This is shown in Code 24.10. The aligned sequences are created in line 1 and placed in a new contig in line 2. Line 3 calls the ShowContigs function, which now displays the two contigs separated by a newline.
7 s1 .........GKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP
8 s3 TGRSPQQGKGKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP
24.2.3.4 Adding to a Contig
The process repeats as shown in Code 24.11. Line 1 removes the largest value, and the next largest value is found to be above the threshold. The location is at (0, 2). In this case one of the sequences is already in a contig, and so a new decision is required. Instead of creating a new contig the task is to add segs[0] to the contig that contains segs[2]. There are a couple of steps required to do this. First, the location of segs[2] is required: both which contig the sequence is in and its position inside that contig. After that, the alignment of the two sequences will have to be adjusted to also align with the sequences currently in the contig.
1 >>> M[v,h] = 0
2 >>> M.max()
3 260
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (0, 2)
7 >>> sa, sb = ang.ShiftedSeqs( segs[v], segs[h], L[v,h] )
8 >>> ang.Finder( smb, ids[2] )
9 (0, 0)
The Finder function determines the location of a sequence within the assembly, as shown in lines 8 and 9. This function returns two values: the first is the contig and the second is the location within the contig. In this case, segs[2] is located in the first contig and is the first sequence in that contig.
The next step is to add segs[0] to the first contig. In order to do this, the sequence sb (which is segs[2] aligned with segs[0]) must be synchronized with the copy of segs[2] which is already in the contig. In this case, several periods are required at the beginning of sb in order to align it with sa. The sb is segs[2] with prefix periods, while the first sequence in the contig is segs[2] without any prefix periods. In order to align segs[0] with all sequences in the first contig it is necessary to insert prefix periods into all items in the first contig so that sb aligns with the first sequence.
Code 24.12 shows this process. The Add2Contig function is called in line 1. It receives the assembly, the sequence that is already in a contig, the sequence that is to be added to the contig, the name of that sequence, and the two values returned by Finder. This aligns the new sequence with the contig. As seen in the display, it was necessary to add several periods in front of all of the items previously in the first contig in order to align them with the new sequence.
In this case all of the items in the contig needed prefix periods. There is a second case that also has to be considered by the Add2Contig function. In the future it may be necessary to add a new sequence to this contig because it aligned with segs[4]. The sequence in the
Code 24.12 Using the Add2Contig function.
7 s1 .........GKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP
8 s3 TGRSPQQGKGKSAATWCLTLEGLSPGQWRPLTPWEENFCQQLLTGNPNGP
contig already has several prefix periods, and these will need to be added to the incoming sequence to align it with the rest of the contig. The function Add2Contig considers all of the necessary prefix additions to make the alignment valid.
The next largest value in M indicates that segs[0] aligns well with segs[4]. These two are already in the same contig and so nothing needs to be done.
1 >>> M[v,h] = 0
2 >>> M.max()
3 255
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (0, 4)
The process continues. New contigs are created or sequences are added to contigs as necessary. Code 24.14 shows that the next best pairwise alignment is for sequences segs[5] and segs[6]. Neither of these is in a contig, and so a new contig is created.
1 >>> M[v,h] = 0
2 >>> M.max()
3 254
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (5, 6)
7 >>> sa, sb = ang.ShiftedSeqs( segs[v], segs[h], L[v,h] )
8 >>> ang.NewContig( smb, sa, sb, names[v], names[h] )
Code 24.15 shows that the next best alignment is for segs[3] and segs[7]. The segs[3] is already in a contig, and the Finder function indicates that it is the second item in the second contig. So, Add2Contig adds segs[7] to this contig.
1 >>> M[v,h] = 0
2 >>> M.max()
3 154
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (3, 7)
7 >>> sa, sb = ang.ShiftedSeqs( segs[v], segs[h], L[v,h] )
8 >>> ang.Finder( smb, ids[3] )
9 (1, 1)
10 >>> ang.Add2Contig( smb, sa, sb, ids[h], 1, 1 )
The call to Add2Contig needs a little attention. The second argument to this function is the sequence that is already in a contig. This is either sa or sb. In Code 24.12 this was sb, but in Code 24.15 it is sa. The third argument is the sequence that is to be added to the contig and the fourth argument is the name of that sequence.
24.2.3.5 Joining Contigs
The next step in the process is shown in Code 24.16. This pairwise alignment mates segs[5] and segs[7]. In this case both of these are already in separate contigs. As seen in lines 7 and 8, segs[5] is in the third contig in the first position. As seen in lines 9 and 10, segs[7] is in the second contig in the third position.
1 >>> M[v,h] = 0
2 >>> M.max()
3 119
4 >>> v, h = divmod( M.argmax(), 8 )
5 >>> v, h
6 (5, 7)
7 >>> ang.Finder( smb, ids[v] )
8 (2, 0)
9 >>> ang.Finder( smb, ids[h] )
10 (1, 2)
11 >>> sa, sb = ang.ShiftedSeqs( segs[v], segs[h], L[v,h] )
The decision here is to join the two contigs. Probably one of the contigs will need to be shifted to align with the other. This requires that a set of prefix periods be added to all of the sequences in one of the contigs. Once aligned, a new contig is created from both of these contigs and the old contigs are destroyed. Thus, this new contig will be the last one in the assembly.
This process is shown in Code 24.17. It uses the JoinContigs function, which receives several arguments. The first is the assembly, which will be modified. The next two are the contig numbers from the returns of the Finder function. The next two are the locations in those contigs. The final two arguments are the aligned sequences.
7 s5 .............AKIITEPDFPPRNPPIRYRASIPTSWLSITLTEGRNR
8 s6 GITFADYPTRPAIAKIITEPDFPPRNPPIRYRASIPTSWLSITLTEGRNR
9 s1 ..................................................
10 s3 ..................................................
11 s7 .............................................EGRNR
Line 2 uses the ShowContigs function to show the assembly. This shows only the first 50 letters in each string. In this case some of the strings have been shifted by more than 50 spaces, and so they appear only as periods. The ShowContigs function has a second argument which is the location at which the display should begin. This is shown in Code 24.18. This display starts at location 40, and so the first 10 elements of each string should match the last 10 in the previous display. In this window the content of some of the other strings can be seen.
6 s5 SITLTEGRNRQVRRMTAAVGFPT
7 s6 SITLTEGRNR
8 s1 ..........................................GKSAATWC
9 s3 .................................TGRSPQQGKGKSAATWC
10 s7 .....EGRNRQVRRMTAAVGFPTLRLVRVQIQVTGRSPQQGKGKSAATWC
24.2.3.6 The Assembly
Now all four possible decisions have been considered. Each time a new pairwise alignment is considered the choice is to create a new contig, add to a contig, join contigs, or do nothing.
There are a few more housekeeping details necessary to make the full assembly. The first is that the process should stop once the remaining values of M are below a user-defined threshold. As seen in this example, the value of each considered alignment is less than the previous one. Eventually, the pairwise alignment being considered has a score that is below the threshold. This indicates that the pairwise alignment is poor and should not be considered in the assembly. At this juncture the process is complete.
The second item is that there is no guarantee that all of the segments were used in the alignment. Thus, the use of each item needs to be tracked. Those segments which are not in any contig still need to be included in the assembly. Each of these sequences is placed in its own contig and then appended to the end of the assembly.
The function Assemble is shown in Code 24.19. This function receives a list of the sequence names, the list of the sequences, the substitution matrix and its alphabet, and the user-defined threshold. Line 3 creates a vector of 0's; the length of this vector is the number of sequences. When a sequence is placed in a contig, the corresponding location in this vector is set to 1 (line 24). At the end, any 0's in this vector indicate that a sequence was not used in any contig.
Line 4 creates the M and L matrices. The best location in M is found in line 9. The Finder function is called twice to determine if either sequence is in a contig. If Finder returns (-1,-1) then the sequence is not in a contig. The sequences are aligned in line 14. Then there are four if statements that consider the possible choices. Lines 18 and 20 both call Add2Contig, but the order of the inputs is different. The first one is used if s1 is found in a contig and the second is used if s2 is found in a contig. Line 22 is used if contigs need to be joined. If the current value of M[v,h] is below the threshold then the process exits the while loop. The final part, in lines 27 through 29, creates contigs with single sequences for all of those sequences that were not used in any contig.
The call to Assemble is shown in Code 24.20. The names and the sequences have been previously created. In this case, the alignment uses the BLOSUM50 matrix and its alphabet. The user threshold is set to 59. The first 50 elements are shown. In this case all sequences are used in the assembly.
The module aligngreedy.py has a second function named AssembleML which performs the same task except that the matrices M and L are computed outside of the function. The reason is that creating these two matrices is by far the most time-consuming part of the computation. If the user wishes to try several assemblies (perhaps with different threshold values) then it is prudent that the time-consuming computation not be repeated.
Code 24.19 The Assemble function.
1 # aligngreedy.py
2 def Assemble( fnms, seqs, submat, abet, gamma=500 ):
3     used = np.zeros( len( fnms ) )
4     M, L = FastMat( seqs, submat, abet )
5     ok = 1
6     smb = []
7     nseqs = len( seqs )
8     while ok:
9         v, h = divmod( M.argmax(), nseqs )
10        if M[v,h] >= gamma:
11            vnum, vseqno = Finder( smb, fnms[v] )
12            hnum, hseqno = Finder( smb, fnms[h] )
13            print( M[v,h], v, h )
14            s1, s2 = ShiftedSeqs( seqs[v], seqs[h], L[v,h] )
15            if vnum == -1 and hnum == -1:
16                NewContig( smb, s1, s2, fnms[v], fnms[h] )
17            if vnum != -1 and hnum == -1:
18                Add2Contig( smb, s1, s2, fnms[h], vnum, vseqno )
19            if vnum == -1 and hnum != -1:
20                Add2Contig( smb, s2, s1, fnms[v], hnum, hseqno )
21            if vnum != -1 and hnum != -1 and vnum != hnum:
22                JoinContigs( smb, vnum, hnum, vseqno, hseqno, s1, s2 )
23            M[v,h] = 0
24            used[v] = used[h] = 1
25        else:
26            ok = 0
27    notused = (1-used).nonzero()[0]
28    for i in notused:
29        smb.append( [( fnms[i], seqs[i] )] )
30    return smb
Code 24.20 Running the assembly.
24.3 The Non-Greedy Approach
The greedy approach is based on finding the best pairs of alignments. While there is some logic to this approach, it does not necessarily find the best overall alignment. The non-greedy approach scores only the total alignment and does not attempt to find the best pairs of alignments. There are many different non-greedy approaches, of which only one is presented here.
The example approach uses a genetic algorithm (GA) to create several sample assemblies and then optimizes by creating new assemblies from the best of the older assemblies. Each gene creates an assembly, and each assembly contains multiple contigs. Each contig is used to generate a consensus sequence. The assembly is converted to a catsequence, which is the concatenation of the consensus sequences. The goal in this case is to find the assembly that creates the shortest catsequence, and thus the cost of a gene is the length of the catsequence that it eventually generates.
The data for this system is generated as in the greedy case. Code 24.21 reviews the
commands needed to generate the data for this section.

Code 24.21 The commands for an assembly.
The gene for the GA needs to encode a method by which an assembly is created. In
the greedy case the assembly was created by considering pairs of sequence alignments in
order of their alignment score. In the non-greedy case the alignment scores for pairs of
sequences are not used in this manner. Rather, an assembly is created from a random
ordering of alignment pairs. The matrix M contains the scores for the alignments, and in
this case its sole purpose is to provide a list of possible alignment pairs: the elements in M
which are above a small threshold. Code 24.22 uses the function BestPairs which creates
a list of all elements in M that are above a threshold γ. Each entry in the list is the (v, h)
location of an M[v,h] value that qualifies. In this case the data generated 90 elements in M
that were above the threshold of 5. The first ten of these are shown.
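Code 24.22 is not reproduced here, but a possible form of BestPairs, assuming only the interface implied by the surrounding text, is:

# A sketch of BestPairs: collect the (v,h) locations in M above gamma.
def BestPairs( M, gamma ):
    pairs = []
    nseqs = len( M )
    for v in range( nseqs ):
        for h in range( nseqs ):
            if M[v,h] > gamma:
                pairs.append( (v,h) )
    return pairs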
Code 24.23 Showing two parts of the assembly.
1 >>> ids = []
2 >>> for i in range( 15 ):
3         ids.append( 's' + str(i) )
4 >>> smb = ngd.Gene2Assembly( range(90), hits, chops, ids, L )
5 >>> greedy.ShowContigs( smb )
6 s6   ..................................................
7 s14  ..................................................
8 s1   ..................................................
9 s3   .............................................ARYIV
10 s5   .REELVRKEIQLANITEFDFCFPTPLFFLNYFLRISGQTQESMLFARYIV
11 s9   NREELVRKEIQLANITEFDFCFPTPLFFLNYFLRISGQTQESMLFARYIV
12 s2   ..................................................
13 s8   ..................................................
14 s12  ..................................................
15 s13  ..................................................
16 s4   ..................................................
17 s0   ..................................................
18 s7   ..................................................
19 s11  ..................................................
20 s10  ..................................................
Figure 24.3: Aligning sequences for a consensus.
The next step in this process is to realize that an assembly contains several contigs,
and these do not overlap. For the purposes of scoring the assembly a single long string is
created from all of the contigs: the non-overlapping contigs are concatenated into sq by
the function CatSeq called in Code 24.25. The example creates a single string from the
assembly generated in Code 24.23.
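The concatenation step can be sketched as follows; Consensus is an assumed helper that returns the consensus string for a single contig, and the actual CatSeq in Code 24.25 may differ.

# A sketch of the catsequence construction.
def CatSeq( smb ):
    sq = ''
    for contig in smb:
        sq = sq + Consensus( contig )   # one consensus string per contig
    return sq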
A gene is merely an ordering of the sequence pairs used to create an assembly. Code
24.26 creates an instance of a GA class and uses the function InitGA to create random
arrangements of the sequence pairs. In this example each gene is a list of the numbers from
0 to 89 in a random arrangement.
The cost of a gene is the length of the consensus sequence that it creates. Code 24.27
shows the function CostAllGenes which considers each gene in a for loop. Each gene
creates an assembly smb which in turn creates a catsequence cseq. The cost of this
sequence is its length. In this example there are 10 genes and the costs they generated are
shown.
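A hedged sketch of CostAllGenes, following the behavior just described (the actual Code 24.27 may differ in detail):

# Each gene builds an assembly; its cost is the catsequence length.
def CostAllGenes( genes, hits, chops, ids, L ):
    cost = np.zeros( len( genes ) )
    for i in range( len( genes ) ):
        smb = Gene2Assembly( genes[i], hits, chops, ids, L )
        cost[i] = len( CatSeq( smb ) )
    return cost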
The crossover function is not changed from the original GA program. Code 24.28
Code 24.26 The InitGA function.
1 # nongreedy.py
2 def InitGA( pairs, Ngenes ):
3     # pairs from BestPairs
4     # Ngenes = desired number of GA genes
5     work = np.arange( len( pairs ) )
6     genes = []
7     for i in range( Ngenes ):
8         np.random.shuffle( work )
9         genes.append( copy.deepcopy( work ) )
10    return genes
shows the calls to create new genes and to compute their cost. The problem with the new
genes is that they may not contain all of the pairings and they may contain two copies of
other pairings.
1 >>> import ga
2 >>> kids = ga.CrossOver( folks, fcost )
3 >>> kcost = ngd.CostAllGenes( kids, hits, chops, ids, L )
4 >>> kcost
5 array([ 238., 173., 173., 162., 162., 177., 188., 235., 160., 185.])
A gene should contain each pair of sequences from the original list, and these new
genes are not yet correct. The function FixGene considers a gene, finds those elements
that are duplicates, and replaces one of the duplicates with an element that is missing.
Each child gene is processed and thus it is necessary to have a loop to process all children,
as shown in Code 24.29. Now all GA genes have only one instance of each
pairing. The cost of the children can now be computed.
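A possible FixGene, assuming the repair just described (the book's version may proceed differently):

# Replace duplicated pair indices with the missing ones so that each
# pairing appears exactly once in the gene.
def FixGene( gene, allpairs ):
    missing = list( set( allpairs ) - set( gene ) )
    seen = set()
    for i in range( len( gene ) ):
        if gene[i] in seen:
            gene[i] = missing.pop()   # swap a duplicate for a missing pairing
        else:
            seen.add( gene[i] )
    return gene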
The mutation stage uses a function named SwapMutate that swaps elements
within the GA genes, much like in the alphabet program above.
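A minimal sketch of SwapMutate under that description; the semantics of the mutation rate (here, the probability that a gene has one pair of elements swapped) are an assumption.

# Randomly swap two elements within a gene.
def SwapMutate( genes, rate ):
    for g in genes:
        if np.random.rand() < rate:
            i, j = np.random.randint( 0, len( g ), 2 )
            g[i], g[j] = g[j], g[i]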
All of the parts are now in place to perform the GA. Code 24.30 shows the function
RunGA which drives the GA process. It settles rather quickly on an assembly that
creates a consensus sequence with a length of 120.

The results of the non-greedy test are compared to the greedy approach. Code 24.31
shows the steps used to create a greedy consensus. The length of the greedy consensus is
267 while the length of the non-greedy consensus is only 120. Obviously, the non-greedy
approach significantly outperformed the greedy approach. The cost of this improvement,
though, is that non-greedy approaches are usually computationally expensive.
Code 24.30 The RunGA function.
1 # nongreedy.py
2 def RunGA( hits, seqs, seqnames, L ):
3     NH = len( hits )
4     folks = InitGA( hits, 10 )
5     fcost = CostAllGenes( folks, hits, seqs, seqnames, L )
6     print( fcost.min(), fcost.argmin() )
7     for i in range( 10 ):
8         kids = ga.CrossOver( folks, fcost )
9         for j in range( len( kids ) ):
10            kids[j] = FixGene( kids[j], np.arange( NH ) )
11        kcost = CostAllGenes( kids, hits, seqs, seqnames, L )
12        ga.Feud( folks, kids, fcost, kcost )
13        SwapMutate( folks, 0.03 )
14        fcost = CostAllGenes( folks, hits, seqs, seqnames, L )
15    print( fcost.min(), fcost.argmin() )
16    return folks[ fcost.argmin() ]
24.3.4 Improvements
The non-greedy approach presented is still not the best system and does have a flaw.
Consider the sequences:
S1 = abcdef
S2 = defghi
S3 = jkldef
It is quite possible to align S1 with S2 and then S2 with S3. In doing so the following
assembly is created:
abcdef
...defghi
jkldef
In this assembly S1 and S3 do not align all that well. Such problems are likely
to occur when building an assembly from pairs of sequences. An improvement to the
GA program would be to prevent such poor secondary alignments from occurring or to
increase the cost of the assembly if there is a poor consensus.
It is important to note that there is no set method of creating a non-greedy algorithm.
The GA is only one method and as seen it could be modified to behave differently. The
main purpose of the non-greedy approach is to create a system that scores the entire
assembly rather than finding the best matches within it.
24.4 Summary
The previous chapter aligned two sequences. However, many applications require the
alignment of more than two sequences. Multiple sequence alignment can be performed
through two differing philosophies. The first is a greedy approach in which the assembly
is constructed by adding pairs of sequences according to their pair alignment scores. The
non-greedy approach attempts to find the best overall assembly by using machine learning
techniques. This approach does not consider the alignments according to their pairing
scores but rather attempts to optimize the entire alignment. The latter approach is much
more expensive but can provide better results.
Problems
1. Run the greedy assembly with a threshold that is 90% of the maximum value in M.
Interpret the results.
2. Apply the greedy algorithm to English text. Chop written text up into many subsequences
and then assemble using the greedy approach. Is this assembly similar to
the original?
4. Measure the scale-up effect on computation time. For strings of different sizes com-
pute the assembly and measure the time of computation. Plot the computational
time versus the size of the original data string.
5. Modify the greedy algorithm to handle sequences and their complements. The pro-
gram should note that if a string is used in making a contig then it and its comple-
ment should be removed from further consideration.
6. Is it possible to have a consensus sequence that is shorter than the original sequence?
In this case the original data is completely represented in the sequence segments used
as inputs. Consider a case in which the original sequence has a repeating segment
and this repeating region is longer than the cut length used when chopping up
the original sequence.
Chapter 25
Trees
Trees are a very effective method of organizing data and coursing through data to
find relationships. This chapter will review a few types of trees but again is not an
exhaustive study.
25.1 Dictionary
The dictionary in a word processor does not search the entire English dictionary every
time a new word is typed. That would be a horrendously inefficient process. One approach
is to build a search tree to speed up the spell-checking process.
The tree is a simple design where there are two basic types of nodes. One type is
an intermediate node which is a letter that is not at the end of a word and the second is
a terminal node which represents the end of a word although not necessarily the end of a
tree branch.
A simple example is to build a tree from the following words:
CAT
CART
COB
COBBLER
These four words are organized in a tree search as shown in Figure 25.1. The shaded
nodes are those which hold the last letter of a word.
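A minimal sketch of such a tree uses nested dictionaries; the '$' key is an assumed marker for a terminal node (the end of a word).

# Build the tree and test membership.
def AddWord( tree, word ):
    node = tree
    for c in word:
        node = node.setdefault( c, {} )   # intermediate nodes
    node['$'] = True                      # terminal node

def IsWord( tree, word ):
    node = tree
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return '$' in node

tree = {}
for w in ( 'CAT', 'CART', 'COB', 'COBBLER' ):
    AddWord( tree, w )
print( IsWord( tree, 'COB' ), IsWord( tree, 'CAR' ) )   # True False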
Figure 25.1: The dictionary tree for the four words.
25.2 Sorting
Given a vector of numbers, the search for the maximum value can easily be performed as
shown in Code 25.1. Line 2 creates the data and line 3 sets the variable mx to the first
value in the vector. In the for loop each value is compared to that of mx. If the considered
value is greater than mx then mx takes on this new value, as shown in Line 6. Of course,
the numpy package already provides a max function, as shown in Line 9.
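Code 25.1 is not reproduced here, but the search it performs can be paraphrased as:

import numpy as np
a = np.random.ranf( 10 )    # the data
mx = a[0]                   # start with the first value
for v in a:
    if v > mx:
        mx = v              # keep the larger value
print( mx, a.max() )        # numpy's max gives the same answer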
The next code example demonstrates the numpy sorting functions; the data vector is
shown starting in line 3. The argsort command is applied in line 5. This returns the
indexes in a sorted order. In this case the first index that is returned is 3 and thus
a[3] is the lowest value in the vector.

Line 8 shows the command to display all of the data in the sorted order. Line 11
shows the sort command, which actually rearranges the data in the vector. The original
ordering of the data is destroyed by this command.
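A short illustration of the difference between argsort and sort (the values here are arbitrary):

import numpy as np
a = np.array( [0.7, 0.9, 0.5, 0.1] )
ag = a.argsort()
print( ag )        # [3 2 0 1] : a[3] is the lowest value
print( a[ag] )     # the data in sorted order
a.sort()           # rearranges a itself; the original ordering is lost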
25.3 Linked Lists

Moving data about in computer memory is expensive for large amounts of data. Thus,
the concept of a linked list is used to sort the data without moving the data.
In the linked list concept each piece of data also contains an identification and a
link. This is shown in Figure 25.2. In this case there are four pieces of data and for the
example the IDs are 1, 2, 3 and 4 respectively. However, the data is not in a sorted order.
In this example, the last piece of data has the lowest value and the first piece of data has
the next lowest value. Instead of moving the last piece of data the link is changed to point
to the first piece of data.
A different example is shown in Figure 25.3. Initially, there are three pieces of data
and they are sorted. Now, a fourth piece of data is added. It is placed at the end of the
data where there is empty memory in the computer. The links are then rearranged as
shown in the lower portion of the image thus indicating the sort order without actually
moving the data.
There are multiple manners in which a linked list can be created in Python. One
approach is to use a dictionary as shown in Code 25.3. An empty dictionary is created in
line 1 and the first data item is placed in line 2. In this scheme the ID is the key in the
dictionary and the list contains the data value and the link. In this case the link is -1,
indicating that it is not linked to any other data.
1 >>> dct = {}
2 >>> dct[0] = [0.18, -1]
3 >>> dct[1] = [0.35, -1]
4 >>> dct[0][1] = 1
5 >>> dct[2] = [0.2, -1]
6 >>> dct[0][1] = 2
7 >>> dct[2][1] = 1
A second piece of data is created in line 3 and it is also not linked to any other piece
of data. For the data to be in sort order then the first data needs to link to the second
and so in line 4 the link of the first item is changed to the ID of the second item.
A third item is created in line 5 and it is to be inserted between the previous two
items. So, its link and the item that links to it are modified in the final two lines. This is
shown in Figure 25.4.
Once the data is in a linked list then the recall of the data is simple. Code 25.4 starts
with creating an empty list named answ which will collect the data in a sorted order. Line
2 creates the integer k which will keep track of the location in the list. It is initially set to
the first item in the linked list. This is not necessarily the first item in the dictionary. In
the case of sorting data this is the item in the dictionary with the lowest data value. In
the case of Figure 25.4 k=0.
Figure 25.4: A linked list.
1 >>> answ = []
2 >>> k = 0
3 >>> while k != -1:
4         d, k = dct[k]
5         print( d, k )
6         answ.append( d )
7
8 0.18 2
9 0.2 1
10 0.35 -1
11 >>> answ
12 [0.18, 0.2, 0.35]
The while loop extracts each piece of data. Line 4 retrieves the data and the link
to the next item. These are printed to the console. Line 6 places the retrieved data into
the answer list. The process continues until the last item is found which will have a link
of -1.
25.4 Binary Trees

A binary tree is similar to a linked list except that every node has two links. An example
is shown in Figure 25.5. The flow starts at the top and each parent node has up to two
child nodes. A node without any children is called a terminal node.
Binary trees are used for several different applications. The example used here is
that the tree is used to sort the data. As seen in Figure 25.5 every child node has a data
value larger than its parent. When a new node is added it is attached at any open child
location. Then the process moves the node upwards according to the rule that all parents
must have a lower data value than their children.
Consider the case in Figure 25.6 in which nodes V1 and V4 violate this rule. The
procedure is for V1 and V4 to swap positions in the tree. The result is shown in Figure
25.7. This process continues moving V4 upwards until the parent/child rule is no longer
violated.

Figure 25.5: A binary tree.
Figure 25.6: A tree for sorting with incorrect positions of V1 and V4.
The swapping process looks easy, but it does involve several other nodes. Figure
25.8 shows the same tree but highlights all of the links that need to be adjusted when
swapping V1 and V4.
After all of the data is in the tree then the next step is to remove nodes such that
the data is in order from lowest value to highest. If the parent/child rule is obeyed then
the node with the lowest data value must be at the top of the tree.
The data from this node is placed into an answer list and this node is then removed
from the tree. One of the two children must be raised up to replace this node. The child
with the lowest data value is chosen and moved up to replace the parent. This is shown
in Figure 25.9.
This leaves an empty slot and one of the children, V1 or V3, must move up into the
empty slot. The child with the lowest data value is chosen and moved upwards. In this
case that is V1. The result is shown in Figure 25.10, but as seen this leaves a new hole in
the tree.

Figure 25.7: A tree for sorting.

Figure 25.9: Removal of the first node.
The steps of replacing a hole are repeated until a terminal node is reached. The
result is shown in Figure 25.11.
After this is completed then the new top node is removed and the data is placed
into the answer list. Again this leaves a hole at the top and the process of moving nodes
upwards to replace holes is repeated. The removal of the top node and hole-filling is
repeated until the tree is empty. The answer list will contain all of the data in a sorted
order.
While this process is a little more complicated than brute-force searches, it is significantly
faster for large data. Consider the case where there are 1,000,000 pieces of data.
To find the minimum value a program would need to search the entire list of data. Thus,
the loop would have 1,000,000 comparisons. That only finds the first minimum. To
sort the data this process is repeated 1,000,000 times, except that each time it is
repeated the size of the data set is slightly smaller. So, the total number of comparisons
is 1,000,000 × 1,000,000/2 = 5 × 10¹¹.
Figure 25.11: The completion of replacing a hole.
Now consider the tree search with 20 layers. Each time a node is added it could take
up to 20 swaps to properly place it in the tree, although on average the number
of swaps would be less than 10. The same is true for the process of removing a node. So,
each node is responsible for roughly 20 swaps (or comparisons). Since there are 1,000,000
nodes the adding and removal process is repeated that many times, and the
sorting process using a tree requires roughly 1,000,000 × 20 = 2 × 10⁷ comparisons. That
is significantly fewer than the brute-force method requires.
Creating Python code for a binary tree is almost the same as a linked list. Code
25.5 shows the same concept of using a dictionary except that each node has two possible
links. Since this is the only node in the tree both links are -1.
1 >>> tree = {}
2 >>> tree[0] = [0.4, -1, -1]
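A hedged continuation of this sketch adds a child and enforces the parent/child rule by swapping the data values when it is violated:

tree[1] = [0.2, -1, -1]   # a new node with a smaller value
tree[0][1] = 1            # attach it as a child of node 0
if tree[1][0] < tree[0][0]:
    # the parent must hold the lower value: swap 0.4 and 0.2
    tree[0][0], tree[1][0] = tree[1][0], tree[0][0]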
25.5 UPGMA
The UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm builds
a simple tree by continually finding best pair matches and replacing them with a parent
node. Consider a case shown in Code 25.6 which generates six vectors, each of length 10,
containing random values. The matrix M shows the cost values of every possible pair. The
difference value is subtracted from the max value thus creating a cost such that a lower
cost is a better match. Only the lower left portion of the matrix needs to be computed.
In this example the best score is 8.5 and belongs to column 1 and row 5. Therefore,
the best matching data vectors are data[1] and data[5]. The UPGMA creates a small
tree from these two data vectors as shown in Figure 25.13 where each data vector is
represented by V. At the top of this tree is V7 which is not part of the original data.
This is an artificial data vector created from the average of V1 and V5. This new data
vector is added to the other data vectors and V1 and V5 are both removed from further
consideration.
This maneuver will require that M be of a bigger size. In fact, in the UPGMA
algorithm the size of M is (2N − 1) × (2N − 1), where N is the number of original data
vectors.
Code 25.6 Creating data.
1 >>> from numpy import random, zeros
2 >>> data = random.ranf( (6,10) )
3 >>> M = zeros( (6,6), float )
4 >>> for i in range( 6 ):
5 for j in range( i ):
6 M[i,j] = 10 - (abs( data[i]-data[j])).sum()
7
8 >>> M
9 array([[ 0. , 0. , 0. , 0. , 0. , 0. ],
10 [ 7.2, 0. , 0. , 0. , 0. , 0. ],
11 [ 6.1, 5.4, 0. , 0. , 0. , 0. ],
12 [ 5.6, 5.5, 6.9, 0. , 0. , 0. ],
13 [ 6.0, 6.1, 7.0, 7.4, 0. , 0. ],
14 [ 6.8, 8.5, 6.3, 6.3, 7.0, 0. ]])
Code 25.7 initializes M to this new size and fills it with the scores as in Code
25.6. The maximum value is located and returned as the location v, h. Furthermore, there
will be a need for (2N − 1) data vectors, and these are established as vecs.
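A sketch of that initialization, under the description above (the book's Code 25.7 may differ), is:

# For N=6 vectors the matrix is 11x11 and there is room for 2N-1 vectors.
N = 6
N2 = 2*N - 1
M = zeros( (N2,N2), float )
for i in range( N ):
    for j in range( i ):
        M[i,j] = 10 - (abs( data[i]-data[j] )).sum()
v, h = divmod( M.argmax(), N2 )
vecs = list( data ) + [0]*(N-1)   # slots for the artificial average vectors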
Figure 25.13 requires that a new vector, vecs[7], be created which is shown in Code
25.8. The loop in Lines 3 and 4 computes the score of this new vector to the others. Lines
5-8 eliminate all rows and columns that are associated with vecs[1] and vecs[5]. The
variable last keeps track of the last known vector in the list and it increments with each
new vector.
The next iteration finds the new largest value in M and repeats the process. In this
example vecs[3] and vecs[4] generate the best match and so a new tree is created with
these two. This is shown in Figure 25.14. On the third iteration the best match is between
Code 25.8 Altering M after the creation of a new vector.
1 >>> last = 7
2 >>> vecs[last] = (vecs[v] + vecs[h])/2.0
3 >>> for i in range( last ):
4 M[last,i] = 10 - (abs( vecs[i]-vecs[last])).sum()
5 >>> M[v] = zeros(11)
6 >>> M[h] = zeros(11)
7 >>> M[:,v] = zeros(11)
8 >>> M[:,h] = zeros(11)
9 >>> last += 1
vecs[2] and vecs[8], however, vecs[8] is already in a tree. Thus, vecs[2] is attached
to the existing tree creating vecs[9]. This is shown in Figure 25.15. The final type of
iteration is one in which both of the vectors exist in different trees. In this case the two
trees are joined together as shown in Figure 25.16.
The UPGMA function is shown in Code 25.9. The input indata is a list of
the data vectors (not a matrix). The scmat is the matrix that contains the pairwise scores
(similar to the M matrix in the previous examples). The dictionary tree collects the nodes as
they are computed. The list used collects the names of data vectors after they have been
used, to prevent the re-use of these vectors. In the loop starting on Line 12 the best match
is found in Line 13, which returns the location in scmat where the best match occurs. It is
added to tree and the average of the two constituent vectors is computed in Line 16. The
loop starting on Line 18 computes the similarity of the new vector with the previous vectors,
but only if they have not been previously used. The final commands remove the comparison
scores for the two vectors that are being removed from further consideration.

Figure 25.16: The third iteration.
In the example at the end, six random data vectors of length 10 are used as inputs.
The tree is computed and printed. Recall that the tree is a dictionary and the data of the
dictionary contains the two children and the score. The tree produced by this system is
shown in Figure 25.17.
Code 25.9 The UPGMA function.
1 # upgma.py
2 def UPGMA( indata ):
3 data = copy.deepcopy( indata )
4 N = len( data ) # number of data vectors
5 N2, BIG = 2*N-1, 999999
6 scmat = np.zeros( (N2,N2), float ) + BIG
7 # initial pairwise comparisons
8 for i in range( N ):
9 for j in range( i ):
10 scmat[i,j] = (abs( data[i]-data[j] )).sum()
11 tree, used = {}, []
12 for i in range( N-1 ):
13 v,h = divmod( scmat.argmin(), N2 )
14 tree[N+i] = (v, h, scmat.min() )
15 used.append( v ); used.append( h )
16 avg = ( data[v] + data[h])/2.
17 data.append( avg )
18 for j in range( N+i ):
19 if j not in used:
20 scmat[N+i,j] = (abs( avg-data[j] )).sum()
21 scmat[v] = np.zeros( N2 ) +BIG
22 scmat[h] = np.zeros( N2 ) +BIG
23 scmat[:,v] = np.zeros( N2 ) +BIG
24 scmat[:,h] = np.zeros( N2 ) +BIG
25 return tree
26
25.6 Non-Binary Trees

Of course it is possible to have a non-binary tree. In some cases, such as the dictionary
tree shown in Figure 25.18, this is desired. The only real difference in the Python script is
the number of links that are allowed. In this case the link integer can be replaced with a list
of links that can grow or shrink depending on the number of links that a node has.
25.7 Decision Trees

A decision tree is used to sort through a decision that involves multiple components.
Consider the case of sorting through health information. In this case data from several
people is collected. Some of these people have a specified illness while the others do not.
The data collected can include factors such as smoking, drinking, living location, age,
diet, exercise, and genetics. Which of these factors contribute to the illness?
25.7.1 Theory
Consider just one of the factors such as sugar intake. Each person has a certain number of
grams of sugar they consume each day. The chart in Figure 25.19 shows the distribution
of healthy and sick patients versus their sugar intake. The x-axis is the amount of sugar.
The green line (on the right) shows the histogram of sick people versus their sugar intake.
The red line (on the left) shows the histogram of the healthy people.
As seen, the distributions are quite distinct and therefore the sugar intake is a good
indicator of whether a person is going to contract this particular disease. In this case
a vertical line can be drawn where the two curves intersect. This is the decision line and
is shown in Figure 25.20. If a new patient is seen and their sugar intake is measured, then
the decision to be made is basically whether they are left or right of this decision line. In this
case the decision line is at about x = 1.8. Now, this decision is not perfect as some people
with x < 1.8 have become sick and some people with x > 1.8 remain healthy.
This example is an ideal case and usually reality is more like the distribution shown
in Figure 25.21. A decision line can still be created but there will be a lot of people that
will be erroneously classified.
The bell curves, or Gaussian distributions, can be computed from the average and
standard deviation of the data. The height of the curve is

y = Ae^{−(x−µ)²/2σ²},    (25.1)
where µ is the average and σ is the standard deviation. For this case the amplitude, A,
is set to 1. The x is the input variable (location on the horizontal axis) and the y is the
output (height of the function). The µ is the horizontal location of the center of the curve,
and the σ is the half width at half the height.
Figure 25.20: A decision.
The crossover point occurs when both curves have the same y value for a given input
x. Thus,
e^{−(x−µ1)²/2σ1²} = e^{−(x−µ2)²/2σ2²},    (25.2)
where the subscripts 1 and 2 represent the two curves. The next step is to solve for x and
so the log of both sides becomes

−(x − µ1)²/2σ1² = −(x − µ2)²/2σ2²,    (25.3)

and each side is multiplied by −2; then the square root of both sides produces
(x − µ1)/σ1 = (x − µ2)/σ2.    (25.4)
Now it is possible to solve for x. However, there is an issue in that these two curves may
actually have two crossover points. Such a case is shown in Figure 25.22(a). So, the proper
equation is,
(x − µ1)/σ1 = ±(x − µ2)/σ2,    (25.5)
after noting that in the process of computing a square root it is possible to have two
solutions. Usually, the point that is used is the crossover point that lies in between
the two peaks.
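As a small worked sketch, the root of Equation (25.5) that lies between the two peaks comes from the minus sign, and solving for x gives a one-line function (the name Crossover is hypothetical):

# (x-mu1)/sg1 = -(x-mu2)/sg2  =>  x = (sg2*mu1 + sg1*mu2)/(sg1 + sg2)
def Crossover( mu1, sg1, mu2, sg2 ):
    return (sg2*mu1 + sg1*mu2)/(sg1 + sg2)

For equal standard deviations this reduces to the midpoint (µ1 + µ2)/2.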
A decision tree considers each of the variables, of which three examples are shown
in Figure 25.22. None of the variables divides the data nicely. However, the second
variable performs better than the others, so this variable is selected as the first node in
the decision tree.
The decision line is created and all of the data is sorted according to the decision
line. Of course, some of the data will be mis-classified. An example (from a different
problem) is shown in Figure 25.23. This node uses parameter (or factor) 4. The decision
line is at x = 0.52. The training data is sorted as shown.

Figure 25.21: Closer to reality.
Had this node been able to perfectly sort the data then all of the data on one side
would be classified as False (healthy) and all of the data on the other side would have
been classified as True (sick). As seen this node was not perfect.
The next step is to create child nodes based on the sorted data. The
child node on the left would only consider the data that was sorted to the left by this
initial node. The process continues until every node either has a child node or the data is
perfectly sorted, as shown in Figure 25.24.
25.7.2 Application
This section will walk through a demonstration of building and using a decision tree. In a
single tree there are multiple nodes which have attributes and functions. Therefore, there
is an advantage to creating the nodes as an object-oriented class. Furthermore, a real
problem could employ more than one tree and thus the tree is also constructed as a class.
First, though, it is important to generate a data set for this example problem.
25.7.2.1 Data
In order to generate usable data for a decision tree it is necessary that the data have some
structure. It is not possible to make a decision on purely random data.
Fake data is created in the FakeDtreeData function shown in Code 25.11. The
philosophy is that this is generating data for N patients and for each patient a set of
M parameters are measured. Each patient is classified as either sick or healthy (True or
False). The function receives the N and M parameters as arguments.
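The body of FakeDtreeData is not reproduced here; a plausible sketch that produces structured data in the same format (a tuple whose second item is the True/False class and whose third item is the parameter vector) is:

# Hypothetical data generator: sick patients get a shifted distribution.
def FakeDtreeData( N, M ):
    data = []
    for i in range( N ):
        sick = np.random.rand() < 0.5
        vec = np.random.ranf( M )
        if sick:
            vec = vec + 0.25    # an assumed offset; the real function differs
        data.append( (i, sick, vec) )
    return data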
Code 25.12 Using the FakeDtreeData function.
Storing the information in a list is not the most efficient method for computational
processing. So, the next step is to create two matrices. One matrix will contain the data
for sick patients and the other will store the data for healthy patients. Each row in a matrix
is one patient's M parameters. Since the number of patients in each class is not known in
advance, the data is collected in lists as shown in lines 1 through 6 of Code 25.12. The last
two lines then convert these lists into matrices.
1 >>> Ts, Fs = [], []
2 >>> for d in data:
3         if d[1]:
4             Ts.append( d[2] )
5         else:
6             Fs.append( d[2] )
7
8 >>> Ts = np.array( Ts )
9 >>> Fs = np.array( Fs )
The first step in creating the first node in the decision tree is to compute the ability of each
parameter to separate the sick patients from the healthy patients. This follows the process
shown in Section 25.7.1. For each of the M parameters the distributions are computed
and the intersections of the distribution curves are determined. The score is the ability of
a parameter to separate the two classes of patients.
The ScoreParam function computes the score for a single parameter. It is a rather
lengthy function and so it is not shown here. However, Code 25.14 shows the concepts of
the function. The call to the function receives the data and the parameter being tested.
Thus, this function will be called M times, once for each parameter.
The first step is to gather the statistics for that single parameter. These are the
average and standard deviation for that parameter for the sick and again for the healthy
patients. If the standard deviation is less than 0.1 then it is set to 0.1. Values that are
too small generally appear from small data sets and are not representative of the actual
data.
1 # decidetree.py
2 def ScoreParam( data, prm ):
3     # convert to vectors and get stats
4     ## avg and stdev of sick and healthy, stdev min = 0.01
5     # find crossover
6     # count the sicks on the left side
7     # count the healthy on the left side
8     return score, x
From the average and standard deviations the Gaussian distributions can be plotted
according to Equation (25.1). Line 5 indicates that the next step is to find the crossover
point which follows the discussion ending with Equation (25.5). Now, the node has the
crossover point and it is possible to separate the data vectors by sending each vector to
either the left or right branch of the node. The next step in this algorithm is to determine
the percentage of sick and healthy people that went to each side of the node. If this node
perfectly separated all of the patients into sick and healthy then it will produce a high
score of 1.0. This function returns that score and the crossover point x.
Two examples are shown. The first example computes the score for all of the data
for the first parameter. The score is 0.55 and the crossover value is x = 0.4. The second
example performs the same test on the first ten vectors only. Of course the score is closer
to 1 since it is easier to separate fewer data vectors. The crossover point, though, is almost
the same, which lends confidence that the process is behaving correctly.
This process is applied to all parameters and the one with the highest score is believed to
be the best at separating the healthy patients from the sick patients. It will become the
top node in the tree.
25.7.2.3 A Node
The tree will consist of several nodes and therefore there is justification for an object-
oriented approach. Each node will need to contain several values. It will need the pa-
rameter number (a value between 0 and M − 1) and the crossover value. These will be
stored as self.param and self.gamma. The node will also need to know which children
are connected to it. This is a binary tree and so the two possible branches are self.K1
and self.K2. The node will receive a list of sick and healthy vectors. These are stored as
self.G1 and self.G2.
The top node will be able to consider all of the parameters. The child node, however,
does not consider the parameters that were used by its ancestors. Thus, each node needs
a list of parameters that can be considered in creating the crossover value. This list of
indexes is stored in self.avail. Finally, the node will keep track of the identity of its
mother node as self.mom.
This class also has several functions, but only the function names and returns are
shown in Code 25.15 due to the size of the program. The constructor initializes all of
the parameters. The Set function receives the two matrices and a list of possible indexes,
which is usually list(range(M)). This function will then put the proper values into
the class variables.
Code 25.15 The variable and function names in the Node class.
1 # decidetree.py
2 class Node:
3     def __init__( self ):
4         self.param, self.gamma = -1, -1
5         self.K1, self.K2 = -1, -1
6         self.G1, self.G2 = [], []
7         self.avail = []
8         self.mom = -1
9     def Set( self, G1l, G2l, alist ):
10        ...
11    def Decide( self, G1vecs, G2vecs ):
12        ...
13    def Split( self, G1vecs, G2vecs ):
14        ...
15        return lg1, lg2, rg1, rg2
16    def __str__( self ):
17        ...
18        return s
The Decide function is used to determine which of the parameters from self.avail
best separates the given data. This function will set the variables self.param and
self.gamma.
The Split function will decide which data vectors will be sent to the left child or
the right child. This function returns four lists. The first two are the sick and healthy
patients that went to the left node and the other two are the sick and healthy patients
that went to the right node. These will be used in the construction of other nodes.
The final function is __str__, which is used by the print function to print information
about the node to the Python console.
The decision tree is created from several linked nodes. Since it is possible that a real
problem could have several trees, a new class is created. The tree consists of a list of
nodes which are stored as self.nodes. It also contains a list named self.next. When
a node is created it can create two children nodes which will have to be considered in
subsequent computations. As an example, the first node is nodes[0] and it creates two
children nodes[1] and nodes[2]. The program will then consider nodes[1] to compute
its crossover point and it will create nodes[3] and nodes[4] before nodes[2] has been
considered. So, this list contains the IDs of the nodes that have been created but have
not yet been processed to determine its crossover points. When a node is processed
to determine its internal values and children then it is removed from self.next. The
amount of data passed down to a child is about half of the data that the mother node
has. Eventually, the tree reaches a node that separates its subset of data perfectly and
no children are required for this node. Thus, the list self.next will grow as the initial
nodes are created and then shrink as the tree reaches the end nodes. When self.next is
empty the construction of the tree is complete.
There are several functions associated with the Tree class and the function names
are shown in Code 25.16. The SetDataVecs function receives the list of sick and healthy
data vectors. For this first node this is all of the data, but for the children nodes this
is only the data that is passed down from its mother. The Mother function determines
the parameters for the first node and returns the four lists that it will pass down to its
two children. The MakeKids function will make the two child nodes for a given mother.
It will determine the self.K1, self.K2, and self.mom but the other parameters will be
determined later. An example is shown in Code 25.17 which creates an instance of the
tree in line 2. It then provides the data that was generated and computes the mother
node.
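The calls in Code 25.17 might look like the following sketch (the variable names are assumptions):

t = Tree()                        # an instance of the tree
t.SetDataVecs( Ts, Fs )           # the sick and healthy matrices
lg1, lg2, rg1, rg2 = t.Mother()   # compute the mother node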
Code 25.18 displays some of the information from this first node. Currently, it is not
connected to children nodes (lines 1 through 4) and the identities of the sick and healthy
patients are contained in lists. In this data set there are 11 sick and 9 healthy patients. It
has been determined that the second parameter (number 1) best separates the data and that
the crossover point is x = 0.623.
The Iterate function gets the next node ID from self.next and then proceeds to determine
its crossover and parameter values. It then separates the data for this node's children
into the four lists. Finally, it calls MakeKids to make its children nodes. The function
MakeTree continually calls Iterate until the tree is completely built. Code 24.19 notwithstanding, Code 25.19 computes the first set of children, and now enough is in place to finish the tree
using the MakeTree function.
Code 25.16 The function names in the Tree class.
1 # decidetree.py
2 class Tree:
3     def __init__( self ):
4         self.nodes = {}   # empty dictionary: ID = key
5         self.next = []    # list of who to consider next
6     def SetDataVecs( self, Tvecs, Fvecs ):
7         self.Tvecs = Tvecs + 0
8         self.Fvecs = Fvecs + 0
9     def Mother( self ):
10        ...
11        return lg1, lg2, rg1, rg2
12    def MakeKids( self, me, lg1, lg2, rg1, rg2 ):
13        ...
14    def Iterate( self ):
15        me = self.next.pop( 0 )
16        self.nodes[me].Decide( self.Tvecs, self.Fvecs )
17        lg1, lg2, rg1, rg2 = self.nodes[me].Split( self.Tvecs, self.Fvecs )
18        self.MakeKids( me, lg1, lg2, rg1, rg2 )
19    def MakeTree( self ):
20        while len( self.next ) > 0:
21            self.Iterate()
22    def Trace( self, query ):
23        ...
24        return trc, nodes
Code 25.18 The information of the mother node.
1 >>> print( t.nodes[0].K1 )
2 -1
3 >>> print( t.nodes[0].K2 )
4 -1
5 >>> print( t.nodes[0].G1 )
6 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
7 >>> print( t.nodes[0].G2 )
8 [0, 1, 2, 3, 4, 5, 6, 7, 8]
9 >>> print( t.nodes[0].param )
10 1
11 >>> print( t.nodes[0].gamma )
12 0.623049932763
25.7.2.5 A Trace
The Trace function is used after the tree is built. This function receives an input data
vector which could be one from the patients used in building the tree or a new patient not
yet seen before. This function will start with the top node and determine if this patient
should go to the left or the right child. The process iterates down the tree until it reaches
an end node. The input is classified as the type of patients in the final branch of the trace.
It returns information about the path that it took going down the tree and the nodes that
it used.
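A hedged usage sketch of Trace, following that description:

trc, nodes = t.Trace( data[0][2] )   # the first patient's vector
print( trc )     # the left/right decisions taken down the tree
print( nodes )   # the IDs of the nodes that were visited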
Consider the information for the first patient, data[0]. Recall that this is a tuple
and that the third item is the patient's data. This is the vector data[0][2]. The mother
node determined that the parameter to use was parameter 1. For this patient that
measurement was 0.717 as seen in Code 25.20. The top node in the tree is t.nodes[0]
and it was determined that its crossover point was 0.623 (see Code 25.18). Since 0.717 >
0.623 the decision is to send this patient to the right child (K2). This is node number 2
and as seen this node uses parameter 2 with a crossover point of 0.189.
The process continues. At each node the decision is made as to whether to send the
patient to the left or right. Eventually the process comes to an end node. This is shown
in Code 25.21. In this case the decision from node 2 leads to node 6. This node does not
have any children as denoted by the -1 values for param and gamma. That means that this
node has perfectly separated the data that was given to it. Thus, the end of the tree has
been reached.
Code 25.20 Comparing the patient to the first node.
The entire process is captured in the Trace function and the call is shown in Code
25.22. The input is the vector from the first patient. The trace shows that the decisions
were to go right, right, and right. In this case, the nodes were 0, 2 and 6. The classification
of nodes[6] is used as the classification of the patient. Line 4 prints the information for
the last node in the trace. The list nodes[6].G1 has 11 entries but the list nodes[6].G2 is
empty. Thus, all of the data that reached this node came from sick patients, and the input
patient is classified as sick. In this case, the diagnosis of the patient is known. This is printed
in lines 10 and 11. The value of True indicates that the patient was sick and so the tree
classified the patient correctly.
At the end of the decidetree.py module there is a function named Example which
shows the steps for the entire process of generating the data, building the tree, and then
running a trace. The input is a seed for the random number generator. If the seed is 20236
then the above results are replicated. Other seed numbers will generate other patients.
Projects
1. Create a list that contains all of the sentences from the play Romeo and Juliet. Each
item in this list is one sentence from the play. Using a linked list, sort the sentences
from shortest to longest.
2. In this project a decision tree is created from the two bacterial genomes. For each
genome create a list of codon frequencies for all genes of sufficient length (more than
128 nucleotides). Declare one of the genomes to be class 1 and the other to be class
0. Using 90% of the vectors from each list create this decision tree. Use the other
10% for testing. Determine the percentage of the testing vectors that the tree can
correctly classify.
Chapter 26
Clustering
Given a set of data vectors X = {x1, x2, ..., xN}, the object is to group the vectors such
that each group contains only those vectors that are similar to each other. The measure of
similarity is defined by the user for each particular application. The number of clusters can
either be fixed or dynamic depending on the algorithm chosen. The result of the algorithm
will be a set of groups, and the constituents of each group are a set of self-similar vectors.
Code 26.1 creates a simple function named CData that generates random data for
clustering. Purely random data would be inappropriate for clustering so this algorithm
generates a small number of random seeds, and then it generates data vectors that are
random deviations from these seeds. In this fashion some of the vectors should be related
to each other through a common seed. These vectors, therefore, should find a reason to
cluster. The variable N is the number of vectors to be generated and the L is the length
of the vectors.
Code 26.2 presents a simple algorithm for comparing one vector to a set of vectors.
The comparison is performed by the absolute subtraction,
s = Σi ‖t − di‖,    (26.1)
Code 26.1 The CData function.
1 # clustering.py
2 def CData( N, L, scale=1, K=-1 ):
3     if K == -1:
4         K = int( np.random.rand()*N/20 + N/20 )
5     seeds = np.random.ranf( (K,L) )
6     data = np.zeros( (N,L), float )
7     for i in range( N ):
8         pick = int( np.random.ranf()*K )
9         data[i] = seeds[pick] + scale*(0.5*np.random.ranf( L ) - 0.25)
10    return data
where the vector t is the target and di is the i-th data vector. In Code 26.2, diffs is a
matrix that contains the subtraction of the target vector from all of the vectors in vecs.
This command looks a bit odd in that the two arguments of the subtraction do not have
the same dimensions. Python understands this predicament and performs the subtraction
of the target vector with each row of vecs. The result is diffs, which has the same dimensions
as vecs. The sum command only sums along axis #1, which is the second dimension in
diffs.
1 # clustering.py
2 def CompareVecs( target, vecs ):
3     N = len( vecs )
4     diffs = abs( target - vecs )
5     scores = diffs.sum( 1 )
6     return scores
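A usage sketch, with the data coming from the CData function above:

data = CData( 100, 10 )
scores = CompareVecs( data[0], data )
print( scores.min() )   # 0.0 : the target matches itself perfectly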
The executed command in Code 26.2 computes the comparison of the first vector
with the entire data set. A perfect match of a vector with the target will produce a score of
0. Code 26.3 sorts the scores and creates a plotting file that is shown in Figure 26.1. The
argsort function returns an array of indexes for the data sorted from lowest to highest.
Thus, scores[ag[0]] is the lowest score and scores[ag[-1]] is the highest score. The
ag is an array and it is used as an index in scores[ag]. This will extract the values of
scores according to the indexes of ag.
This plot is typical for this data even when using a different vector as the target. There are a few
vectors that are similar to the target and many that are dissimilar. There seems to be a
sharp differentiation in the range 2 < γ < 3. Thus, the threshold is chosen to be 2.5 so that any
score less than the threshold is considered to be a good match.
As a control experiment a simple greedy algorithm is created. One vector is chosen
as the target and all of the vectors that are close to it (scoring below the threshold value)
are collected as a single group. Vectors that belong to a group are not considered for
further grouping. This program has an obvious problem in that a vector may not end up in
the best group. Consider a case in which vector C is similar to vector A and very similar
to vector B. If A is chosen as the first target then C would be chosen to belong to that
group, thus preventing C from joining the B group for which it was better suited. This
algorithm is merely a control algorithm to which better algorithms can be compared.
Code 26.4 displays the simple function CheapClustering for clustering data by
this greedy method. The data is converted to the list work to take advantage of some of
the properties of lists. The pop function removes a vector from the list, and thus target
becomes this vector and it no longer exists in work. The nonzero function will return a
tuple containing the indexes of those scores that are less than the threshold, and the [::-1]
in Line 9 reverses the indexes so that the largest is first. The list group started in Line 10
collects the vectors that are deemed to be similar to the target in Lines 11 and 13. Once
the group is collected it is appended to the list of clusters in Line 14.
1 # clustering.py
2 def CheapClustering( vecs, gamma ):
3     clusters = []    # collect the clusters here
4     ok = 1
5     work = list( vecs )    # a copy of the data that can be destroyed
6     while ok:
7         target = work.pop( 0 )
8         scores = CompareVecs( target, work )
9         nz = np.nonzero( np.less( scores, gamma ) )[0][::-1]
10        group = []
11        group.append( target )
12        for i in nz:
13            group.append( work.pop( i ) )
14        clusters.append( group )
15        if len( work ) == 1:
16            clusters.append( [ work.pop( 0 ) ] )
17        if len( work ) == 0:
18            ok = 0
19    return clusters
The ordering of nz from highest to lowest is necessary for the loop starting in Line 12.
Consider a case in which the ordering is from lowest to highest and in this example vectors
2 and 4 are deemed close to the target. The pop function on Line 13 would remove vector
2 from the list. In doing this, vector 4 would become vector 3 and in the next iteration the
removal of vector 4 (which would be the next item in nz) would remove the wrong vector.
By considering the vectors from highest to lowest this problem is averted.
In this particular experiment 6 clusters were created and they contained the following
number of members (26, 21, 19, 11, 21, and 2). These clusters will be compared to the
k-means clusters generated in the next section. A good cluster would collect vectors that
are similar, and thus a single cluster should have a small cluster variance as measured by

ω_k = (1/N_k) Σ_j σ²_{k,j},    (26.2)

where σ²_{k,j} is the variance of the j-th element of the k-th cluster, and N_k is the number
of vectors in the k-th cluster. For each cluster the variances of the vector elements are
computed and summed. This scalar measures the variance of the vectors in a cluster. For
the example case the variances of the 6 clusters are shown in Code 26.5.
26.2 k-Means Clustering

The k-means clustering algorithm is an extremely popular and simple algorithm. The user
defines the number of clusters, K, and a method by which these clusters are seeded. The
algorithm will then perform several iterations until the clusters do not change. Each
iteration consists of two steps. The first is to assign each vector to a cluster thus creating
the cluster’s constituents. The second is to compute the average of each cluster. If a
vector is determined to belong to a different cluster then it changes the constituency of
the clusters and thus in the next iteration the averages will be different. If the averages are
different then other vectors may shift to new clusters. The process iterates until vectors
do not change from one cluster to another.
The steps are:
1. Initialize K clusters.
2. Iterate until there is no change:
(a) Assign vectors to clusters
(b) Compute the average of each cluster
(c) Compare the previous clusters to the new clusters. If there is no change between
the two sets then set the STOP condition.
Each cluster is constructed from an initial seed vector. This vector can be a random
vector, one of the data vectors, or some other method as defined by the user. Usually, the
measure of similarity between a vector and a cluster average is a simple distance measure,
but again the user has the opportunity to alter this if an application needs a different
measure.
Code 26.6 displays two possible initialization functions. The function Init1 receives
the number of clusters and the length of the vectors and simply generates random vectors. The
problem with this approach is that there is no guarantee that a cluster will collect any
constituents. The function Init2 randomly selects one of the data vectors as a seed for
each cluster. It generates a list of indexes and shuffles them in a random order. The first
K indexes of this shuffled order are used as the seed vectors. In this function the take
function contains two arguments. The first is a list of indexes to be taken. The second
is the axis argument and this forces the take function to extract row vectors from data
instead of scalars.
Once an initial set of clusters is generated the next step is to assign each vector to
a cluster. This assignment is based on the closest Euclidean distance from the vector to
each cluster. Code 26.7 displays the AssignMembership function that computes these
assignments. In this function mmb is a list that collects the constituents for each
cluster and it contains K lists. The mmb[0] is a list of the members of the first cluster. This
list contains the vector identities; thus if mmb[0] = [0,4,7] then data[0], data[4] and
data[7] are the members of the first cluster. There are two for loops in this function.
The first initializes mmb and the second performs the comparisons and assigns each vector
to a cluster. In the second loop the score for each cluster is contained in the vector sc
and mn indicates which cluster has the best score.
Code 26.7 The AssignMembership function.
1 # kmeans.py
2 # Decide which cluster each vector belongs to.
3 def AssignMembership( clusts, data ):
4     NC = len( clusts )
5     mmb = []
6     for i in range( NC ):
7         mmb.append( [] )
8     for i in range( len( data ) ):
9         sc = zeros( NC )
10        for j in range( NC ):
11            sc[j] = sqrt( ((clusts[j]-data[i])**2).sum() )
12        mn = sc.argmin()
13        mmb[mn].append( i )
14    return mmb
The next major step is that each cluster needs to be recomputed as the average of
all of its constituents. Thus, if mmb[0] = [0,4,7] then clust[0] will become the average
of the three vectors data[0], data[4] and data[7]. Code 26.8
displays this function as ClusterAverage. On line 7 vecs is the set of vectors for the
i-th cluster. Recall that vecs is actually a matrix where the rows are the data vectors.
Thus, the k-th element of the average vector is the average of the k-th column of vecs.
The mean function on line 8 uses the 0 as an argument to compute the average of the
columns of the matrix.
1 # kmeans.py
2 def ClusterAverage( mmb, data ):
3     K = len( mmb )
4     N = len( data[0] )
5     clusts = zeros( (K,N), float )
6     for i in range( K ):
7         vecs = data.take( mmb[i], 0 )
8         clusts[i] = vecs.mean( 0 )
9     return clusts
These are the major functions necessary for k-means clustering. The next step is to
create the iterations. Code 26.9 demonstrates the entire k-means algorithm. The initial
cluster clust1 is created on line 4. The ok flag set in line 5 is used to control the loop
in line 6. When ok is False then the loop will terminate. Line 7 places each vector in a
cluster and Line 8 computes the average of the clusters. Line 9 computes the difference
between the current cluster and the previous cluster. If there is no difference then line 11
will set the ok flag to False. Line 13 replaces the old cluster with the current cluster in
389
preparation for the next iteration or the return statement in line 14.
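Code 26.9 is not reproduced here, but a sketch of the loop it describes (with an assumed KMeans wrapper and an assumed Init2 argument order) is:

def KMeans( data, K ):
    clust1 = Init2( data, K )                    # initial clusters
    ok = True
    while ok:
        mmb = AssignMembership( clust1, data )   # place vectors in clusters
        clust2 = ClusterAverage( mmb, data )     # compute new averages
        if ( abs( clust1 - clust2 ) ).sum() == 0:
            ok = False                           # no change: stop iterating
        clust1 = clust2 + 0                      # replace old with current
    return clust1, mmb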
Code 26.10 displays an example using the same data and same number of clusters
from the previous section. The variances of these clusters are printed in lines 11 through 16.
These variances are on the whole smaller than those from the greedy algorithm indicating
that the members of these clusters are more closely related than in the previous case.
One of the clusters does have a higher variance than the other clusters. In the
k-means algorithm every vector will be assigned to a cluster. Even a vector that is not
similar to any other vector must be assigned to a cluster. Often this algorithm will end
up with one cluster that collects outliers and has a higher variance. The solution to this
is discussed in Section 26.4. However, it is important to first discuss how to solve more
difficult problems in Section 26.3.
26.3 The Swiss Roll

The Swiss roll problem is one in which the data is organized in a spiral. One thousand data
points are shown in Figure 26.2. The data is created by MakeRoll in Code 26.11, which
also displays the creation of the data.
Using ordinary k-means it is possible to cluster the data. Code 26.12 shows the
RunKMeans function which clusters the data using the k-means algorithm. The clusters
are initialized in line 3 and in line 5 through 7 the standard k-means protocol is followed.
Code 26.13 uses the GnuPlotFiles function which will create plot files suitable for
GnuPlot or a spreadsheet.
The results of the k-means clustering are shown in Figure 26.3. Each colored region
Code 26.10 A typical run of the k-means clustering algorithm.
11 0 0.014
12 1 0.020
13 2 0.018
14 3 0.021
15 4 0.019
16 5 0.0177
Code 26.11 The MakeRoll function.
1 # swissroll.py
2 def MakeRoll( N=1000 ):
3     data = np.zeros( (N,2), float )
4     for i in range( N ):
5         r = np.random.rand( 2 )
6         theta = 720*r[0]*np.pi/180
7         radius = r[0] + (r[1]-0.5)*0.2
8         x = radius*np.cos( theta )
9         y = radius*np.sin( theta )
10        data[i] = x, y
11    return data
represents the members of a cluster. As seen, members of one cluster are on two different
parts of the spiral arm. In these results the vectors that represent the clusters are not on the
bands. For example, the average of the first cluster is located at (0.58, -0.27). This is in
between the two sections of points denoted by the red diamonds.
This example illustrates one of the main problems that users encounter when applying
a machine learning algorithm to data. It is essential to understand the nature of the
problem so that the algorithm can be used properly. If, in this case, the user wishes to
have clusters restricted to a single arm of the spiral then it is necessary to adjust the
algorithm. There are two possible avenues by which this can be accomplished. The first
is to represent the data in a different coordinate system, and the second is to modify the
k-means algorithm.
Knowing that the data is in some sort of spiral is evidence that a different representation
of the data is warranted. Since the data is in a spiral, polar coordinates are a natural choice.
In other applications the data may need to be transformed by more involved mathematics.
Code 26.14 shows the function GoPolar which performs this translation via

$$ r = \sqrt{x^2 + y^2}, \qquad (26.3) $$

and

$$ \theta = \tan^{-1}\left(\frac{y}{x}\right). \qquad (26.4) $$
In this program the function atan2 is used instead of atan because atan2 is sensitive
to quadrants: its answer has a range of 360 degrees, whereas the atan function has a
range of only 180 degrees. The result is that each pdata[k] is the polar coordinates of each
data[k]. This function makes one small adjustment in that it multiplies the radius by a
factor of 10, which puts the radial and angular values on the same scale.
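Code 26.14 itself is not reproduced in this excerpt; a minimal sketch of the conversion it describes, with the factor of 10 on the radius mentioned above, might be:

    # swissroll.py (sketch): convert Cartesian data to polar coordinates.
    import numpy as np

    def GoPolar(data):
        N = len(data)
        pdata = np.zeros((N, 2), float)
        for k in range(N):
            x, y = data[k]
            r = np.sqrt(x*x + y*y)        # Equation (26.3)
            theta = np.arctan2(y, x)      # Equation (26.4), quadrant-aware
            pdata[k] = 10*r, theta        # scale the radius to match the angles
        return pdata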
The converted data is now clustered by the same k-means algorithm as shown in
Code 26.15. Note that the data sent to GnuPlotFiles is the Cartesian data and not the
polar data. This is necessary since the plot is in Cartesian coordinates. However, the
clusters are defined from the polar data. The results are shown in Figure 26.4. By simply
casting the data into a different coordinate space the clustering is significantly different
and in this case produces the desired result.
Figure 26.4: Clustering after converting data to radial polar coordinates.

Another approach is to realize that, in this case, the Euclidean distance between data
points is not the desired metric of similarity. The clusters should follow the trend of the
data, which is defined by the proximity of data points. Readers will see a spiral, but this
is merely an illusion created by the density of the data points. Thus, for this case, a better
metric is the geodesic distance between data points. Two points that are neighbors
have a distance measured by the Euclidean distance, but two points that are farther apart
measure their distance as the shortest path that connects through intermediate points.
Thus, if there are three points A, B, and C, the distance between A and C is the distance
from A to B plus the distance from B to C. The geodesic distance is the shortest path
that connects data points.
In order to accomplish this modification it is necessary to compute the shortest
distance between all possible pairs of points. The Floyd-Warshall[Cormen et al., 2000]
algorithm performs this task in very few steps. The algorithm contains three nested
for-loops, which in Python would run very slowly. So, the Python implementation uses
an outer addition that replaces two of the for-loops with a single array operation,
performing the update $d_{ij} \leftarrow \min(d_{ij}, d_{ik} + d_{kj})$ for all
$i$ and $j$ at once.
The FastFloyd function in Code 26.16 computes the shortest geodesic distance between
all pairs of points. Even this more efficient version of the Floyd-Warshall algorithm can
take a bit of time, and the print statement merely shows the user the progress of the
algorithm.
The input to FastFloyd is a matrix of all the Euclidean distances for all pairs
Code 26.16 The FastFloyd function.
1 # swissroll.py
2 def FastFloyd(w):
3     d = w + 0                                    # copy of the input distances
4     N = len(d)
5     oldd = d + 0
6     for k in range(N):
7         print(str(k) + ' ', end='')              # show progress
8         newd = np.add.outer(oldd[:,k], oldd[k])  # paths through point k
9         m = newd > 700                           # flag impossibly long paths
10         newd = (1-m)*newd + m*oldd
11         mask = newd < oldd                       # keep whichever is shorter
12         mmask = 1-mask
13         g = mask*newd + mmask*oldd
14         oldd = g + 0
15     return g
of points. The Floyd-Warshall algorithm then searches for shorter distances using
combinations of intermediate data points. Code 26.17 shows the Neighbors function,
which converts the data to Euclidean distances before FastFloyd is called. The
result is a matrix that contains the geodesic distances for all possible pairs of points.
Finally, the k-means algorithm is modified. In the original version the vectors were
assigned to the cluster that was closest to the vector in a Euclidean sense. In this new
version the vector is assigned to the cluster that is closest in a geodesic sense. So, the
AssignMembership algorithm is modified. It first finds the data point that is closest to
each cluster. Then, it adds that distance to the geodesic distance from that closest point
to each data point. This is the distance from the cluster to all of the data points, and these
distances are computed for all clusters. The last for-loop considers each data point, finds
the cluster that is closest, and assigns the data point to that cluster.
Code 26.18 displays the new AssignMembership function. Following it are the
Python commands to run the new k-means algorithm. Note that the ClusterAverage
function comes from the k-means module, whereas AssignMembership is the newly defined
function. Figure 26.5 displays the results from this modification. The results show that
the clusters tend to capture points along the spiral arm, which is the desired result.
The number of clusters in the k-means algorithm is established by the user, usually
with very little information. If too few clusters are created then the variance in the clusters
Code 26.17 The Neighbors function.
1 # swissroll.py
2 def Neighbors(data):
3     ND = len(data)
4     d = np.zeros((ND, ND), float)
5     for i in range(ND):
6         for j in range(i):
7             a = data[i] - data[j]
8             a = np.sqrt((a*a).sum())   # Euclidean distance
9             d[i,j] = d[j,i] = a        # the matrix is symmetric
10     return d
Code 26.18 The AssignMembership function.
1 # swissroll.py
2 def AssignMembership(clusts, data, floyd):
3     mmb = []
4     NC = len(clusts)
5     ND = len(data)
6     for i in range(NC):
7         mmb.append([])
8     dists = np.zeros((NC, ND), float)
9     for i in range(NC):
10         d = np.zeros(ND, float)
11         for j in range(ND):
12             t = clusts[i] - data[j]
13             d[j] = np.sqrt(sum(t*t))       # Euclidean distance to the cluster
14         mn = d.argmin()                    # data point nearest the cluster
15         mndist = d[mn]
16         dists[i] = mndist + floyd[mn]      # geodesic distance to all points
17     for i in range(ND):
18         mn = dists[:,i].argmin()           # closest cluster, geodesically
19         mmb[mn].append(i)
20     return mmb
becomes large. This means that some clusters are collecting vectors that are not self-similar.
If there are too many clusters then some clusters are very similar to others. One method
of approaching this problem is to dynamically change the number of clusters. The system
needs to detect when there are too many or too few clusters and make the appropriate
adjustments.
The variance is measured by Equation (26.2) and remains small as long as the cluster
contains similar constituents. Dissimilar vectors will increase the variance, but Equation
(26.2) does not indicate which vector is the culprit. This can actually be determined, but
if there is more than one outlier then isolating the outliers does not necessarily indicate
how many new clusters are needed. Thus, a simple approach is to detect that a cluster
has a high variance, randomly split its vectors into two new clusters, and then allow the
k-means iterations to sort it all out.
To detect if two clusters are similar, the cluster average vectors are compared to one
another. If they are similar then the constituents of the two clusters can be combined into
a single cluster. This is also a very simple but effective approach.
Code 26.19 generates a set of data with six seeds. The data is shown in Figure 26.6.
In this case two of the clusters overlap, and thus there are five blocks of data.
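Code 26.19 itself is not included in this excerpt. A minimal sketch of the data-generation half it describes might look like the following; the function name, the number of points per seed, and the spread of each block are assumptions:

    # Sketch of generating six blocks of data, one about each random seed.
    import numpy as np

    def SixSeeds(NperSeed=30):
        seeds = np.random.rand(6, 2)              # six random seed locations
        data = np.zeros((6*NperSeed, 2), float)
        for i in range(6):
            # scatter a tight block of points about seed i
            block = seeds[i] + 0.05*np.random.randn(NperSeed, 2)
            data[i*NperSeed:(i+1)*NperSeed] = block
        return data

With random seeds, two of the blocks can land close enough to overlap, which is the situation shown in Figure 26.6.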
Even knowing that there are five clusters does not guarantee that the k-means will
cluster correctly. The results from Code 26.19 are shown in Figure 26.6. The first cluster
is marked by crosses and shares a block of data points with two other clusters. The
second cluster is marked with xs and includes two blocks of data. Even though the data
is inherently well separated the clustering did not perform as expected.
Code 26.20 shows the intra-cluster variances and then the inter-cluster differences.
In the latter case the cluster numbers are printed before the difference between them.
The intra-cluster variance measures the similarity within a cluster, and if it gets
too large then the cluster should be split. In the example case it is evident that the first
cluster should be split. Thus a threshold between 0.007 and 0.010 is needed to define the
clusters that need to be split.
The inter-cluster difference measures the similarity between cluster average vectors.
If this is too small then the cluster average vectors are close together and the clusters
should be joined. It is obvious from Figure 26.6 that clusters 2 and 4 should be combined.
In Code
Figure 26.6: Five clusters data.
12 1 0 1.023
13 2 0 0.080
14 2 1 1.046
15 3 0 0.083
16 3 1 0.963
17 3 2 0.084
18 4 0 0.490
19 4 1 0.567
20 4 2 0.495
21 4 3 0.417
26.20 it is seen that the difference between these two cluster average vectors is 0.4 while
all other vector pairs have a distance greater than 1.
Dynamic clustering would then separate clusters mmb[1] and mmb[4] into two clusters
each and combine clusters mmb[0], mmb[2], and mmb[3]. The splitting of a cluster is
performed randomly. Recall that mmb is a list that contains a list for each cluster.
Randomly splitting a list involves creating two new lists and placing each constituent in
one or the other. In Code 26.21, m1 and m2 are the split of mmb[1]. Likewise m3 and m4 are
the split of mmb[4]. The m5 is the combination of the other three clusters. Figure 26.7
shows the results. The combination works well, but the splitting was done randomly and
so vectors from both groups are in both clusters.
1 # kmeans.py
2 def Split(mmbi):
3     m1, m2 = [], []
4     N = len(mmbi)
5     for i in range(N):
6         r = random.rand()        # coin flip for each member
7         if r < 0.5:
8             m1.append(mmbi[i])
9         else:
10             m2.append(mmbi[i])
11     return m1, m2
The final step is to run the k-means as shown in Code 26.22. Figure 26.8 shows the
results which are more in line with the expected results.
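Code 26.22 is not reproduced here; based on the description above, the commands it contains are plausibly of the following form, reusing the Split function and the ClusterAverage and AssignMembership pair from the k-means module:

    # Sketch: split the two overloaded clusters, merge the three similar
    # ones, and let further k-means iterations sort out the details.
    m1, m2 = Split(mmb[1])
    m3, m4 = Split(mmb[4])
    m5 = mmb[0] + mmb[2] + mmb[3]
    mmb = [m1, m2, m3, m4, m5]
    clusts = ClusterAverage(mmb, data)    # then iterate as before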
As shown in the previous example, k-means may not solve even simple cases without some
aid. Or did it? Perhaps the solution shown in Figure 26.6 is the one better suited to the
application. In reality, the interpretation of the final results is completely up to the user.
The danger of using k-means (or any clustering algorithm) is to trust the results without
testing. Sometimes a different initialization will produce very different clusters. So, in
designing a problem that will be solved by k-means it is necessary to also design a test to
see if the
Figure 26.7: New clusters after splitting and combining.
Figure 26.8: Clusters after running Code 26.22.
clusters are as desired. It may be necessary to compute new clusters, change the data,
change the algorithms, or split and combine clusters.
Finally, large problems may consume too much computer time and so a process
of hierarchical clustering can be employed. Basically, the data is clustered into a small
number of clusters (thus keeping computations to a minimum). Once those clusters are
computed the data in each can be clustered again into smaller sub-clusters.
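Hierarchical clustering can be sketched in a few lines, reusing the KMeans driver sketched earlier in this chapter; the function name and the choice of K at each level are assumptions:

    # Sketch of hierarchical clustering: a coarse pass, then a second
    # k-means pass inside each of the coarse clusters.
    import numpy as np

    def Hierarchical(data, K1=2, K2=3):
        clusts, mmb = KMeans(data, K1)      # a few top-level clusters
        sub = []
        for members in mmb:
            subset = data[members]          # the vectors of one cluster
            sub.append(KMeans(subset, K2))  # cluster them again
        return sub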
26.6 Summary
Clustering is a generic class of algorithms that attempts to organize the data in terms
of self-similarity. This is a difficult task as similarity measures may be inadequate. One
of the most popular methods of clustering is k-means which requires the user to define
the number of desired clusters and the similarity metric. The algorithm iterates between
defining clusters and moving data vectors between clusters. It is a very easy algorithm to
implement and can often provide sufficient results.
However, more complicated problems will require modifications to the algorithm.
This will require the user to understand the nature of the data and to define data conver-
sions to improve performance.
Users should be very aware that there is no magic clustering algorithm. It is neces-
sary to understand the problem, the source and nature of the data, and to have expecta-
tions of results. Clustering results should be tested to determine if the clusters have the
desired properties as well.
Problems
1. Create a set of vectors of the form cos(0.1x + r). Each vector should be N in length,
where x is the index (0, 1, 2, ..., N−1) and r is a random number. Cluster these
vectors using k-means. Plot all of the vectors of a single cluster in a single graph.
Repeat for all clusters. Using these plots, show how the k-means clustered the vectors.
2. Repeat Problem 1 using cos(0.1x) + r. Compute the clusters using k-means and
plot. Explain what the clustering algorithm did.
3. Modify k-means such that the measure of similarity between two vectors is not the
distance but the inner product.
5. Modify k-means so that it will cluster strings instead of vectors. Create many random
DNA strings of length N. Cluster these strings. Each cluster should have a set of
strings in which some of the elements are common. In other words, the first cluster
contains a set of strings in which all of the m-th elements are 'T'. For each cluster
find the positions in the strings that have the same letter.
6. Repeat Problem 5, but for each cluster find the positions in the strings that have
common letters. For example, 75% of the m-th elements in the strings in the n-th
cluster were 'A'.
7. Hierarchical clustering. Generate data similar to Figure 26.6. Run k-means for
K = 2. Treat each of the clusters as a new data set. Run k-means on each of the
new data sets. Plot the results in a fashion similar to Figure 26.8.
Chapter 27
Text Mining
Biological information tends to be more qualitative than quantitative. The result is that
a lot of the information is presented as textual descriptions rather than equations and
numbers. Thus, a field of mining biological texts for information is emerging. Like many
topics in this book this field is large in scope and evolving. Thus, only a few introductory
topics are presented here, and readers desiring more information should consider resources
dedicated solely to this topic.
27.1 Introduction
The goal of text mining in this chapter is to extract information from written documents.
While that sounds fairly straightforward, it is in fact a difficult task. A scientific document
presents information in many different forms: text, equations, tables, figures, images, etc.
Each of these requires a separate method of extracting and understanding the information
from the text. For this chapter the concern will be limited to only the text.
Even if the text is extracted and statistically analyzed, there is not a direct path to
grasping the content contained within the document. The document offers text, but there
is still the desire to extract an understanding of the ideas therein. This is a most difficult
task that has kept researchers busy for several decades, and will continue to do so. This
chapter will consider simple methods of comparing documents and thereby associating
documents. These are only the basics of a burgeoning field.
27.2 Data
The data set starts with written texts, which are now abundantly available from web
resources such as CiteSeer. These are commonly provided as PDF documents, which need
to be converted to text files so they can be loaded into Python routines. Some PDF files
allow the user to save the file as text, and some will allow the user to copy and paste the
text into a simple text editor. There are also programs available that will convert PDF
files into text files. Programs such as pyPdf[Fenniak, 2011] can be employed to read PDF
files directly into a Python program.
The text file will contain more than just the text. Symbols will appear where
the original text had equations or images. Furthermore, the text contains punctuation,
capitalizations, and non-alphabetic characters. Since the purpose is to associate text
between documents it is necessary to remove many of these spurious characters.
Code 27.1 shows the Hoover function which cleans up the text string. Line 3
converts all letters to lower case and line 4 converts all newline characters to spaces. Each
letter has an ASCII integer equivalent: the space character is 32 and 'a' through 'z' are
97 through 122. The chr function converts an integer into a character. This function
replaces all characters that do not have the correct ASCII codes with an empty string,
effectively removing these characters. This step can easily remove more than 10% of the
characters from the original string.
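Code 27.1 is not reproduced in this excerpt; a minimal sketch that matches the description (lower case, newlines to spaces, and only the space character and 'a' through 'z' retained) is:

    # miner.py (sketch): vacuum spurious characters out of a text string.
    def Hoover(txt):
        work = txt.lower()               # lower case (line 3)
        work = work.replace('\n', ' ')   # newlines become spaces (line 4)
        keep = set(chr(c) for c in [32] + list(range(97, 123)))  # ' ' and a-z
        out = ''
        for ch in work:
            if ch in keep:
                out += ch                # keep only the legal characters
        return out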
Python offers several tools that can manipulate long strings of data, and the fastest is the
dictionary. For example, it may be desired to know the location of every word in the text.
Each word is used as a key, and the data for each key is a list of the locations of that
word. The function AllWordDict in Code 27.2 creates a dictionary dct in line 3 and
considers each word in the list work. If the word is not in the dictionary then an entry
is created in line 10 using the word as the key and a list containing the variable i as the
data. If the word is already in the dictionary then the list is appended with the value i
in line 8.
Code 27.2 The AllWordDict function.
1 # miner.py
2 def AllWordDict(txt):
3     dct = {}
4     work = txt.split()
5     for i in range(len(work)):
6         wd = work[i]
7         if wd in dct:
8             dct[wd].append(i)      # word seen before: add this location
9         else:
10             dct[wd] = [i]          # first sighting of this word
11     return dct
It should be noted that the variable i is the location in the list work and not a
location in the string. For the example text used, the word ‘primitives’ appeared in three
locations. In Code 27.3 the first 100 characters of the text are shown, and the entry from
the dictionary for the word ‘primitives’ is also shown. As can be seen, the first returned
value is 1, which corresponds to the second word in the text and not a position in the string.
In many text mining procedures the distance between two words a and b is measured by
the number of words between them instead of the number of characters between them.
In this example there are 745 individual words. However, many of them are simple
words such as ‘and’, ‘of’, ‘the’, etc., which are not useful. Another concern is that some
words are similar except for their endings: ‘computations’, ‘computational’, etc. Dealing
with these issues is rather involved, and for the current discussion a simple approach is
used which can be replaced later. In this simple approach only the first five letters of the
words are used. Words that are shorter than five letters are discarded, and words with the
same first five letters are considered to be the same word. This is horrendously simple and
certainly not the approach that a professional system would use. However, this chapter is
designed to demonstrate methods of relating documents and is not as concerned with word
stemming. Thus, the simple method, which does perform well enough, is favored over a
more involved but significantly better method.
Code 27.4 shows the function FiveLetterDict, which modifies AllWordDict to
include only words of five letters or more and to only consider the first five letters. The
number of entries in this dictionary for the example text is nearly half that of the previous
dictionary.
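Code 27.4 is not reproduced in this excerpt; a sketch consistent with the description, built directly on AllWordDict's structure, is:

    # miner.py (sketch): AllWordDict restricted to the first five letters
    # of words that are at least five letters long.
    def FiveLetterDict(txt):
        dct = {}
        work = txt.split()
        for i in range(len(work)):
            wd = work[i]
            if len(wd) < 5:
                continue              # discard short words
            wd = wd[:5]               # keep only the first five letters
            if wd in dct:
                dct[wd].append(i)
            else:
                dct[wd] = [i]
        return dct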
The use of the first five letters is a very simple (and poorly performing) solution to a
complicated problem. This section presents a few other approaches that could be used.
Porter stemming[Porter, 2011] is a method that attempts to remove suffixes from English
words. This procedure attempts to remove or replace common suffixes such as -ing, -ed,
-ize, -ance, etc. This is not an easy task as the rules do not remain constant. For example,
the word ‘petting’ should be reduced to ‘pet’ whereas ‘billing’ reduces to ‘bill’. In one
case one of the double consonants is removed and in the other case it is not. Still more
confounding are words that have one of the target suffixes as part of the root word,
such as ‘string’ which ends with ‘ing’.
Computer code for almost any language is found at [Porter, 2011], including Python
code. While this program works well for many different words it is not perfect. Code
27.5 shows some of the more disappointing results. These are not shown to belittle
Porter stemming but rather to demonstrate that algorithms do not perform perfectly, and
the reader should be aware of performance issues in the programs that they use. Many more
example words were properly stemmed, and this example merely shows that stemming is
a very difficult task.
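As a concrete alternative to retyping the reference code, the Porter algorithm is also packaged in the NLTK library; the following short example assumes that the nltk package has been installed:

    # Porter stemming via NLTK (assumes the nltk package is installed).
    from nltk.stem.porter import PorterStemmer

    ps = PorterStemmer()
    print(ps.stem('petting'))   # 'pet'   : one of the double consonants removed
    print(ps.stem('billing'))   # 'bill'  : the double 'll' is kept
    print(ps.stem('string'))    # 'string': here 'ing' is part of the root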
27.4.2 Suffix Trees
A suffix tree is a common tool for organizing string information. There are several flavors
of suffix trees, and the one used here is designed for identifying suffixes that can be
removed. Given a string of letters, a suffix tree builds branches at locations in which words
begin to differ. An example for (‘battle’, ‘batter’, ‘bats’) is shown in Figure 27.1. In this
case the first node in the tree is ‘bat’ because all words in the list begin with ‘bat’. At that
point there is a split in the tree, as some words have a ‘t’ for the fourth letter and another
has an ‘s’. Along the ‘t’ branch there is another split at the next position. The goal would
be to identify groups of nodes that commonly follow a stem. In this case, three of the
four subsequent nodes are common suffixes and the other node (‘t’) is a common addition
before some stems.
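A dictionary-of-dictionaries trie is one simple way to build such a branching structure. Note that this sketch is a character trie rather than a full compressed suffix tree, and the '$' end marker is an implementation convention, not part of the words:

    # Build a character trie for a list of words; branches appear
    # exactly where the words begin to differ.
    def BuildTrie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})   # follow or create a branch
            node['$'] = {}                       # mark the end of a word
        return root

    trie = BuildTrie(('battle', 'batter', 'bats'))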
In this section the simple task of comparing documents according to word frequencies
is considered. Certainly, document analysis is a far more complicated topic and readers
interested in this topic are encouraged to examine research that exceeds the scope of this
text.
The tasks to be accomplished here are to extract the frequencies of words, to find
words that are seen more (or less) frequently than normal, and to isolate words that are
indicative of the topic.
27.5.1 Data
Data consists of documents concerning at least two different topics. In this example the
topics are actin fibers and CBIR (content based image retrieval). These are very different
topics, and so there should be a set of words that strongly indicate which topic a document
discusses.
The phrase positive documents is applied to those documents concerning a target
topic. In this case, actin fibers are considered to be the positive topic and thus all documents
concerning this topic are considered to be positive. All other documents are considered
to be negative. In this case there is only one other topic, but usually there are multiple
topics and so the negative documents cover all topics that are not actin fibers.
To facilitate organization, documents for a single topic are located in a single directory.
Thus, it is easy to load all documents from a single topic. Each directory should have
several documents, and the function AllDcts shown in Code 27.6 gathers the dictionaries
for all documents in a single directory. It receives two arguments, of which the first is a list
of dictionaries. Initially this is an empty list, but as more topics are considered it grows. The
second argument is a specific directory. The function is called in line 11 with an empty
list as the input. This creates all of the dictionaries for the actin topic. As seen
there are 23 documents. The second call to AllDcts pursues the CBIR documents, and
the list dcts grows to 48. This process can continue for each topic.
10 >>> dcts = []
11 >>> miner.AllDcts(dcts, 'data/mining/actin')
12 >>> len(dcts)
13 23
14 >>> miner.AllDcts(dcts, 'data/mining/cbir')
15 >>> len(dcts)
16 48
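Code 27.6 is not reproduced in this excerpt; one plausible form of AllDcts, assuming the files in each directory are plain text already extracted from the PDFs, is:

    # miner.py (sketch): build one dictionary per document in a directory.
    import os

    def AllDcts(dcts, dirname):
        for fname in os.listdir(dirname):
            txt = open(os.path.join(dirname, fname)).read()
            txt = Hoover(txt)                  # clean the raw text
            dcts.append(FiveLetterDict(txt))   # one dictionary per document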
The final result is a list of dictionaries. In this case it is known that the first 23
dictionaries are related to actin documents and the next 25 are related to CBIR documents.
27.5.2 Word Frequency
The word frequency matrix wfm will contain the frequency of each word in each document,
with wfm[i,j] equal to the frequency of the j-th word in the i-th document. The construction
of wfm begins with the word count matrix wcm, which collects the number of times the j-th
word is seen in the i-th document. However, each document has a different set of words, and
so it is prudent to collect the list of words from all documents before allocating space for
wcm.
The word list is created by GoodWords, shown in Code 27.7. This program loops
through the individual dictionaries and collects all of the words into gw. Since words can
appear in more than one document, the set function is used to pare the list down to one
copy of each individual word. The list function is used to convert the set back to a
list for processing in subsequent functions. In all of the documents that were considered
there were 8028 unique words that had five or more letters.
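Code 27.7 is not shown in this excerpt; a sketch consistent with the description is:

    # miner.py (sketch): one copy of every word seen in any document.
    def GoodWords(dcts):
        gw = []
        for dct in dcts:
            gw += list(dct.keys())    # all words from this document
        return list(set(gw))          # set removes the duplicate words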
Code 27.8 The WordCountMat function.
1 # miner.py
2 def WordCountMat(dcts):
3     ND = len(dcts)
4     LW = len(goodwords)        # goodwords: the global list built by GoodWords
5     wcmat = np.zeros((ND, LW), int)
6     for i in range(ND):
7         for j in range(LW):
8             if goodwords[j] in dcts[i]:
9                 wcmat[i,j] = len(dcts[i][goodwords[j]])   # count = number of locations
10     return wcmat
Simple words such as ‘the’ and ‘and’ are not included. The most common word is located
at position 6380, and it is ‘image’. This makes sense since the second topic is a type of
image analysis, and ‘image’ is a common word that could easily appear in the actin
documents.
The frequency of a word is the number of times the word is seen divided by the total
number of words. This is defined as the probability

$$ P(W_{i,j}) = \frac{C_{i,j}}{\sum_j C_{i,j}}, \qquad (27.1) $$

where $C_{i,j}$ is the $i,j$ component of the word count matrix, represented by wcm in the Codes.
Code 27.10 shows the WordFreqMatrix function, which converts wcm to the word
frequency matrix wfm by performing the first order normalization of Equation (27.1)
on each row of the matrix. The output shows that document[9] contains almost 75% of the
occurrences of ‘wasps’.
The next step determines if this frequency is above or below normal, which
requires that the average of each column be computed. The probability of a word occurring in
any document is computed by

$$ P(W_j) = \frac{\sum_i F_{i,j}}{\sum_{i,j} F_{i,j}}, \qquad (27.2) $$

where $F_{i,j}$ is the word frequency matrix represented by wfm in Code 27.10.
The WordProb function in Code 27.11 normalizes each column and computes the
probability of each word occurring in any document. The results for three of the words are
printed, and the word ‘wasps’ has a probability of appearing in any document of $3.54 \times 10^{-3}$.
With normalized data in hand it is possible to relate one word to another and
therefore relate sets of documents to each other.
The overall goal is to classify documents based on their word frequency. Thus, the search is
for documents that have a set of words that are seen frequently in the positive documents
Code 27.10 The WordFreqMatrix function.
1 # miner.py
2 def WordFreqMatrix(wcmat):
3     V = len(wcmat)
4     pmat = np.zeros(wcmat.shape, float)
5     for i in range(V):
6         pmat[i] = wcmat[i] / float(wcmat[i].sum())   # normalize each row
7     return pmat
and quite rarely in the negative documents. Furthermore, it is desired that the positive
words appear in a large number of the positive documents.
Using the normalized data, this process is performed by the IndicWords function
shown in Code 27.12. It computes the word frequency matrix and the probability vector
and then creates two new vectors. The first is pscores, which holds the scores of the positive
documents. These scores will be high if the words appear mostly in the positive documents
and in several positive documents. The counter to this is nscores, which is the same score
for the negative documents. The final score is the ratio of the two, with a small value p
included to prevent divide-by-zero errors. The output is a vector of scores, and the highest
score is the most indicative word.
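Code 27.12 is not reproduced here, and the exact scoring formula is not given in this excerpt; the following sketch is only one plausible reading of the description, where NP is the number of positive documents (assumed to be the first NP rows of the matrix) and p is the small value that prevents division by zero:

    # Sketch of an indicative-word score: frequency in the positive
    # documents, boosted by how many positive documents contain the word,
    # divided by the same quantity for the negative documents.
    import numpy as np

    def IndicWords(wcmat, NP, p=1e-5):
        wfm = WordFreqMatrix(wcmat)               # per-document frequencies
        pscores = wfm[:NP].sum(0) * (wfm[:NP] > 0).sum(0)
        nscores = wfm[NP:].sum(0) * (wfm[NP:] > 0).sum(0)
        return pscores / (nscores + p)            # high = indicative word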
The function is called in Code 27.13 and the highest score is shown to be 603. The
scores are sorted in reverse order so that the highest scores are first in the list. Thus,
the word with the highest score is word 7125, which is shown to be the word ‘actin’. This
makes sense since this is the topic of the positive documents and it is not a word that
should coincidentally appear in the CBIR documents.
27.5.4 Document Classification
The final step is to classify a document. This step should be applied to a new document
rather than the training documents; that is left as an exercise for the reader. Instead,
the scores of the training documents are computed.
The process is simply to accumulate the scores of the words that are in a document.
This is a very simple approach, and certainly more involved scoring algorithms can be
created. This method, for example, does not consider how many times a word is in a
document. This process also does not consider the length of the document. Code 27.14
shows the process. The value sc is the score and is set to 0. Starting in line 2, the dictionary
from the first document is considered. The index ndx of each word in the dictionary is
obtained, and the score for that index is accumulated. The score for this document is
almost 4883. This is a positive document.
1 >>> sc = 0
2 >>> for k in dcts[0].keys():
3         ndx = gw.index(k)
4         sc += scores[ndx]
5 >>> sc
6 4882.9162008938792
7
8 >>> sc = 0
9 >>> for k in dcts[25].keys():
10         ndx = gw.index(k)
11         sc += scores[ndx]
12 >>> sc
13 473.6345735281418
A negative document is considered in the second half of Code 27.14. This is a CBIR
document and as seen the score is quite low.
The real goal is to consider a new document that has not been classified by the
reader. The document is cleaned and its dictionary is created using the same steps as
above. Then the score is computed. If the score is high then it is considered to be a
positive document.
27.6 Summary
The process outlined above is simple compared to more professional approaches. Several
shortcuts were taken, yet the process still shows great potential for crudely classifying
documents. Improvements can include stemming and better scoring algorithms.
This process only considered word frequency and did not consider the relationships
between words. Is there more information if common pairings of words are considered?
Obviously, syntax and meaning were not even discussed but could be valuable tools in
text mining.
Problems
1. The GoodWords function reduced the number of words. Compute the percentage
of words that were kept by this process compared to the original number of unique
words.
2. Gather documents for a single topic. What is the most common word in those
documents?
3. Do the frequencies of simple words (’a’, ’the’, ’and’, etc.) change for documents
concerning different topics?
4. Repeat the process that ended in Code 27.14 but use the CBIR documents as the
positive documents. Compute the scores for the same two documents that were
scored in Code 27.14.
5. Consider two similar topics. Collect documents on two topics ‘gene sequencing’ and
‘gene functionality’. Determine the score (such as in Code 27.14) for each document
and declare if this method can classify closely related documents.
6. Consider two similar topics. Collect documents on two topics ‘gene sequencing’
and ‘gene functionality’. What are the strongly indicative words that can separate
documents for this case? Are these words indicative of the topics?
Part IV
Database
Chapter 28
Data sets often have multiple entities and efficient queries of that data require that the
data be organized. Consider a biology case in which the data from a species includes
several genomes, each with thousands of genes, each with similarities to other genes, and
a list of publications that deal with the genes. To complicate matters, the publications
could cover a small subset of genes from different genomes. The data set is complicated
and a myriad of queries can be conceived.
This data set contains different data types and various connections between the
data. It is possible to store the data in a flat file which is basically all of the data in one
table, but that would be very inefficient and highly prone to errors. A more organized
approach would be to have tables of data dedicated to individual data types or connections.
For example, one table (or page in a spreadsheet) would contain information about the
genomes. Another table would contain information about the individual genes. Both of
these tables are dedicated to storing information about a unique type of data. A third
table would then be used to contain information about which gene is in each genome. This
table is dedicated to the connections between data types.
Certainly, a spreadsheet could archive several tables of various data types as long
as the amount of data does not exceed the limits of the spreadsheet program. It is also
possible to ask questions of the data using the spreadsheet. However, as the questions
become complicated the inherent limitations of a spreadsheet become apparent. Thus,
comes the need for a DBMS (database management system). This chapter will explore
the use of a spreadsheet in pursuing some queries and the following chapters will explore
the same issues with the use of a DBMS.
A DBMS offers far more utility than just the pursuit of a complicated query. There
are several issues that arise when dealing with large data, dynamic data, several users
and so on. A spreadsheet program has strict limits when dealing with some of these issues.
These are:
1. Redundancy: The same data is stored in multiple locations in the tables. Such a
case indicates that the database is poorly designed. The main problem is that data
changes, and if it changes in one location but not another then the data disagrees
with itself, rendering the results unreliable.
2. Difficulty in Accession: If person A has the data how can person B get to it?
4. Integrity: Some types of data must stay within bounds. A zip code should have
five digits.
5. Atomicity: Account A sends $25 to account B. However, during the transfer there
is a fault in one of the computer systems. The money is subtracted from account
A but never added to account B. Such data transfers should be atomic. Either it
completely works or it is aborted.
6. Concurrent Access: Consider a case in which person A and person B both have
access to the same bank account. The account has $100 and both try to withdraw
$60 at the same time. The system should not allow both withdrawals.
7. Security: Who is allowed to see which data? Who is allowed to alter which data?
In order to demonstrate the functionality of the different methods of query, a simple data
set is employed. The use of scientific databases often comes with two different issues: the
first is the question of how to use the query systems, and the second is to understand the
contents of the data. As the second concern is not important to the following chapters, a
much simpler database will be used to demonstrate the different query commands.
This database is an extremely abridged movie database that contains only a few
movies and only a few actors (or actresses) from those movies. While it is very incomplete
it is sufficient to demonstrate the query processes.
Whether in a spreadsheet or a database, the data is stored in a set of tables. In a
spreadsheet this is a collection of pages in the file. The movie database has seven tables
in two categories. The first category is the collection of tables that contain collected data.
These are:
The movie table contains the name of the movies, the year released and a quality
grade.
The actor table contains the first and last names of the actors.
The country table contains the list of countries from which movies were filmed.
The lang table contains the list of languages in the movies.
The second category is the set of tables that connect the previous tables together. These
are:
The isin table connects movies and actors.
The inlang table connects movies and languages.
The incountry table connects movies and countries.
In a spreadsheet each row has a unique identifier which is the row number. The
same uniqueness applies to databases in which each table must have a primary key. So,
one column of data is designated as this key and it must contain unique data. In the case
of movies none of the data fields qualify. There are movies with the same name, movies
that are released in the same year and movies that have the same grade. Therefore it is
necessary to have a new field which is simply unique integers that are this primary key.
In a spreadsheet this field looks redundant to the row numbers, but as this data will be
migrated to a database in later chapters the primary key is included for all tables.
The beginning of the movie table is shown in Figure 28.1. There are four fields or
columns. The first is the primary key, mid, which is just incrementing integers. The other
three columns are the name, year and grade. As seen, not all of the fields have values.
While the data is available, it is not included, to simulate the cases of missing data that
are common in a lot of data collection.
Figure 28.2 shows the beginning of the actor data. There are three columns: the
aid (actor ID), the first name, and the last name. The data is not sorted in any particular
order; sorting will be performed during the query. As actors are added to the database
they will be appended to the bottom of the list. It is important that the association of actors
with their aid not be changed, which means that the data is usually stored in the order that
it was collected rather than in a sorted order.
It is possible to store this data in a single table. For example a table could be created
that has the movie name, year, grade, and several columns for the names of the actors.
Such a design causes issues. The first is that the number of actors is not a set number and
in fact does not have a maximum value. The next movie recorded could have more actors
than any other movie to date. The second problem is that actors appear in many movies
and so their names would appear in multiple rows. It is possible that one entry could be
Figure 28.2: The actor data.
misspelled and then the actor has two different names in the database. One rule of thumb
for designing a database is that the data should not be duplicated. So, the actor’s name
should appear only once as it does in the current design. The third problem involves the
design of the queries. In the proposed flat file it would be easy to find the names of the
actors in a single movie, but it would be cumbersome to find all of the movies from a
single actor.
The proper design then creates one table that contains information about individual
movies. The data contained there have single entries in each field. In other words, this
includes information such as name, year and grade of which each movie only has one value.
Information that has multiple entries such as the list of actors, countries used in filming
and languages are then placed in other tables. The actor table contains information that
is unique to each actor. In this case, that is their names, but a more extensive database
would contain a birth date, birth location, and other information of which each actor has
only one entry.
The connection of the movies and actors is contained in the isin table shown in
Figure 28.3. This table contains three columns. The first is iid which is the primary
key and merely incrementing values. The other two are the mid and aid which relate the
movie ID to the actor ID. The first entry has mid = 3 and aid = 4. In the movie table
the mid of 3 is the movie A Touch of Evil and the actor with aid = 4 is Orson Welles.
In this manner, the isin table connects the actors and movies. It is the same amount of
work to collect all of the actors in a given movie as it is to find all of the movies of a given
actor.
This database is very small and very incomplete. Here only a few actors are listed
for each movie and some movies have no actors in the database. Furthermore, readers
may disagree with the grade of some movies as this is merely an opinion garnered from
one of several sources.
The fields in each table are shown in Figure 31.1. Each block is a table and the first
entry is the primary key, which in this database is always an integer. The rest of the fields
are self-explanatory.
The isin table connects the movies and actors. In a similar fashion the inlang table
connects movies and languages through the mid and lid, and the incountry table connects
movies and countries through a cid. Now it is possible to answer questions such as: Which
countries were used in filming movies that starred Daniel Radcliffe? The query would start
with the name of the actor, fetch his aid, then his movie mid values, and from there the
countries of those movies.
Now that a database is in place it is possible to ask a series of questions or queries. These
queries will be used both in the spreadsheet and database chapters. The goal is to show
how such queries are approached and that spreadsheets are limited in their ability to
retrieve answers to queries.
The list is:
1. Return the name of the movie with mid = 200.
2. List the movies that were released in 1955.
3. List the movies (name, grade, and mid) with a grade of 9 or 10.
4. List the name of the movies that were released in the 1950’s.
5. List the years of the movies that had a grade of 1, but list each year only once.
6. Return the number of actors in the movie with mid = 200.
7. Compute the average grade of the movies from the 1950’s.
8. Compute the average and standard deviation of the length of the movie names.
9. List the first names of the actors whose last name is Keaton.
10. List the first and last names of the actors with the letters “John” in the first name.
11. List the first and last names of the actors that have two parts in the first name field.
12. List the actors that have matching initials in their first and last names.
13. List the last names in alphabetical order of all of the actors that have the first name
of John.
14. Find the movies with the longest names.
15. List the actors that have the letters “as” in the first name and sort by the location
of those letters.
16. Compute the average grade for each year.
17. Compute the average grade for each year and sort by the average grade.
18. Compute the average grade for each year and sort by the average grade but the year
must have more than five movies.
19. Return the names of all of the movies that had the actor with mid = 281.
20. Return the names of the movies which had John Goodman as an actor.
21. Compute the average grade of the movies which had John Goodman as an actor.
22. List the titles of all of the movies that are in French.
23. Without duplication, list the languages of the Peter Falk movies.
24. List the movies that have both Daniel Radcliffe and Maggie Smith.
25. List the other actors that are in the movies with Daniel Radcliffe and Maggie Smith.
26. List the mid and title of each movie that has the word “under” in the title along
with the aid of each actor in that movie. Thus, if there are five actors in the movie
then there will be five returned answers, each with that same movie and a different
actor.
27. Return the names of the five actors that are listed as having been in the most movies.
28. Return the names and average grade of the five best actors (those with the highest
average) grade that have been in at least five movies.
29. Compute the average grade for each decade.
30. Using the Kevin Bacon Effect find the two actors that are the farthest apart.
Many of the queries can be answered through manual manipulation of the data in a
spreadsheet. Some of the queries are very difficult to accomplish in this manner. This
section will show how many of the above queries can be answered within the realm of a
spreadsheet. Some of the methods require human intervention which could easily become
untenable if the data set became large.
Query Q1 asks for the name of the movie with mid = 200. This is easily accom-
plished by just scrolling down the movie page until this mid is visible. The answer is Once
Upon a Time in the West.
Query Q2 seeks the movies that were released in 1955. There is more than one good
solution to this problem. One method would be to sort the data by the year and then
scroll to the entries from the desired year.
A second method is to use the filter function of the spreadsheet. The filter hides
the rows that do not pass the filter condition. In this case it is possible to set the filter
to show only the rows that have the year 1955. The other rows are not removed; they
are merely hidden from view. Figure 28.4 shows the filter dialog in LibreOffice, which is
obtained by the menu choices Data:More Filters:Standard Filter. In this case the user
selects the condition that column C must be equal to 1955. The result is shown in Figure 28.5,
which shows only the rows where that condition is true. The other rows do exist but they
are hidden from view.
Query Q3 seeks the movies with a grade of 9 or better. This can also be
accomplished by simply sorting or using the filter methods. However, this query requests
that only part of the information be shown, and in a certain order. Once the rows have
been isolated by either method the user can manually rearrange the results by cutting
and pasting the columns in the desired order. While this is simple enough to do, it does
require that the user intervene in the query process. In other words, a partial result is
obtained and then the user performs more steps to get to the desired result. The process
is not fully automated.
Query Q4 seeks the names of the movies from an entire decade. Again this can be
accomplished by sorting the data on the year or using the filter feature of the spreadsheet.
Query Q5 pursues the movies that have been assigned the lowly grade of 1. It is
possible that some years have multiple movies with this grade, and the query asks
that each year be listed only once. The advanced features of the filter are obtained by
selecting the Options box in the lower left of the filter dialog. This reveals a few options,
of which one is to remove duplicates. For this query only the data in columns C and D are
used. These are copied to a new spreadsheet page and the filter shown in Figure 28.6 is
applied. The result is a few rows that show the years in which there is a movie with a
grade of 1, and each year is shown only once.
Figure 28.6: Using the advanced features of the filter to remove duplicates.
Query Q6 is to return the number of actors in the movie that has mid = 200. This
information is obtained from the isin table. In this table there are entries from rows 2 to
2364 and the mid values are in column B. So the formula =COUNTIF(B2:B2364,200) will
count the number of rows in the table that have mid = 200. In this case there are 5.
To obtain the average grade of the movies from the 1950’s (Query Q7), the data in
the movie table can be sorted by year, and then the user can select the rows from the
desired decade and use the AVERAGE function to compute the average over the grades
from just the selected years. Once again, the user performs one step to manipulate the
data and then performs a second step to get the final result. The user intervenes in the
process to obtain the desired answer.
Query Q8 seeks the average and standard deviation of the length of the movie names.
The length of a string in a cell is computed by the LEN command. Figure 28.7 shows the
use of the command in which the length of cell A2 is placed in B2. This formula is copied
downwards so that the lengths of all of the movie names have been put into column B.
Now the AVERAGE and STDEV functions are used to calculate the results for the values in
this column. The average length is just above 15 characters with a standard deviation
just over 8.
Query Q9 seeks the first names of the actors who have the last name of Keaton.
In a spreadsheet this can be accomplished by sorting the actors on their last names and then
finding the Keatons, or once again using the filter tool. In this database there are three
actors that fit this description: Michael, Diane and Buster. Query Q10 seeks actors that
have the letters “John” in their first name. This query is a bit different in that the first
name could also be Johnny or John Q. The filter tool does have the option of a cell
containing a particular value in the Condition pull-down menu (see Figure 28.4). There
are, in fact, two actors that are named Johnny and one John C.
Query Q11 asks for actors that have two parts to their first name. This will include
people that have two names, one name and an initial or two initials. In all cases, the
two parts are separated by a space and so the search is for first names that contain a
space character. Once again, the filter tool is useful as it can search for a first name that
contains a space character. However, this will also return a few people that have only one
part to their first name. There are a few actors that have a space after their single name
and these are also returned. Of course, the best solution is that these spaces be removed
from the database. Reality, though, is that data can come from sources other than the
database users and the format of the data may not be to the user’s preference. So, as
an academic exercise, the spaces remain and it is up to the user to define a query that
excludes these individuals. The filter tool allows the user to search on multiple criteria as
shown in Figure 28.8. All three of the Value fields have a single blank space in them.
Query Q12 returns the actors with matching initials in their first and last names.
This requires that the first letter of each name be isolated. The function LEFT
grabs the leftmost characters in a string. To get the first letter in cell B2 the formula
is =LEFT(B2,1). Figure 28.9 shows the solution, where cell D2 is the first initial of the
first name and cell E2 is the first initial of the last name. The formula in cell F2 is
=IF(D2=E2,1,0), which places a value of 1 in the cell if the initials match. Once this is
accomplished the data can be sorted on column F to collect the people with the matching
initials. It should be noted that in this method the user had to intervene in the middle of
the process: the sorting stage is applied after column F is computed.
Query Q13 will list the last names of the actors with the first name of John. This
Figure 28.8: Finding individuals with two parts to their first names. Each of the Value fields
contains a single blank space.
listing is to be in alphabetical order. Figure 28.10 shows the sorting dialog in LibreOffice
that sorts on two conditions. The first is the sort on the first names, which will collect all
of the Johns together, and the second is a sort on the last name, which alphabetizes the
Johns (as well as other first names) by their last names. The user then needs to find the
set of John actors and extract the results.
The LEN function is useful for Query Q14, which seeks the movies with the longest
titles. In cell E2 of the movie page the formula to get the length of the name is =LEN(B2).
This formula is copied down for all 800 movies, and then the user can sort on this new
column. Once again, the user must intervene in the process to complete the query.
Query Q15 is similar to previous queries in that the strings in a field are searched for
a particular letter combination. In this case that combination is “as”. However, the results
need to be sorted according to the position of this substring. This is accomplished with
the FIND function. In the actors table, the formula for cell D2 is =FIND("as",B2). In this
case an error code is returned because the name in B2 does not have the target
letters. This formula is copied down for all rows, and for those few rows which contain
an actor’s name that has the target letters a value appears. This value is the location of
the first letter of the target. Thus, for Nicholas Cage, the value in the D column becomes
7. The user can then sort on the D column.
Query Q16 seeks the average grade for each year. Certainly, the movie data can
be sorted by year. It is also possible for the user to compute the average over the range
of movies for a certain year. However, there are about 90 different years in
the database and the number of movies is different for each year. Thus, the user would
have to write the equation to compute the average for each year, as shown in Figure
28.11. That is too tedious, and not a good solution for cases that would have thousands
of segments rather than the 90 in this case.
Query Q17 builds on the previous query so that the data is sorted by the average
grade. If the user slogged through the process of the previous query then they could sort
on the average values that were computed. However, there is a catch. If the data is sorted
on a column that contains formulas then the cells that the formulas reference will also be
changed. So, before the user sorts the data on the average grade, those values will need to
be copied and pasted as values instead of formulas using Paste Special. This converts
the formulas to static values, and then the sort can continue. This query is possible to do but
employs more than one instance of user intervention.
Figure 28.11: A portion of the window that shows the average for each year.
Query Q18 is the same as Q17 except that years that have fewer than five movies are
to be excluded from the results. The user can start with the spreadsheet used in Query
Q16 and simply eliminate those average calculations for years that have fewer than five
movies. This is doable for this example, but if the query had a thousand segments and
the minimum number were much larger than five then it would be a very tedious task for
the user. Furthermore, the user must be actively involved in seeking the answer.
Query Q19 starts a series of queries that use multiple tables. In this case the query
starts with the actor’s aid and seeks the names of the movies for this actor. This is a two
step process in which the aid is used to fetch the mid values from the isin table, and then
the mid values are used to fetch the movie titles from the movie table.
This query is still possible to do, as shown in Figure 28.12. The data in the isin
table is processed by a filter that keeps only the rows in which aid = 281, and as seen
there are four. Column B contains the mid values and these need to be converted into
movie names. Cell D469 contains the formula =OFFSET(movie.B$1,B469,0), which relates
the mid to the movie name as long as the movie data is sorted by the mid. The OFFSET
command starts at cell B1 and then moves downwards, with the number of rows being
specified by the value in cell B469. The third argument of 0 indicates that there is no
horizontal shift. If this value were 1 then the information shown in the cell would be from
the next column to the right, which is the year of the movie.
Query Q20 takes this idea a step further by starting with the actor’s name. The
name is converted to the aid values in the actors table and then this information is
converted to movie titles as in Q19. The user is heavily involved in the steps of this query
as now two levels of OFFSET are needed. Query Q21 is similar except that there is one
more layer: when the movies are collected, the average grade is computed. In
this query there is only one actor and therefore only one aid, so the difficulty is not really
elevated compared to Q19. In a case in which the combined average score of 100 actors is
requested, the level of complexity is increased as the transition from actor’s name to aid
needs to be automated.
Query Q22 is similar to Q20 except that the query starts with a language and goes
through the langs and inlang tables rather than actors and isin tables.
Query Q23 starts with the actor’s name and ends with the languages of the actors.
Thus, it uses in order the actors, isin, inlang and langs tables. There is also a caveat that
the languages be listed only once which can be accomplished with the filter tool using the
option to remove duplicates as in Q5.
The logic changes somewhat with Query Q24, which seeks the names of the movies
that star two individuals. The logic flow is shown in Figure 28.13. The box named actor1
starts with the first and last names of the first actor (Maggie Smith) and converts
this to her aid. The containing box represents the information that is available in the
actor table, and the use of the integer in the table merely separates it from the second
use of the table shown directly below. The actor2 box follows the same logic but for
the second actor, Daniel Radcliffe. The aid of each actor is converted to their personal
mid values through separate uses of the isin table. Then the intersection of their mid values
is obtained and sent to the movie table to get the names of the movies.
Query Q19 demonstrated how the actor’s name is converted to an mid, and that
process is used twice in Q24. Finding the common mid values is shown in Figure 28.14.
The first two columns are the mid values from their movies. The formula in cell C2 is
=MATCH(B2,A$2:A$6,0), which returns the location of the match for cell B2. In other
words, the value in B2 is the first item in the list in column A. The next two movies also
find matches, but the rest do not, which is indicated by #N/A. The formula in cell D2 is
=OFFSET(A$2,C2-1,0) and this returns the value of the match. Thus, all of the values in
column D are the mid values of the movies that both actors are in. The spreadsheet filter
can be used to isolate those from the #N/A entries. Now that the common movies are
found, the process of Query Q1 can be used to extract the names of the movies.

Figure 28.13: The logic flow for obtaining the name of a movie from two actors.
Query Q25 extends this and instead of retrieving the names of the movies, the mid
values would be used to get the actor aid values and then their names. While this query
can be accomplished in a spreadsheet, there are several parts of the query that require
user intervention.
Query Q26 seeks movies with the letters “under” in the title. The twist is that it
also needs to return the aid of the actors in that movie. If a movie has five actors then
the answer should list the movie five times with each time showing a different aid. In a
spreadsheet this challenge starts with the movie title and converts that to the mid then
to multiple aid values and then to the actors' names. The user is heavily involved in walking
this process through the spreadsheet data.
Query Q27 seeks the five actors that have been in the most movies. This
requires that the number of movies for each actor be known. It is possible to sort the isin
table on the aid values and then to count the number of rows for each actor aid. This is
similar to the computation of the average grade for each year, in that it is a doable but
very tedious task. Once the number of movies for each actor is known then the user can
sort on those counts.
Query Q28 seeks the actor with the best average grade, which means that the average
grade for each actor must be computed. Furthermore, the user needs to exclude actors with too few
movies. Again, this is a very tedious task that would be untenable for larger data sets.
Another approach is shown in Figure 28.15 that compares values in multiple columns.
The formula in cell D2 is =COUNTIF(C$2:C$2338,A2). This counts the number of entries
in column C that have the same value as cell A2. The purpose is to count the number of
movies for each actor and since the values in that column are coincidentally the same as
the aid values, this computation also counts the number of movies for the actor with aid
= 1. This formula is copied down and the next step would be to find the maximum value
in this column.
Query Q29 seeks the average grade for each decade which, in a spreadsheet, is easier
than the average grade for each year as there are fewer divisions. So the process of Q16 is
repeated with different divisions.
Query Q30 deals with the Kevin Bacon Effect which follows the links between two
actors through common movies. The idea is that one actor has been in a set of movies
which has other actors. Those actors have a set of movies which have different actors.
This process continues until one of the actors is Kevin Bacon. To get the path from a
single actor to Kevin Bacon is tedious but tenable with a spreadsheet. The final query,
however, searches for the shortest such path between any two actors and this is a job for
a computer program.
Most of the queries in the list can be accomplished with a spreadsheet. Some of the
queries, however, are only workable if the data set is small. Some queries require
user intervention. Intermediate results are returned and then the user must perform an action
such as a filter, a search or a sort. From that process the final answer becomes available.
Thus, the query is not fully automated.
A database management program such as MySQL offers several advantages over a
spreadsheet. These include support for multiple users and for security. It also offers the
advantage of fully automating complex queries. As will be seen in the next chapters, each
of the above queries can be converted to a single MySQL command that returns the final
result. Once the command is written, user intervention is not required.
Problems
1. In the movie spreadsheet, get the actors with an aid between 95 and 100 (inclusive).
2. Using the spreadsheet return a list of actors that have only one of the two name
fields with an entry. Some actors go by a single name and so only one field is used.
3. Return an alphabetized list of the last names of actors that have George as a first
name.
4. Return an alphabetized list of the first names of actors that have Mason as a last
name.
5. Using the spreadsheet determine if there are any movies that have actors from both
of the lists in the previous two questions. Basically, is there a movie with one actor
having a first name of George and another actor having the last name of Mason?
6. Using the spreadsheet find the list of languages from movies that are made in Mexico.
This list should not have duplicates.
8. What is the year of the earliest movie not made in the USA?
9. Return a list of actors that are in movies that have German as a language. This list
should include first and last names, be alphabetized on the last names, and have no
duplicates.
11. Which actor has the most number of languages associated with their movies?
Chapter 29
There are several options for storing data in a database. The website http://db-engines.
com/en/ranking lists almost 300 different database products according to their popularity.
This chapter will review just three of these as they are viable products for the following
chapters. For each there will be examples on how to load the data, perform the queries
and transfer the data to another program such as a word processor. Creating queries to
perform specified tasks is reserved for the following chapters.
The previous chapter explored the use of a spreadsheet for storing data and performing
queries. For small data sets and simple queries a spreadsheet offers a good platform.
However, spreadsheets will falter as the requirements are increased. While spreadsheets
now allow multiple-user access through cloud services, they still lack access control. A
DBMS (database management system) can control what each user can read and what
each user can write. This includes controlling the access in different manners on the same
table. A DBMS is also capable of accessing data that is distributed among many servers.
For large or critical databases, distribution is essential.
For the chapters in this book, however, the most important advantage of a DBMS
over a spreadsheet is the ability to automate complicated queries. Excepting the last two
queries, all of the queries in the list in the previous chapter can be performed in a single
command.
In a spreadsheet all of the data is stored on pages, each a two-dimensional array of cells.
Databases hold to this philosophy by placing all data into tables. Each table has fields
which contain a single data type. These are similar to the columns used in the movie
database. A field has an associated data type so, for example, the movie grade can be
stored as an integer rather than a string. Each row is called a record or tuple.
Each table must also have a primary key. This is a field in which there are no
duplicated entries. In the case of the movie database the names of the movies could not
be used as a primary key because there are movies with exactly the same title. Likewise,
the years and grades of the movies could not be used as a primary key. As is common
practice, the primary key is an additional column with incrementing integers. This is the
first column in the table. All of the tables in this database use this same philosophy and
have a field of incrementing integers to be the primary key.
Designing a table for a database is important as an improperly designed table will
make queries difficult to construct and could slow the response time. Consider again the
movie database in which a single movie has a year, a grade, several actors, languages, and
countries in which it was filmed. Since a movie has a single name, year and grade these
items could be placed in the same table. The number of actors varies and the same actor
can appear in multiple movies. This is sufficient to require the actors to be contained in
a separate table. The same logic applies to the languages and countries.
Previous chapters explored the use of Python for manipulating information. If the
movie information were stored in Python then one might consider keeping all of the infor-
mation in a list such as [name, year, grade, [actors], [languages], [countries]
]. In this case lists are used to store the information about actors since the number of
actors varies. The rule of thumb is that if it is convenient to store the information in a
list in Python then a new table is needed when storing the information in a database.
The schema of a database is the set of tables and their connections. The schema for
the movie database is shown in Figure 29.1. Each table has a name and in the white boxes
are the names of the fields. The primary keys are denoted as well. The lines connecting
them show the fields that represent the same type of data. For example, both the actor
and isin tables have the actor’s ID. Both of these are labeled aid, but that is merely a
convenience. It is possible that the two fields could have different names but still
represent the actor’s ID. The line connecting them shows that these two fields represent
the same data. The schema does show that it is possible to travel from any table to any
other table although passing through intermediate tables may be required. In this manner,
the user can see that it is possible to create a query with data from one (or several) tables
and retrieve data from any other table.
There are many DBMS systems available, some of them free. The most common
are Oracle, MySQL, Microsoft SQL Server, and PostgreSQL. Some of these systems are
designed for industrial data sets while others are designed for personal uses. The three
systems that are reviewed here are sufficient for the rest of the chapters. These are
Microsoft Access, LibreOffice Base and MySQL. All of these products can host a database
or act as a client and connect to a server that contains the data. Furthermore, these three
products all use the MySQL language, so the following examples will work in any of these
environments.

Figure 29.1: The movie database schema.
An example query is shown in Code 29.1 which retrieves the information about the
movie that has the movie ID mid = 200. This query is used here to show how to access
data through the different products. Explanations concerning the components of queries
follow in the next two chapters.
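Code 29.1 itself is not reproduced in this text; based on the description, a minimal sketch of such a query (using the movie table from the schema) would be:

mysql> SELECT * FROM movie WHERE mid=200;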
The ensuing subsections show how to establish a table, upload data, submit a query
and copy the data for Microsoft Access, LibreOffice Base and MySQL.
29.3.1 Microsoft Access

Microsoft Access is a part of the Microsoft Office suite that manages a database. It has
the capability to manage a local database or connect with a database on a server. It
is a personal database manager with some limitations. There are versions of the Office
suite that work on Windows and OSx but not directly on UNIX platforms. There is
a 2 GB limit on the amount of data and a 32K limit on the number of objects in the
database.[Corp., 2016]
Access does have a graphical interface which is useful for non-expert database users.
While it does have many features only the basic steps are shown here which are sufficient
to load data and present a query. Users intending to use this product are encouraged to
read more detailed manuals to gain insight into the full capabilities of Microsoft Access.
When Access is started the user is presented with several choices as shown in Figure
29.2. In this case, a new blank database is started and so the first selection is used. One
major convenience of Access is that it can easily create a database by importing data from
Excel. In the following example, the movie data spreadsheet movies.xlsx is used. Figure
29.3 shows the selections to import this data.
A new dialog appears that offers the user choices on how to import the data as seen
in Figure 29.4. The first choice is to create a new table in the database which is the desired
path for this example. The second choice is useful later when data is to be added to a
database table.
The Excel spreadsheet has many pages and each one will be imported individually.
The next dialog that appears allows the user to select the page from the spreadsheet that
is to be imported. The following dialog allows the user to select if the first row contains
the column headings. In this case, the first row of the spreadsheet is the name of the
columns and so the box in the dialog should be checked. If the first row in the spreadsheet
page was the first row of data then this box would be left blank.
Data in a spreadsheet is usually considered to be a string or a double precision float.
The database, on the other hand, has many more data types that can be used, so the user
will need to intervene to select the correct data types for the imported data. The ensuing
dialog is shown in Figure 29.5, of which only the top portion is reproduced. In the movie
page of the spreadsheet there are four columns of data: mid, name, year and grade. The
data type for each of these needs to be established. The figure shows the selection for the
mid field. The user changes the Data Type to Integer as shown. The name column should
be a VARCHAR of length 100, and the other two columns are selected to be integers.
Every table in a database needs to have a primary key. The next dialog allows the
user to select if Access will create a primary key table or if the imported data has the
primary key. In this case, the mid field is the primary key and so the second option of
“Choose my own primary key” is selected and the user selects which field is to be used as
a primary key. The final selection is the name of the table in the database. The default
is that it will be the same name as the page in the spreadsheet. However, the user can
alter that choice. In this example, the names of the pages in the spreadsheet are also the
names of the tables in the database and so the default values are used.

Figure 29.3: Importing from Excel.

Figure 29.4: Importing choices.
This concludes the intervention required to import the data from the movie page to
the database. The process needs to be repeated for every page in the spreadsheet. Once
all of the data is uploaded the user should save the file in Access. This is a single file that
can be copied to other computers and a double click on the file icon will start Access and
load the data.
After all of the data is loaded, it is possible to create queries. Figure 29.6 shows the
query selection window, which is a graphical interface for creating a query: the user
makes selections and Access converts the selections into a MySQL query. In this case, only
four of the tables have been loaded and the user can select the tables to use. The fields
beneath can be filled in to create a query. However, this process is slow and it is much
more efficient to just write the MySQL query command.
At the top of the main window there are several tabs of which one of them is the
Query tab. A right-click on this tab brings forth a small menu as shown in Figure 29.7.
The last selection is SQL View which converts the screen to a window where the user
can type in the command directly. The user can then enter the MySQL command in the
window as shown in Figure 29.8.
When the query is executed it returns a table with the response. This is a simple
table format and the data can be painted with the mouse, copied and pasted into a Word
document or an Excel spreadsheet.
While Access has many functions, the ones shown here are the basics on how to
load data from a spreadsheet and perform a query. Users interested in using this product
should invest in reading other resources to learn the capabilities of Microsoft Access.
Figure 29.7: Converting to the MySQL command view.
29.3.2 LibreOffice Base
Another choice for a personal database manager is LibreOffice Base. It is similar to Access
in that it provides the ability to host a database on the local computer or access one on
another machine. Some of the advantages of Base are that it is freely available with the
LibreOffice suite and it runs on UNIX as well as Windows and OSx.
Some installations of LibreOffice Base return an error indicating that the user needs
to install JRE (Java Runtime Environment). This is an unfortunate error as the solution
is slightly different from what the error indicates. There are two parts to the solution. The first
part is that the user needs to have JDK (Java Development Kit) installed. Furthermore,
this needs to be the 32-bit version of JDK as LibreOffice is a 32-bit program. The second
part of the solution is that LibreOffice needs to be connected to the JDK. A computer may
have more than one installation of JDK and so it is necessary to select the correct version.
In the Base program the user selects Tools:Options. A new dialog appears and on the left
the user selects LibreOffice:Advanced. In the Vendor panel the user can connect to any
of the JDK systems that are installed. Once connected to the 32-bit version LibreOffice
needs to be restarted.
The initial dialog that appears after starting the program asks the user if they are
starting a new database or connecting to an existing one. Once again the “Create a new
database” selection is chosen. This leads to the next dialog which asks the user to register
the database. Following this is a dialog where the user decides on the name and location
of the file that will be saved. This file will be the database with the extension odb and can
be copied and used on other machines that have LibreOffice installed.
The next window that the user sees is the main interface as shown in Figure 29.9.
Initially, the Tables frame is empty. To load a table the user opens the spreadsheet that
contains the data. There the data to be loaded is painted and copied to the clipboard.
Then the user goes to the database dialog and right clicks on an empty space in the Tables
frame. There are several options and the one to select is Paste.
After Paste is selected the Copy Table dialog appears as shown in Figure 29.10.
Here the user selects the name of the table to be created in the database and if the first
line of the data is to be the field names in the database. In this case, the movie table is
being imported. The user selects the option to use the first line as column names if the
column names were included in the copied data.
The next dialog allows the user to select which columns in the spreadsheet are to be
copied into the database. In this case, all of them are and so the >> is selected and all of
the entries in the left pane are moved to the right pane. This is shown in Figure 29.11.
Figure 29.12 shows the next dialog in which the user defines each of the fields. In
this image the mid field is changed to the Integer data type. Before the user moves on to
the next dialog, the data type for all four of the fields needs to be set.
This is sufficient to upload the data. The user will be asked to automatically set a
primary key column which in this case is rejected since the mid data is being uploaded. To
set the primary key, the user right clicks on the movie icon in the database Table window.
Then the user selects the row to be set by a right click on the gray box to the left of the
field name as shown in Figure 29.13. Now the primary key is set and the user can then
repeat the process for the other pages in the spreadsheet.

Figure 29.9: The initial dialog.

Figure 29.12: Setting the data types.
Once the process has been applied to all of the tables in the spreadsheet the main
dialog appears like the image in Figure 29.14. Now the user is ready to generate a query.
This starts with the selection of the Queries button in the Database panel on the left.
The choices in the Tasks panel change and the last one creates a window for the
user to enter the MySQL command directly. Once the command is entered, the
query returns results in a table. Unfortunately, moving the results to a word processor
is not as easy as copy and paste. Figure 29.15 shows that the user selects the first and
last gray boxes in the left column to paint the rows of data to be copied. Then the user
right clicks on one of those gray boxes to get a popup menu that has a copy option. The
data can be copied into a spreadsheet but not directly into a word processor; however, it is
possible to copy from the spreadsheet to the word processor.
Figure 29.14: The main dialog.
29.3.3 MySQL
Successful access to the MySQL system will be rewarded with the prompt mysql>.
Now the system is ready to receive a query command.
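The login step precedes this prompt; on a typical installation the client is started from the command line in a form like the following, where the user name is a placeholder:

mysql -u username -p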
The next step is to create the tables and upload the data. Every user has a set of
privileges, and it is possible that a user may not have privileges to upload data to the
database. The MySQL administrator can change these privileges or find other avenues to
upload the data.
Assuming that the user has sufficient privileges to create tables and to upload data
then the following steps will load data from a spreadsheet to the MySQL database. There
are many other methods to upload data. The first step is to convert each page in the
spreadsheet to a tab delimited CSV file. The second step is to open a command line shell
and change the directory to the same directory where these CSV files reside. Lines 1 and
2 in Code 29.3 create a new table named movie. Inside the parentheses are the details
of the four fields. The first is the mid which is an integer and also the primary key. It is
also set for automatic increments. This means that each time a new entry is added to the
table the value of mid is one more than the previous value. Thus, it is not necessary to
enter the values of mid when the data is entered.
Code 29.3 Creating a table in MySQL.
1 mysql> CREATE TABLE movie (mid INTEGER AUTO_INCREMENT PRIMARY KEY,
2 name VARCHAR(100), year INTEGER, grade INTEGER);
3 mysql> DROP TABLE movie;

It is also possible that an error can exist in the creation of the table. There is no
control-Z in MySQL. One option of correcting a disastrous error is to start over. This
requires the destruction of the table, which is performed in line 3. Then the correct
command for the creation of the table can be entered. Each command in MySQL is
followed by a semicolon. Failure to include this is not disastrous as MySQL will simply
provide a prompt waiting for the user to complete the command with a semicolon.
Code 29.4 shows the command that will upload a CSV file into an existing table.
The two variables that the user needs to adjust are the name of the CSV file (which in this
example is movies.csv) and the name of the table where the data will be inserted (which
in this case is movies).
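Code 29.4 is not reproduced here; a command of the following form performs such an upload (LOAD DATA expects tab-delimited fields by default, which matches the files created above):

mysql> LOAD DATA LOCAL INFILE 'movies.csv' INTO TABLE movies;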
The process needs to be performed for all pages in the spreadsheet. The user needs
to create the table and then upload the data. This process uses several commands and the
command line interface is not very friendly. A good option is to copy successful MySQL
commands to a text editor. This will allow the user to employ the text editor tools to
create new commands. These, then, can be copied to the command line for execution.
There are many ways to insert data into a table and some of these will be reviewed
in later chapters. However, there is a global alternative that uses the UNIX command
mysqldump. This program is run from the UNIX command line instead of the MySQL
command line. This command can dump an entire database into a text file as shown in
line 1 of Code 29.5. This command will dump the database named database into a text
file named dumpfile.sql. Line 2 is used to load the database stored in this file back into
MySQL. The file dumpfile.sql is a text file and so it can be transferred from one machine
to another. If the file is already available, then the user can use line 2 to create the tables
and load the database.
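The two lines described would look like this sketch, run from the UNIX shell (the user and database names are placeholders):

mysqldump -u user -p database > dumpfile.sql
mysql -u user -p database < dumpfile.sql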
A query is executed through the MySQL command as shown in Code 29.6. Line 1
is the same MySQL command used in the previous examples. The rest is the response
returned by MySQL which can be copied from the command line and pasted to a word
processor.
The command line interface is very basic and users may prefer a graphical interface.
MySQL Workbench is a freely available tool that provides an excellent graphical front
end to the MySQL database.
29.4 Summary
There are many different DBMS available. Tools that are suitable for the rest of the
chapters are Microsoft Access, LibreOffice Base and MySQL. The latter two are available
without cost. Any of these products is suitable for personal databases and each uses the
MySQL command language.
Chapter 30
Fundamental Commands
This chapter will review some of the fundamental MySQL commands that manipulate and
retrieve data from a single table. This includes commands to upload data and to retrieve
answers from queries. Commands that use multiple tables are discussed in Chapter 31.
As the commands are reviewed the appropriate queries from the list in Chapter 28 will be
revealed.
Code 29.4 showed a method of uploading an entire tab delimited file into a table. This
section will review methods of appending to a table and altering features of a table. The
first few commands are used to set up a database. These are followed by commands to
set up tables and to populate the tables.
30.1.1 Creating a Database

A user may have several databases within a DBMS. The movies and actors examples use
the movie database, but it is quite possible to generate other databases. The creation of
a database is performed by the CREATE DATABASE command shown in line 1 of Code
30.1. Line 2 selects which database will be used in the subsequent queries.
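Code 30.1 is not reproduced; following the description, the two lines are of this form (the database name is an assumption):

mysql> CREATE DATABASE movies;
mysql> USE movies;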
30.1.2 Creating a Table
The creation of tables is performed with the CREATE TABLE command as shown in Code
30.2. In this example the name of the table is movies. Following that is text inside of
parentheses that defines the attributes (or columns of the table). Each column gets a
name and a data type. One of the attributes must be defined as the primary key. The
AUTO_INCREMENT keyword indicates that this particular attribute will increment with
each entry. In the first tuple this entry is 1, in the second tuple this entry is 2, and so on.
This is automatic which means that the user will not have to insert data for this attribute.
The VARCHAR(100) datatype indicates that name is a string that can have up to 100
characters.
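Code 30.2 is not reproduced here; it would mirror the command already shown in Code 29.3:

mysql> CREATE TABLE movies (mid INTEGER AUTO_INCREMENT PRIMARY KEY,
       name VARCHAR(100), year INTEGER, grade INTEGER);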
The SHOW TABLES command displays the individual tables within a database. The
example in Code 30.3 is performed after all three tables are created.
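Code 30.3 itself is not reproduced; the command is simply:

mysql> SHOW TABLES;

The DESCRIBE command in Code 30.4 then displays the fields within a single table.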
Code 30.4 Describing a table.
1 mysql> DESCRIBE movies;
2 +-------+--------------+------+-----+---------+----------------+
3 | Field | Type | Null | Key | Default | Extra |
4 +-------+--------------+------+-----+---------+----------------+
5 | mid | int(11) | NO | PRI | NULL | auto_increment |
6 | name | varchar(100) | YES | | NULL | |
7 | year | int(11) | YES | | NULL | |
8 | grade | int(11) | YES | | NULL | |
9 +-------+--------------+------+-----+---------+----------------+
A single row of data is inserted into the database using the INSERT command. The user
can select which columns are being used. The first entry in the movies table is “A
Face in the Crowd” with a grade of 9. Even though the movie was released in 1957 this
information is not included in this command. Again, since the column mid is an automated
column the user does not supply information for it.
The command to insert this data is shown in Code 30.6. The INSERT INTO command
will add a row to the table. The first set of parentheses indicates which columns are being
supplied with data. This is followed by the keyword VALUES. The second set of parentheses
supplies the data. In this case the name of the movie is a string and is thus enclosed in
quotes. Furthermore, since this is data, capitalization is maintained unlike keywords.
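Code 30.6 is not shown here; per the description, the command takes this form:

mysql> INSERT INTO movies (name, grade) VALUES ('A Face in the Crowd', 9);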
This command does insert data at the end of the table. It is not advisable to insert data
in the middle of the table because it will alter the correlation between tuples and keys.
For example, with the auto incrementing key for the movies table each new movie gets a
unique key. If in this example, “Star Wars” was the next movie added then its mid would
be 2. However, if later “Key Largo” were to be inserted above “Star Wars” then the mid
for “Star Wars” would be changed to 3. This will cause serious problems in the isin table
as now all of the entries for mid that are 2 and greater would need to be altered. So, the
rule of thumb is that data is added to the table at the end.
It is also possible to insert more than one row at a time. Code 30.7 uploads two
rows of data. In this case, three columns of data will be used. Following VALUES there
are two sets of parentheses which each supply a single row of data. The number of rows
that can be inserted is not strictly limited. There are two immediate caveats. The first is
that all entries must have the same number of columns and the second is that the total
length of the INSERT INTO command is limited. The latter comes into play if
many rows are uploaded in a single command.
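Code 30.7 is not reproduced; a two-row insert follows the same pattern, with each parenthesized group after VALUES supplying one row (the titles and values here are placeholders):

mysql> INSERT INTO movies (name, year, grade)
       VALUES ('First Title', 1977, 8), ('Second Title', 1948, 7);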
30.2 Updating
Once a table has been created it can be modified. Columns can be added and removed.
The data type of the columns can be modified, but this may also be incompatible with
previously stored data. The ALTER command is used for all table modifications. Code 30.8
shows just two of the many possible uses. Line 1 creates a new column newcol for the
table table. Line 2 changes this column to a BIGINT data type.
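Code 30.8 is not reproduced; the two lines described would be of this form (the initial data type of newcol is an assumption, and the name table is quoted with backticks since it is a reserved word):

mysql> ALTER TABLE `table` ADD COLUMN newcol INTEGER;
mysql> ALTER TABLE `table` MODIFY newcol BIGINT;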
Many other uses include renaming the table or columns, altering the key columns,
managing memory, etc.
One example: in the first version of the database the title of the movie “Nurse Betty”
was misspelled. Correction was achieved with the UPDATE command as shown in Code
30.9.
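Code 30.9 is not shown; an UPDATE of this kind takes the following form, where the misspelled title is a made-up placeholder:

mysql> UPDATE movies SET name='Nurse Betty' WHERE name='Nurse Bety';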
30.3 Privileges
The creator of the database has the option of limiting access to the data. Limitations
include blocking access to certain tables or even specific columns. Access can be controlled
differently so that some users can add data and others can only read data. These privileges
are controlled through the GRANT command.
Like most commands in MySQL, there are a myriad of options which are too numerous
to list. Code 30.10 shows just a few of these commands. In Line 1, all privileges are
assigned for all databases to all users.
Line 2 indicates which commands are available to all users. Line 3 assigns commands
to all users for the Movies database. Line 4 assigns the privilege of SELECT to one column
and INSERT to a second column for the table actors. Line 5 grants privileges to all users
only if they are logged into the host computer. Line 6 eliminates the user named “badboy”.
Line 7 grants privileges to Bill but requires Bill to use the password “mypass”.
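Code 30.10 is not reproduced here. As a sketch only, GRANT statements of the kinds described look like the following; the user names, column names and password are placeholders, and the exact syntax varies with the MySQL version:

mysql> GRANT ALL PRIVILEGES ON *.* TO 'someuser'@'%';
mysql> GRANT SELECT, INSERT ON movies.* TO 'someuser'@'%';
mysql> GRANT SELECT (firstname), INSERT (lastname) ON movies.actors TO 'someuser'@'%';
mysql> DROP USER 'badboy'@'%';
mysql> GRANT ALL ON *.* TO 'bill'@'localhost' IDENTIFIED BY 'mypass';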
In the MySQL language every command must end with a semicolon. This allows the user
to write a command that extends multiple lines with each line ending with a typed newline
character. In that fashion, long commands can be typed in an organized manner that is
easier for the user to read. Convention is that MySQL keywords are typed as capital
letters and the user defined fields and variables are typed in lowercase. This is merely
a convenience for the human reader as MySQL does not distinguish between upper and
lowercase commands. Some of the example queries that follow will return long answers
and only the first few rows are printed here.
Finally, the query language shown here is for MySQL. Users of LibreOffice Base
or Microsoft Access may find that they need to make some minor changes to appease the
dialect of their engine. A couple of notable changes are that some of the field names are
also MySQL keywords. For example, the word year is a field name in the movies table
and also a keyword. If the user is referring to the field name then it may be necessary
to enclose the word in quotes, as in SELECT "year" FROM movies. Another item is that
division of two integers in MySQL returns a float. In LO Base it returns an integer. So, it
is necessary to convert an integer to a float using the CONVERT command, as in SELECT
AVG(CONVERT(grade,float)) FROM movies.
The basic query is of the form SELECT field FROM table WHERE condition. The
SELECT field defines the data fields that will be returned. The FROM table defines which
table is being used. In this chapter, the queries will use only a single table. Queries
with multiple tables are reviewed in Chapter 31. The WHERE condition defines which
records will be returned. Without this part of the command the query would return all of
the data from the table.
Consider again Query Q1 from Section 28.2. This seeks the name of the movie with
mid = 200. Code 30.11 shows the basic command that selects the name of the movie from
the table named movies for only the film that has mid = 200.
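Code 30.11 is not reproduced; per the description it is simply:

mysql> SELECT name FROM movies WHERE mid=200;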
Query Q2 seeks movies released in the year 1955. The query is shown in Code 30.12.
This command is similar to Code 30.11 except that the condition is changed. As seen the
query returned four movies that fit this condition.
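Code 30.12 is likewise not shown; changing the condition gives:

mysql> SELECT name FROM movies WHERE year=1955;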
30.4 Data Types

The three basic data types are numbers, dates, and strings. The first, of course,
represents numerical data and the last represents textual data. Databases, particularly in
commerce, also rely heavily on dates and times. Thus, there are data types specifically
dedicated to the representation of time and dates.
30.4.1 Numbers
The most common types of numerical data are integers, decimals and floating point values.
Integers are whole numbers such as ... -2, -1, 0, 1, 2 ... The decimals are non-integers with
a dedicated precision. The number of digits before and/or after the decimal place is set
by the user. These types of numbers are useful for currency, which has a finite precision
of 1 cent. Floating point numbers are real numbers.
30.4.1.1 Integers
Even within the class of integers there are several different types. These are listed in Table
30.1 and differ in their precision and thus range of values. Integers with a small range
consume fewer bytes. For small applications this is not really a concern, but in extremely
large databases the users must also manage their consumption of disk space.
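Table 30.1 is not reproduced in this text; as a reconstruction from the standard MySQL types, the integer variants and their storage sizes are:

Type Storage
TINYINT 1 byte
SMALLINT 2 bytes
MEDIUMINT 3 bytes
INT 4 bytes
BIGINT 8 bytes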
Each integer has a signed and an unsigned version. The signed versions use one
bit to represent the sign and thus have one less bit to represent the value. Thus, the
maximum value is just under half that of the unsigned versions.
It would seem that the INT type would suffice for most applications as the max-
imum value is over 4 billion. Paris japonica is a plant that sports the largest known
genome,[wikipedia, 2016] with 150 billion bases. So, the INT data type would not be able
to precisely represent the number of bases in a single plant.
30.4.1.2 Decimals
The NUMERIC or DECIMAL data types define a decimal number with a defined number of
digits before and after the decimal point. These are used in cases in which some precision
is required such as in currency. The floating point type (next topic) can induce bit errors
thus presenting $0.01 as $0.0099999. This is not acceptable and in such cases a DECIMAL
or NUMERIC type should be used.
The syntax is myvar DECIMAL(m, d) where m is the total number of digits and d is
the number of digits after the decimal point.
30.4.1.3 Floating Point Values

A FLOAT or REAL is a generic floating point variable stored in 4 bytes. The DOUBLE or
DOUBLE PRECISION data type uses 8 bytes. It is possible to declare the number of digits
in a manner similar to DECIMAL, as in FLOAT(m, d).
30.4.1.4 Bit
The BIT data type stores a specified number of bits, as in myvar BIT(m). This data type is
useful for cases in which a small number of values are used. For example, if a variable were
to only have the values 0, 1, 2, or 3 then a BIT(2) type would be far more efficient. Even if
the database is small in size this data type can be useful as it would prevent the variable
from assuming a value outside of the range.
30.4.2 Default Values

A default value is assigned to an attribute as the data is being loaded into the database.
The user, of course, can override the default value. This is an optional argument. An
example is shown in Code 30.13.
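Code 30.13 is not reproduced; declaring a default takes this form (the table and the default value are placeholders):

mysql> CREATE TABLE t (grade INTEGER DEFAULT 5);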
30.4.3 Dates
Date and time information can be stored in different formats, which are shown in Table
30.2.

Table 30.2: Date and time.
Type Format
DATE ’YYYY-MM-DD’
DATETIME ’YYYY-MM-DD HH:MM:SS’
TIME ’HH:MM:SS’
YEAR(2) or YEAR(4) Year with specified digits
30.4.4 Strings
A string is a collection of characters and MySQL offers many different types of strings
since their uses are so varied.
The CHAR(m) data type allocates memory for m characters even if the input data
does not actually need all m characters.
The VARCHAR(m) type allows for up to m characters but does not consume all of the
m bytes if the data is shorter.
The BINARY(m) and VARBINARY(m) data types are similar to CHAR and VARCHAR
except that the data is considered to be binary instead of text characters.
A BLOB is similar to the BINARY in that it stores a byte string without regard to the
ASCII representation of the data. There are four types: TINYBLOB, BLOB, MEDIUMBLOB,
and LONGBLOB which can store lengths of 2^8-1, 2^16-1, 2^24-1, and 2^32-1 bytes respectively.
The TEXT data type stores long textual strings and comes in four similar types:
TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT which can store lengths of 2^8-1, 2^16-1, 2^24-1,
and 2^32-1 bytes respectively. Thus, a TEXT can store up to 64 kilobytes, a MEDIUMTEXT
can store up to 16 megabytes, and a LONGTEXT can store up to 4 gigabytes.
An enumeration is a collection of string objects from which the attribute can assume a
value. Basically, the attribute can only be one of the members of the enumeration. A
creation of a table with an enumeration would be of the form shown in Code 30.14.
Here the variable sizes can only be one of three strings.
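Code 30.14 is not reproduced; a sketch matching the description, with the table name and the three strings as placeholders, would be:

mysql> CREATE TABLE shirts (sizes ENUM('small','medium','large'));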
A SET can have zero or more members and the maximum number of unique strings
is 64.
MySQL has data types that correspond to OpenGIS classes. Some of these types hold
single geometry values:
GEOMETRY
POINT
LINESTRING
POLYGON
GEOMETRY can store geometry values of any type. The other single-value types
(POINT, LINESTRING, and POLYGON) restrict their values to a particular geometry type.
Table 30.3: Converting data.
Type Format
BINARY Convert to binary
CAST Converts to specified type
CONVERT Converts to specified type
30.5 Conversions
Data can be stored as one type but during the retrieval can be converted to another type.
The three major functions are shown in Table 30.3.
The example shown in Code 30.15 retrieves the mid from the table movies and
converts this integer into a decimal for display. This does not change the stored data but
just the format of the retrieved data.
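Code 30.15 is not shown; a matching sketch is:

mysql> SELECT CONVERT(mid, DECIMAL) FROM movies;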
MySQL contains many functions that facilitate the construction of queries. Queries can
include mathematical processing of both the query conditions and the query response.
Basic math functions are available in MySQL. Standard mathematical operators are shown
in Table 30.4
An example is shown in Code 30.16 where the returned grade of the movie is mul-
tiplied by 2. In this case the original grade was 6 and the returned answer is 12.
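Code 30.16 is not reproduced; the doubling described would be written as follows, where the specific mid is an assumption:

mysql> SELECT 2*grade FROM movies WHERE mid=1;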
Table 30.4: Math operators.
Type Description
DIV Integer division
/ Division
- Subtraction
% or MOD Modulus
+ Addition
* Multiplication
- Change sign
Table 30.5 shows the math functions which operate on the returned value or the arguments
used in WHERE statements.
Code 30.17 shows a case in which the math function is applied to the argument rather than
the returned value. In this case the input value is 4.5 but is rounded off and converted to
an integer. This is used as the mid, and the mid and title of the movie are returned.
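Code 30.17 itself is not shown; a sketch of such a query is:

mysql> SELECT mid, name FROM movies WHERE mid=ROUND(4.5);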
30.6.3 Operators
Table 30.5: Math functions.
30.6.4 Hierarchy
It is possible for a MySQL command to have multiple operators. The following list
depicts the operators and the order in which they are executed within a command.
1. INTERVAL
2. BINARY, COLLATE
3. !
4. - (unary minus), ~ (unary bit inversion)
5. ˆ
6. *, /, DIV, %, MOD
7. -, +
8. <<, >>
9. &
10. |
11. = (comparison), <=>, >=, >, <=, <, <>, !=, IS, LIKE, REGEXP, IN
12. BETWEEN, CASE, WHEN, THEN, ELSE
13. NOT
14. AND, &&
15. XOR
16. OR, ||
17. = (assignment), :=
For example, if an equation was 5 + 6 * 3 the hierarchy as shown would execute the
multiplication before the addition, so the answer is 23. The above list shows all of the
operators in the order in which they will be executed. If the user wishes to change
the order then parentheses are employed. If the desire is for the addition to be performed
first then the user should use (5 + 6) * 3 which produces the answer of 33.
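Checked directly in the client, the two orderings give:

mysql> SELECT 5+6*3;    -- returns 23
mysql> SELECT (5+6)*3;  -- returns 33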
The aggregation functions are shown in Table 30.7. There are commands for simple
mathematical information such as an average or standard deviation.

Table 30.7: Aggregate functions.
Type Description
AVG() Return the average value of the argument
BIT_AND() Return bitwise and
BIT_OR() Return bitwise or
BIT_XOR() Return bitwise xor
COUNT(DISTINCT) Return the count of a number of different values
COUNT() Return a count of the number of rows returned
GROUP_CONCAT() Return a concatenated string
MAX() Return the maximum value
MIN() Return the minimum value
STD() Return the population standard deviation
STDDEV_POP() Return the population standard deviation
STDDEV_SAMP() Return the sample standard deviation
STDDEV() Return the population standard deviation
SUM() Return the sum
VAR_POP() Return the population standard variance
VAR_SAMP() Return the sample variance
VARIANCE() Return the population standard variance
The tools in the previous tables are useful for several of the queries from the list in Section
28.2. Query Q3 seeks the list of movies with a grade above a certain value. The query
and the first five returned movies are shown in Code 30.18. The condition for the grade
is that it is greater than or equal to a given value.
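Code 30.18 is not reproduced; with an assumed threshold of 8 the query would be:

mysql> SELECT name FROM movies WHERE grade>=8;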
Query Q4 seeks the movies in a decade which means that there is an upper and
lower limit on the year. Code 30.19 shows two possible commands that return the same
results. The first four results are also shown. In either case, the movie must have a year
between 1950 and 1959 (inclusive) to be returned by this query.
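Code 30.19 is not shown; two equivalent forms are:

mysql> SELECT name FROM movies WHERE year>=1950 AND year<=1959;
mysql> SELECT name FROM movies WHERE year BETWEEN 1950 AND 1959;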
The DISTINCT is used to display each returned answer only once. Query Q5 seeks
the years that contained movies with the worst grade of 1. However, there may be some
years that have more than one movie with that grade. The goal is to return that year
only once. The query is shown in Code 30.20. There is a movie that does not have a year
assigned to it and the requirement that year>1900 excludes this movie. The years are
returned and there are no duplicates. There also seems to be no particular order in which
they are returned. The first movie in the database with a grade of 1 is from the year 2007,
and thus it is the first year shown.
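Code 30.20 is not reproduced; a sketch following the description is:

mysql> SELECT DISTINCT year FROM movies WHERE grade=1 AND year>1900;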
Query Q6 seeks the number of actors from the movie with mid = 200. This query
is not seeking the list of actors, just the number of actors. The command COUNT returns
the number of items returned. The number of entries in the isin table with the mid =
200 is the number of actors. Thus, the command in Code 30.21 returns the count of the
number of rows from this table that meet the condition.
Code 30.21 Returning the number of actors from a specified movie.
1 mysql> SELECT COUNT(aid) FROM isin WHERE mid=200;
2 +------------+
3 | COUNT(aid) |
4 +------------+
5 | 5 |
6 +------------+
Query Q7 is to return the average grade of the movies from the 1950’s. The ap-
propriate command to employ is AVG. A solution is shown in Code 30.22. It should be
noted that in some dialects of MySQL the average over integer values is returned as
an integer. The solution is to convert the data to floats before the average is computed, as
in AVG(CONVERT(grade,float)).
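Code 30.22 is not reproduced; the basic form would be:

mysql> SELECT AVG(grade) FROM movies WHERE year>=1950 AND year<=1959;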
As seen in Code 30.22 the function is listed in the heading over the results. In this
case that heading is not too long, but in other cases that employ multiple functions that
heading can intrude on the presentation of the results. The solution is to rename
the function with AS as shown in Code 30.23. This renaming actually has a much bigger
purpose. In more complicated queries it is possible that a phrase (such as AVG(grade))
is repeated in the query. The relabeling of that function allows the user to use the new
name throughout the query.
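Code 30.23 is not shown; the renaming looks like:

mysql> SELECT AVG(grade) AS g FROM movies WHERE year>=1950 AND year<=1959;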
Multiple functions are used in Query Q8 which seeks the average and standard
deviations of the length of the movie titles. A solution is shown in Code 30.24.
Code 30.24 Statistics on the length of the movie name.
1 mysql> SELECT AVG(LENGTH(title)), STD(LENGTH(title)) FROM movies;
2 +--------------------+--------------------+
3 | AVG(LENGTH(title)) | STD(LENGTH(title)) |
4 +--------------------+--------------------+
5 | 15.1025 | 8.070749268190655 |
6 +--------------------+--------------------+
There are numerous functions that apply to strings and the tables Table 30.8 through
Table 30.14 display them grouped by subcategories.
Type Description
ASCII Return numeric value of left-most character
BIT_LENGTH Return length of argument in bits
CHAR_LENGTH Return number of characters in argument
FORMAT Return a number formatted to specified number of decimal places
HEX Return a hexadecimal representation of a decimal or string value
LENGTH Return the length of a string in bytes
ORD Return character code for leftmost character of the argument
Type Description
FIELD() Return the index of the first argument in the subsequent arguments
LIKE Simple pattern matching
LOCATE() Return the position of the first occurrence of substring
MATCH Perform full-text search
NOT LIKE Negation of simple pattern matching
POSITION() Synonym for LOCATE()
SOUNDEX() Return a soundex string
SOUNDS LIKE Compare sounds
STRCMP() Compare two strings
SUBSTRING_INDEX() Return a substring of a specified number of occurrences
Query Q9 seeks first names of the actors with the last name of Keaton. The condition
for equating a string is similar to that for a numerical value. Code 30.25 shows this
example.
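Code 30.25 is not reproduced; it is of the form:

mysql> SELECT firstname FROM actors WHERE lastname='Keaton';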
Table 30.10: Informative string operators
Type Description
BIN() Return a string containing binary representation of a number
CHAR() Return the character for each integer passed
ELT() Return string at index number
FIND_IN_SET() Return the index position of the first argument within the second argument
INSTR() Return the index of the first occurrence of substring
OCT() Return a string containing octal representation of a number
UNHEX() Return a string containing hex representation of a number
Type Description
LEFT() Return the leftmost number of characters as specified
MID() Return a substring starting from the specified position
LTRIM() Remove leading spaces
RTRIM() Remove trailing spaces
RIGHT() Return the specified rightmost number of characters
SUBSTR() Return the substring as specified
SUBSTRING() Return the substring as specified
TRIM() Remove leading and trailing spaces
Type Description
LCASE() Synonym for LOWER()
LOWER() Return the argument in lowercase
UCASE() Synonym for UPPER()
UPPER() Convert to uppercase
Type Description
CONCAT_WS() Return concatenate with separator
CONCAT() Return concatenated string
EXPORT_SET() Return a string such that for every bit set in the value bits
INSERT() Insert a substring at the specified position up to the specified number of characters
LPAD() Return the string argument, left-padded with the specified string
MAKE_SET() Return a set of comma-separated strings that have the corresponding bit in bits set
REPEAT() Repeat a string the specified number of times
REPLACE() Replace occurrences of a specified string
REVERSE() Reverse the characters in a string
RPAD() Append string the specified number of times
SPACE() Return a string of the specified number of spaces
Table 30.14: Miscellaneous operators.
Type Description
LOAD_FILE() Load the named file
NOT REGEXP Negation of REGEXP
QUOTE() Escape the argument for use in an SQL statement
REGEXP Pattern matching using regular expressions
RLIKE Synonym for REGEXP
Query Q10 seeks actors who have “John” in their first name. In this case the first
name is not necessary, just those four letters. The LIKE function uses wild cards to
search for a sequence of letters embedded in a text entry. The percent sign is used for an
undetermined number of letters and an underscore is used for a single letter. To find a
first name with any number of letters before and after “John” the percent signs are used
as shown in Code 30.26.
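Code 30.26 is not reproduced; the wildcard form is:

mysql> SELECT firstname, lastname FROM actors WHERE firstname LIKE '%John%';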
Query Q11 seeks the actors who have two parts in their first name. These two parts
are separated by a single space and so the equivalent search is to find the first names with
a blank space. It is possible to search on a blank space in between two percent signs as
in “% %”. However, this would also include entries that begin or end with a blank space.
Code 30.27 shows a better search which uses the underscores to ensure that there is at
least one character before and one character after the blank space. Combined with the
percent signs this search finds names that have one or more letters before and after the
blank space.
Code 30.27 Finding the actors with two parts to their first name.
1 mysql> SELECT firstname,lastname FROM actors WHERE firstname LIKE ’%_ _%’;
2 +----------------+----------+
3 | firstname | lastname |
4 +----------------+----------+
5 | F. Murray | Abraham |
6 | James (Jimmy) | Stewart |
7 | Michael J. | Fox |
8 | M. Emmet | Walsh |
Query Q12 returns the actors that have matching initials in their names. The
SUBSTR function extracts a substring from a string. The function has three arguments, which
are the string, the starting location of the extraction, and the length of the extraction.
The first initial, then, is the substring that starts at location 1 and has a length of 1. Code
30.28 shows Query Q12 which is to find the actors that have matching initials. Basically,
the first letter of the first name must be the same as the first letter of the last name.
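Code 30.28 is not shown; a matching sketch is:

mysql> SELECT firstname, lastname FROM actors
       WHERE SUBSTR(firstname,1,1)=SUBSTR(lastname,1,1);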
Queries can return a large number of rows and the user needs to see only a few. One
example would be to find the best record according to a criterion. The query could sort all
of the data, but the user needs to see only the top few returns. Control of the number of
records returned is managed by the LIMIT function. Code 30.29 shows a simple example
that returns just three of the actors with the first name of John. These are the first three
that are stored in the database.

Code 30.29 Example of the LIMIT function.
1 mysql> SELECT lastname FROM actors
2 WHERE firstname=’John’ LIMIT 3;
3 +----------+
4 | lastname |
5 +----------+
6 | Belushi |
7 | Turturro |
8 | Candy |
9 +----------+
Sorting is controlled by the ORDER BY command. This identifies which field is to
be used in sorting. If the data is text then the sort is alphabetical. If the data is numeric
then the data is sorted by value. The keywords DESC and ASC are used to indicate if
the sorting is from high to low or low to high, with the latter being the default.
Query Q13 is to list the actor’s last names in alphabetical order for those actors
whose first name is John. Code 30.30 shows the result in which line 2 defines the search
conditions and line 3 sorts the data. Without line 3 the data is returned by the order in
which it was entered into the database. To reverse the order of the returned answer the
command would be changed to ORDER BY lastname DESC.
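The listing itself is not reproduced; following the description it would be:

mysql> SELECT lastname FROM actors
       WHERE firstname='John'
       ORDER BY lastname;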
Query Q14 is to list the movies according to the length of their titles. The LENGTH
function returns the length of the string and it is on this function that the sort is to be
applied. The result is shown in Code 30.31. The sort would be from the smallest to the
largest values of the length by default, but the DESC command reverses that order and
only the first 5 are printed.

Code 30.31 The movies with the longest titles.
1 mysql> SELECT title FROM movies
2 ORDER BY LENGTH(title) DESC LIMIT 5;
3 +-------------------------------------------------------------------------+
4 | name |
5 +-------------------------------------------------------------------------+
6 | Everything You Always Wanted to Know About Sex * But Were Afraid to Ask |
7 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb |
8 | Pirates of the Caribbean: The Curse of the Black Pearl |
9 | Marilyn Hotchkiss’ Ballroom Dancing & Charm School |
10 | The Russians are Coming, the Russians are Coming |
11 +-------------------------------------------------------------------------+
Query Q15 is to sort the actors by the location of the substring ‘as’ in their first
name. This uses the LOCATE function to return the location of a substring within a
string. This function is used twice in this query. The first is to find the locations and the
second is to use that information as the sorting criteria. When a function is used twice
with the same arguments it is both convenient and efficient to rename that application of
the function with the AS command. The query is shown in Code 30.32. The location of
the target substring is shown in line 1 and renamed as L. Then in line 2 the sorting is over
this same L.
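Code 30.32 is not reproduced; a sketch matching the description is:

mysql> SELECT firstname, LOCATE('as',firstname) AS L
       FROM actors ORDER BY L;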
30.9 Grouping
Grouping data in MySQL is the act of collecting data according to a certain criteria.
Consider the case of Query Q16 which is to compute the average movie grade for each
year. Each year can have several movies and so the data needs to be collected by the year.
This action is quite similar to a nested for loop. If this function were to be written
in Python then the user would create a for loop over each year and then collect that data
for that year in a nested for loop.
In MySQL the GROUP BY command is used to collect data. This is used over the
same variable that is the first for loop in the Python example. Query Q16 is shown in
Code 30.33. The values to be returned are the year and average grade of the year. Line
3 uses the GROUP BY command to sort the data by the year. For each year, the average
grade is computed.
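Code 30.33 is not shown; the grouping form is:

mysql> SELECT year, AVG(grade) FROM movies GROUP BY year;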
Query Q17 is to also sort the data from the best year to the worst according to this
average grade. The GROUP BY command is used to collect the data by years and the ORDER
BY command is to change the order of the answer. The query is shown in Code 30.34.
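Code 30.34 is not reproduced; adding the ordering gives:

mysql> SELECT year, AVG(grade) AS g FROM movies
       GROUP BY year ORDER BY g DESC;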
This command works well, but includes years that have just a few movies. It is not
really fair to compare the movies of 1948 to other years if 1948 has only one movie. So,
Query Q18 adds the restriction that there must be at least five movies in a year or it is
not considered.
This is the same as putting an if statement inside of the for loop in Python. The
MySQL command is GROUP BY ... HAVING, where the HAVING command acts as the if
statement. The example is shown in Code 30.35 where the condition is that there must be
more than 5 movies. The COUNT function is applied to the mid since it is the primary
key.
Code 30.35 Restricting the search to years with more than 5 movies.
1 mysql> SELECT year, AVG(grade) AS g FROM movies
2 GROUP BY year HAVING COUNT(mid)>5
3 ORDER BY g DESC;
4 +------+--------+
5 | year | g |
6 +------+--------+
7 | 1944 | 7.0000 |
8 | 1968 | 6.8750 |
9 | 1975 | 6.8333 |
10 | 2000 | 6.7500 |
11 | 2006 | 6.6579 |
30.10 Dates and Times

The functions for dates and times are numerous and are simply listed here.
FROM_UNIXTIME() Format UNIX timestamp as a date
GET_FORMAT() Return a date format string
HOUR() Extract the hour
LAST_DAY Return the last day of the month for the argument
LOCALTIME(), LOCALTIME Synonym for NOW()
LOCALTIMESTAMP, LOCALTIMESTAMP() Synonym for NOW()
MAKEDATE() Create a date from the year and day of year
MAKETIME() Create time from hour, minute, second
MICROSECOND() Return the microseconds from argument
MINUTE() Return the minute from the argument
MONTH() Return the month from the date passed
MONTHNAME() Return the name of the month
NOW() Return the current date and time
PERIOD_ADD() Add a period to a year-month
PERIOD_DIFF() Return the number of months between periods
QUARTER() Return the quarter from a date argument
SEC_TO_TIME() Converts seconds to ’HH:MM:SS’ format
SECOND() Return the second (0-59)
STR_TO_DATE() Convert a string to a date
SUBDATE() Synonym for DATE_SUB() when invoked with three arguments
SUBTIME() Subtract times
SYSDATE() Return the time at which the function executes
TIME_FORMAT() Format as time
TIME_TO_SEC() Return the argument converted to seconds
TIME() Extract the time portion of the expression passed
TIMEDIFF() Subtract time
TIMESTAMP() With a single argument, this function returns the date or datetime
expression; with two arguments, the sum of the arguments
TIMESTAMPADD() Add an interval to a datetime expression
TIMESTAMPDIFF() Subtract an interval from a datetime expression
TO_DAYS() Return the date argument converted to days
UNIX_TIMESTAMP() Return a UNIX timestamp
UTC_DATE() Return the current UTC date
UTC_TIME() Return the current UTC time
UTC_TIMESTAMP() Return the current UTC date and time
WEEK() Return the week number
WEEKDAY() Return the weekday index
WEEKOFYEAR() Return the calendar week of the date (0-53)
YEAR() Return the year
YEARWEEK() Return the year and week
Code 30.36 shows the simple example of retrieving the current date using the CURDATE() command. There is an equivalent command for retrieving the current time, CURTIME(), and another for both date and time, NOW(), as shown in Code 30.37.
Code 30.36 Using CURDATE.
1 mysql> SELECT CURDATE();
2 +------------+
3 | CURDATE() |
4 +------------+
5 | 2015-06-22 |
6 +------------+
7 1 row in set (0.09 sec)
30.11 Casting
Table 30.15 displays the casting operators that can change the type of data.
Type Description
BINARY Cast a string to a binary string
CAST() Cast a value as a certain type
CONVERT() Cast a value as a certain type
30.12 Decisions
Every language needs the ability to make decisions, and MySQL is no different. There are two types of decisions, the CASE and IF statements, with variants as shown in Table 30.16.
Code 30.38 Casting data types.
1 mysql> SELECT CAST(4 AS decimal);
2 +--------------------+
3 | CAST(4 AS decimal) |
4 +--------------------+
5 | 4.00 |
6 +--------------------+
7 1 row in set (0.05 sec)
Type Description
CASE Case operator
IF - ELSE If/else construct
IFNULL Null if/else construct
NULLIF Return NULL if expr1 = expr2
30.12.1 CASE-WHEN
The CASE operator compares an expression against a series of WHEN values and returns the matching THEN result, with an optional ELSE default; an example is shown in Code 30.39. MySQL also provides the IF function with the format
IF(expr1,expr2,expr3)
where expr2 is the output if expr1 is true and expr3 is the output if expr1 is false. This is very similar to the IF command format in a spreadsheet.
Code 30.39 Using CASE.
1 mysql> SELECT aid, lastname,
2 CASE lastname
3 WHEN ’Kelly’ THEN ’Irish’
4 WHEN ’Niven’ THEN ’English’
5 ELSE ’Dunno’
6 END AS ’Fun’
7 FROM actors WHERE firstname=’David’;
8 +-----+------------+---------+
9 | aid | lastname | Fun |
10 +-----+------------+---------+
11 | 225 | Niven | English |
12 | 244 | Bowie | Dunno |
13 | 339 | Suchet | Dunno |
14 | 486 | Carradine | Dunno |
15 | 519 | Keith | Dunno |
16 | 552 | Straithern | Dunno |
17 | 602 | Kelly | Irish |
18 +-----+------------+---------+
19 7 rows in set (0.03 sec)
The next example is to list the aid and last names of the actors whose first name is 'David'. If their aid is greater than 500 then print 'Late', otherwise print 'Early'. The result is shown in Code 30.40, and once again the AS command is used to alter the heading in the printout (line 7).
A related function is IFNULL, with the format
IFNULL(expr1,expr2)
which returns expr1 if expr1 is not NULL. If expr1 is NULL then expr2 is returned. Two examples are shown in Code 30.41.
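The logic of IFNULL is easy to mimic in Python, where None plays the role of NULL; the following sketch uses invented values.

# A Python analogue of IFNULL: the second value substitutes when the first is None.
def ifnull(expr1, expr2):
    return expr1 if expr1 is not None else expr2

print(ifnull(None, 'no value'))   # prints: no value
print(ifnull(7, 'no value'))      # prints: 7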
A full-text comparison is demonstrated through a few pieces of script, all of which are from [MyS, ]. The first, in Code 30.42, creates a table; attention should be drawn to line 5, which declares a FULLTEXT index over two of the user-defined fields.
Code 30.40 Using IF.
1 mysql> SELECT aid, lastname,
2 IF (aid>500,’Late’,’Early’)
3 AS period
4 FROM actors
5 WHERE firstname = ’David’;
6 +-----+------------+--------+
7 | aid | lastname | period |
8 +-----+------------+--------+
9 | 225 | Niven | Early |
10 | 244 | Bowie | Early |
11 | 339 | Suchet | Early |
12 | 486 | Carradine | Early |
13 | 519 | Keith | Late |
14 | 552 | Straithern | Late |
15 | 602 | Kelly | Late |
16 +-----+------------+--------+
17 7 rows in set (0.02 sec)
Code 30.42 The FULLTEXT operator.
1 mysql> CREATE TABLE articles (
2 id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
3 title VARCHAR(200),
4 body TEXT,
5 FULLTEXT (title,body)
6 ) ENGINE=MyISAM;
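Once the table is populated, a plain natural-language search can be run against the FULLTEXT index. A minimal sketch using the Python interface of Chapter 32 follows; the connection values are hypothetical placeholders.

import pymysql

# Hypothetical connection values; the pymysql interface is covered in Chapter 32.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='test')
c = db.cursor()
c.execute("SELECT * FROM articles WHERE MATCH (title,body) "
          "AGAINST ('database' IN NATURAL LANGUAGE MODE)")
for row in c.fetchall():
    print(row)
db.close()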
A natural language search can be broadened with the WITH QUERY EXPANSION modifier, as shown in Code 30.45. The search is again on the word 'database', but a second search is performed that uses words returned from the first search as the query in the second search. In this case the first search returned 'MySQL', which is responsible for the third item returned.
Code 30.45 Using QUERY-EXPANSION.
1 mysql> SELECT * FROM articles
2 -> WHERE MATCH (title,body)
3 -> AGAINST (’database’ WITH QUERY EXPANSION);
4 +----+-------------------+------------------------------------------+
5 | id | title | body |
6 +----+-------------------+------------------------------------------+
7 | 1 | MySQL Tutorial | DBMS stands for DataBase ... |
8 | 5 | MySQL vs. YourSQL | In the following database comparison ... |
9 | 3 | Optimizing MySQL | In this tutorial we will show ... |
10 +----+-------------------+------------------------------------------+
30.13 Problems
1. Write a single MySQL command that will return the name of the movie that has
mid = 300.
2. Write a single MySQL command that returns the highest grade of the two movies
with mid = 300 or mid = 301.
3. Write a single MySQL command that returns the lowest grade of movies from the
1960’s.
4. Write a single MySQL command that returns the names and years of the movie with
the lowest grade from the 1960’s. (Use the result from the previous problem.)
5. Write a single MySQL command that returns the number of Harry Potter movies
that are in the database.
6. Write a single MySQL command that returns the number of movies that have the
language with lid = 6.
7. Write a single MySQL command that returns the first and last names of the actors
who have a last name that begins with ‘As’.
8. Write a single MySQL command that returns the first and last name of the actors
that have the same last three letters in their first and last names. (Example, Slim
Jim has the same last three letters in the first and last names.)
Chapter 31
The previous queries captured information from a single table. This chapter will consider
queries that require multiple tables.
The query Q19 starts with the actor’s aid and requests the titles of the movies that this
actor has been in. The aid information is contained in the isin and actors tables while the
title information is stored in the movies table. Thus, the query must involve more than
one table.
The database schema, or the design of the tables, contains equivalent fields in mul-
tiple tables. For this query, it is important to note that the isin table and the movies table
both contain the movie mid value. In this schema, both fields are also named mid but this
is a convenience rather than a requirement.
31.1.1 Schema
A properly designed schema will allow the user to create queries to answer all needed
questions. Often the design of the schema begins with a collection of the questions that
are expected to be asked of the database.
The schema for the movies database is shown in Figure 31.1. The fields of each table
are listed. The primary key is the first entry and denoted by a symbol. The lines between
the tables show the fields that are common with the tables. In this view it is possible to
see that all tables are connected and so it is possible to start with any type of information
and pursue the answer that is in another table. If the query were to find the actors that
were in movies from a certain country, this schema figure shows that the query would need
to route through the country, incountry, isin and actors tables. This information would
then lead to the construction of the query.
Figure 31.1: The movies schema.
The easy method of linking tables is to simply include the tables in the query and have
a condition that equates their common fields. This may not be the most efficient means,
but it is a good place to start.
Query Q19 seeks the names of the movies starting with an actor's aid. This requires the use of the isin and movies tables, where the common field is the mid value. The query is shown in Code 31.1. Line 1 shows the values to be returned, and the mid field now has the table declaration. In this query there are two fields named mid, and thus it is necessary to declare which table is to be used for the return. The values are the same, and so using either movies.mid or isin.mid produces the same answer. The second field returned is the movie title, and there is no ambiguity as to which table this field resides in.
Line 2 lists the two tables involved in this query separated by a comma. Multiple
tables can be listed, but in this case only two are needed. Line 3 links the two tables
together. This line indicates that the values in the movies.mid field are the same values
as in the isin.mid. Line 4 finishes the conditions of the query by indicating that the aid
= 281.
Query Q20 takes this concept one step further by starting with the actor's name instead of an aid value. According to the schema shown in Figure 31.1 it is necessary to start with the actors table, progress through the isin table, and finish with the movies table. Thus, three tables are involved, as shown in line 2 of Code 31.2. Line 3 connects the movies table to the isin table and the isin table to the actors table. Line 4 finishes the conditions.
Code 31.1 A query using two tables.
1 mysql> SELECT movies.mid, title
2 FROM movies, isin
3 WHERE movies.mid=isin.mid
4 AND isin.aid=281;
5 +-----+--------------------------+
6 | mid | name |
7 +-----+--------------------------+
8 | 44 | For the Love of the Game |
9 | 229 | 9 |
10 | 347 | Shadows and Fog |
11 | 554 | A Prairie Home Companion |
12 +-----+--------------------------+
The functions shown in previous queries are also available in queries that use multiple tables. Query Q21 requests the average grade for John Goodman movies. This is similar to the previous query in that it is necessary to convert the actor's name to an aid, convert that to multiple mid values, and finally convert those to movie titles. The only real difference is line 1, as shown in Code 31.3.
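A sketch of what a Q21-style query might look like, sent through the Python interface of Chapter 32 with hypothetical connection values, is:

import pymysql

# Hypothetical connection values; the query follows the pattern of Code 31.2.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='movies')
c = db.cursor()
c.execute("SELECT AVG(grade) FROM movies, isin, actors "
          "WHERE movies.mid=isin.mid AND isin.aid=actors.aid "
          "AND actors.firstname='John' AND actors.lastname='Goodman'")
print(float(c.fetchone()[0]))   # the average grade as a float
db.close()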
Query Q22 requests the movies that are in French. This requires the langs, inlang
and movies tables. Structurally, the query is similar to the previous and the query is
shown in Code 31.4.
Query Q23 seeks the languages of the Peter Falk movies. This query requires four
tables to travel from the actors table to the langs table. The query also requires that each
language be listed only once. The query is shown in Code 31.5 in which line 1 uses the
DISTINCT function to prevent multiple listings of any language. Line 2 lists the four
tables and lines 3 and 4 tie them together. Line 5 finishes the conditions of the query.
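A sketch of what the Q23 query might look like, again with hypothetical connection values, is:

import pymysql

# Hypothetical connection values; four tables connect actors to langs.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='movies')
c = db.cursor()
c.execute("SELECT DISTINCT langs.language "
          "FROM actors, isin, inlang, langs "
          "WHERE actors.aid=isin.aid AND isin.mid=inlang.mid "
          "AND inlang.lid=langs.lid "
          "AND actors.firstname='Peter' AND actors.lastname='Falk'")
for row in c.fetchall():
    print(row[0])
db.close()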
Query Q24 seeks the movies that have both Maggie Smith and Daniel Radcliffe. This is
an extension of a previous query that requested movies from a single actor, which itself
was an extension of Query Q19 that started with the aid and progressed to the movie
title using just two tables. The query path is diagrammed in Figure 31.2 which shows the
tables as ovals.
The issue with Query Q24 is that the same tables are used multiple times.
Code 31.2 A query using three tables.
1 mysql> SELECT movies.mid, title
2 FROM movies, isin, actors
3 WHERE movies.mid=isin.mid AND isin.aid=actors.aid
4 AND actors.firstname=’John’ AND actors.lastname=’Goodman’;
5 +-----+-----------------------------------------------------+
6 | mid | name |
7 +-----+-----------------------------------------------------+
8 | 78 | Monsters Inc. |
9 | 95 | Raising Arizona |
10 | 88 | O Brother, Where Are Thou |
11 | 119 | The Big Lebowski |
12 | 278 | Revenge of the Nerds |
13 | 291 | The Flintstones |
14 | 435 | Marilyn Hotchkiss’ Ballroom Dancing & Charm School |
15 | 661 | Matinee |
16 | 682 | True Stories |
17 | 779 | The Artist |
18 +-----+-----------------------------------------------------+
Code 31.4 Movies in French.
1 mysql> SELECT movies.mid,title
2 FROM movies, inlang, langs
3 WHERE movies.mid=inlang.mid AND inlang.lid=langs.lid
4 AND langs.language=’French’;
5 +-----+------------------------------------------+
6 | mid | name |
7 +-----+------------------------------------------+
8 | 14 | Blame it on Fidel |
9 | 54 | Hotel Rwanda |
10 | 60 | Jesus of Montreal |
11 | 80 | Munich |
The actors and isin tables are used for the Maggie Smith component of the query and then again for the Daniel Radcliffe component. Determining the mid values for Maggie Smith is independent of the search for the mid values for Daniel Radcliffe. It is only after the mid values for both actors have been collected that they are combined. Thus, the actors and isin tables for Maggie Smith are used independently of those for Daniel Radcliffe. Basically, the query requires two distinct uses of the same tables.
Query Q25 extends this one step further, as it searches for the other actors that are in the same movies as Radcliffe and Smith. The mid values common to these two actors are pushed through isin and actors a third time to get the names of the other actors. This query uses these two tables three independent times. The flow of this query is shown in Figure 31.3.
Multiple uses of the same tables are handled by renaming instances of the tables with different labels. In Query Q24, for the Daniel Radcliffe portion of the query, instances of the isin and actors tables are renamed i1 and a1, respectively. Likewise, the Maggie Smith portion of the query uses tables named i2 and a2.
This query is shown in Code 31.6. Line 2 creates the two instances of these tables
along with the movies table which is needed to retrieve the movie titles. Line 3 connects
the movies table to the two instances of the isin table. Line 4 connects the isin tables
to their respective actors tables. The last two lines create the condition for the actor’s
names.
In Q25, three instances of the actors table are used, as shown in Figure 31.3. Small numbers are placed next to the table names to indicate which instance is being used. Numbers above the attribute names are used just for referencing here in the text. On the left, in circles 1 and 4, are the names of the two target actors. These are converted to their aid numbers, which are converted to their lists of movies in circles 3 and 6. These are combined with an intersection, so that circle 7 is the list of mids that had both actors. Circle 8 contains the aids of all actors in those movies, and their names are revealed in circle 9.
Consider the transition from circle 1 to circle 2. In this step the name Daniel Radcliffe is converted into an aid using the actors table. The query is shown in Code 31.7, and as seen his aid is 238. A similar query performed for Maggie Smith reveals her aid.
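A sketch of this name-to-aid lookup, with hypothetical connection values, is:

import pymysql

# Hypothetical connection values; converts an actor's name to an aid.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='movies')
c = db.cursor()
c.execute("SELECT aid FROM actors "
          "WHERE firstname='Daniel' AND lastname='Radcliffe'")
print(c.fetchone()[0])   # 238, as stated in the text
db.close()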
Code 31.6 Movies common to Daniel Radcliffe and Maggie Smith.
1 mysql> SELECT movies.mid, title
2 FROM movies,isin AS i1, isin AS i2, actors AS a1, actors AS a2
3 WHERE movies.mid=i1.mid AND movies.mid=i2.mid
4 AND i1.aid=a1.aid AND i2.aid=a2.aid
5 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
6 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
7 +-----+------------------------------------------+
8 | mid | name |
9 +-----+------------------------------------------+
10 | 184 | Harry Potter and the Sorcerer’s Stone |
11 | 186 | Harry Potter and the Prisoner of Azkaban |
12 | 187 | Harry Potter and the Goblet of Fire |
13 +-----+------------------------------------------+
The next step is to use the isin table to convert the aid into a list of mids for the
movies that Radcliffe has been in. This requires the use of both the actors and the isin
tables and the query is shown in Code 31.8.
A similar query can be performed for Maggie Smith and the mids for both will be
combined in circle 7. In order for this to occur it will be necessary to perform two searches
on the actors and isin tables. These two individual searches are performed by renaming
each table twice with different names. First, consider the rename of the tables for just the
Radcliffe portion of the query which is shown in Code 31.9.
The small numbers in the rectangles in Figure 31.3 coincide with the renaming of
the tables. The rectangle that has ‘actors 1’ is a1 in the query.
The next step is to duplicate this query for Maggie Smith, using i2 and a2 instead of i1 and a1. These two queries must then be combined such that only those mids that are in common survive. The query is shown in Code 31.10, with line 4 isolating the common mids.
Code 31.8 Radcliffe’s mid.
1 mysql> SELECT mid FROM isin, actors
2 WHERE isin.aid=actors.aid
3 AND actors.firstname=’Daniel’ AND actors.lastname=’Radcliffe’;
4 +------+
5 | mid |
6 +------+
7 | 184 |
8 | 185 |
9 | 186 |
10 | 187 |
11 | 400 |
12 +------+
13 5 rows in set (0.57 sec)
Code 31.10 The mids with both Smith and Radcliffe.
1 mysql> SELECT i1.mid, i2.mid
2 FROM isin AS i1, actors AS a1, isin AS i2, actors as a2
3 WHERE i1.aid=a1.aid AND i2.aid=a2.aid
4 AND i1.mid=i2.mid
5 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
6 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
7 +------+------+
8 | mid | mid |
9 +------+------+
10 | 184 | 184 |
11 | 186 | 186 |
12 | 187 | 187 |
13 +------+------+
14 3 rows in set (1.47 sec)
Line 1 selects the mids from both actors and as shown in the output only one was
really necessary. Line 2 creates two names for each table with a1 and i1 being used for
Radcliffe and a2 and i2 being used for Smith. Line 3 connects the aid attribute for each
pair (a1,i1) and (a2,i2). Line 4 connects the two isin tables which will perform the
intersection necessary to get to circle 7. Lines 5 and 6 create the targets and the results
are shown starting in line 7. As seen there are 3 such movies.
The next step is to convert those mids to aids of all of the actors that are in those
movies. This will require a third query through the isin table. The query is shown in
Code 31.11 which will show the actor’s aid and the mid of the movies.
Line 2 adds the isin AS i3 component which will be used to convert mids to aids.
The linkage is made in line 3 which connects the mid of the third isin table with the mid
of the first isin table. In this case the second isin table could have been used instead of
the first. The rest of the query is the same. The results show the aid of the actor in one
of the three movies.
The only result that is needed is the aids of the actors and duplicates are not desired.
So, the query is modified slightly in Code 31.12 to extract just the aids and to use the
DISTINCT keyword to remove the duplicates. The results are single instances of the aids
of the actors that were in movies with Smith and Radcliffe. This is circle 8 in Figure 31.3.
The final step is easy and that is to convert the aids to names. However, this
requires another query through the actors table to convert their aids back to their names.
Query Q25 is completed in Code 31.13. Line 1 requests information from the third instance of the actors table. Lines 2 and 3 define the instances of the tables, and lines 4 through 6 tie the tables together. Lines 7 and 8 set the search conditions, and the results are shown below.
Code 31.11 The aid of other actors.
1 mysql> SELECT i3.aid, i1.mid
2 FROM isin AS i1, actors AS a1, isin AS i2, actors as a2, isin AS i3
3 WHERE i3.mid=i1.mid
4 AND i1.aid=a1.aid AND i2.aid=a2.aid AND i1.mid=i2.mid
5 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
6 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
7 +------+------+
8 | aid | mid |
9 +------+------+
10 | 236 | 184 |
11 | 237 | 184 |
12 | 222 | 184 |
13 | 238 | 184 |
14 | 128 | 184 |
15 | 228 | 184 |
16 | 680 | 184 |
17 | 240 | 186 |
18 | 222 | 186 |
19 | 237 | 186 |
20 | 228 | 186 |
21 | 238 | 186 |
22 | 680 | 186 |
23 | 222 | 187 |
24 | 237 | 187 |
25 | 228 | 187 |
26 | 238 | 187 |
27 | 680 | 187 |
28 +------+------+
29 18 rows in set (1.40 sec)
Code 31.12 Unique actors.
1 mysql> SELECT DISTINCT(i3.aid)
2 FROM isin AS i1, actors AS a1, isin AS i2, actors as a2, isin AS i3
3 WHERE i1.aid=a1.aid AND i2.aid=a2.aid AND i1.mid=i2.mid AND i1.mid=i3.mid
4 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
5 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
6 +------+
7 | aid |
8 +------+
9 | 236 |
10 | 237 |
11 | 222 |
12 | 238 |
13 | 128 |
14 | 228 |
15 | 680 |
16 | 240 |
17 +------+
18 8 rows in set (1.59 sec)
Code 31.13 Actors common to movies with Daniel Radcliffe and Maggie Smith.
1 mysql> SELECT DISTINCT a3.firstname, a3.lastname
2 FROM movies, isin AS i1, isin AS i2, isin AS i3,
3 actors AS a1, actors AS a2, actors AS a3
4 WHERE i1.mid=movies.mid AND i1.aid=a1.aid
5 AND i2.mid=movies.mid AND i2.aid=a2.aid
6 AND i3.mid=movies.mid AND i3.aid=a3.aid
7 AND a1.firstname=’Daniel’ AND a1.lastname=’Radcliffe’
8 AND a2.firstname=’Maggie’ AND a2.lastname=’Smith’;
9 +-----------+-----------+
10 | firstname | lastname |
11 +-----------+-----------+
12 | Richard | Harris |
13 | Maggie | Smith |
14 | Robbie | Coltrane |
15 | Daniel | Radcliffe |
16 | Ed | Harris |
17 | Alan | Rickman |
18 | Emma | Thompson |
19 | Warwick | Davis |
20 +-----------+-----------+
31.2 Joining Tables
Linking tables can certainly be performed as shown in the previous section, but as the queries become more complicated it is important to consider the efficiency of the query. Just as in programming, if the query statement is poorly constructed then the search can be very inefficient. The solution is to link the tables through the JOIN command. There are four main types of joins: INNER JOIN, LEFT JOIN, RIGHT JOIN and OUTER JOIN.
The INNER JOIN is the same as JOIN and works similarly to the commands in the previous section. Code 31.14 shows a command that uses the JOIN-ON construct. The first table is listed, the second table is listed after JOIN, and the attributes that are linked follow ON.
Code 31.14 The mids for Cary Grant.
1 mysql> SELECT isin.mid
2 FROM isin JOIN actors ON isin.aid=actors.aid
3 WHERE firstname=’Cary’
4 AND lastname=’Grant’;
5 +------+
6 | mid |
7 +------+
8 | 83 |
9 | 267 |
10 | 297 |
11 | 298 |
12 | 343 |
13 | 387 |
14 | 267 |
15 | 432 |
16 +------+
17 8 rows in set (0.16 sec)
The inner join is shown pictorially in Figure 31.4 which shows data from two tables
A and B. The data that is returned is the shaded area. In the ongoing example these are
the actors and isin tables and the shaded area includes those entries that have the same
aid.
The database currently has three movies that have the substring 'under' in the title, as shown in Code 31.15. Consider another query, shown in Code 31.16, which uses JOIN and links the mids from the movies and isin tables.
Figure 31.4: Inner join.
There are two major differences in the output. First, the movie Under the Bombs is
not listed, and second the other two movies are listed multiple times. The movie Tropic
Thunder is listed six times because there are six actors associated with this movie in the
isin table. Likewise, Under the Tuscan Sun has two entries because two actors are listed
in isin. Each returned tuple is unique because the isin.isid attribute is unique.
A LEFT JOIN is shown pictorially in Figure 31.5 which shows that this query will include
items from table A and items that are in both A and B.
The query is shown in Code 31.17, which replaces JOIN with LEFT JOIN. As seen, there is now a new entry for the movie Under the Bombs. It was excluded from Code 31.16 because there are no actors for this movie listed in the isin table.
Code 31.16 Inner join with multiple returns.
1 mysql> SELECT m.mid, m.title, i.mid, isid FROM movies AS m
2 JOIN isin AS i ON m.mid=i.mid
3 WHERE m.title LIKE ’%Under%’;
4 +-----+----------------------+------+------+
5 | mid | title | mid | isid |
6 +-----+----------------------+------+------+
7 | 160 | Tropic Thunder | 160 | 285 |
8 | 160 | Tropic Thunder | 160 | 286 |
9 | 160 | Tropic Thunder | 160 | 287 |
10 | 160 | Tropic Thunder | 160 | 288 |
11 | 160 | Tropic Thunder | 160 | 289 |
12 | 160 | Tropic Thunder | 160 | 290 |
13 | 324 | Under the Tuscan Sun | 324 | 963 |
14 | 324 | Under the Tuscan Sun | 324 | 964 |
15 +-----+----------------------+------+------+
16 8 rows in set (0.03 sec)
However, in the LEFT JOIN query the movie is included because it is in table A.
The RIGHT JOIN is similar in concept to the LEFT JOIN, except for which table is fully included. The pictorial representation is shown in Figure 31.6. In this example the RIGHT JOIN does not produce a different output from the JOIN because every entry in the isin table has an associated movie.
An OUTER JOIN would include all entries from all tables, even if there are entries in one table that have no associations in the other table. The logic is shown in Figure 31.7.
Other types of joins are shown in the following images and codes [Moffatt, 2009].
Code 31.17 Left join with multiple returns.
1 mysql> SELECT m.mid, m.title, i.mid, isid FROM movies AS m
2 LEFT JOIN isin AS i ON m.mid=i.mid
3 WHERE m.title LIKE ’%Under%’;
4 +-----+----------------------+------+------+
5 | mid | title | mid | isid |
6 +-----+----------------------+------+------+
7 | 160 | Tropic Thunder | 160 | 285 |
8 | 160 | Tropic Thunder | 160 | 286 |
9 | 160 | Tropic Thunder | 160 | 287 |
10 | 160 | Tropic Thunder | 160 | 288 |
11 | 160 | Tropic Thunder | 160 | 289 |
12 | 160 | Tropic Thunder | 160 | 290 |
13 | 324 | Under the Tuscan Sun | 324 | 963 |
14 | 324 | Under the Tuscan Sun | 324 | 964 |
15 | 491 | Under the Bombs | NULL | NULL |
16 +-----+----------------------+------+------+
17 9 rows in set (0.06 sec)
The left excluding join includes those items in A but not in B, and the code is shown in Code 31.18.
(a) Left excluding join (b) Right excluding join (c) Outer excluding join
Query Q26 is to return a list of the movies with each actor’s aid. The query is
shown in Code 31.19 using the RIGHT JOIN.
A functional dependency occurs when tuples contain elements that agree with other tuples. Consider a case in which a table has several columns C1 through CN, and some of the elements agree across multiple tuples. For example, in some rows of the table there are cases in which the first three columns agree. That means if row R1 contains c1, c2, and c3 as values for the first three columns and if R2 has the same values, then they agree.
A functional dependency occurs if for the same values of c1 and c2 there is only one possible c3. Basically, the value of the third column can be predicted by the values in the first two columns. The determining columns act as keys of the relation; in the example these are the first two columns. A functional dependency for this case is written as C1, C2 → C3. It then follows that if A → B and B → C then A → C.
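A functional dependency is easy to test for in Python. The following sketch, with invented rows, checks whether C1 and C2 determine C3.

# Invented rows; each tuple holds the values of columns C1, C2 and C3.
rows = [(1, 2, 'a'), (1, 2, 'a'), (3, 4, 'b'), (1, 2, 'a')]
seen = {}
holds = True
for c1, c2, c3 in rows:
    key = (c1, c2)
    if key in seen and seen[key] != c3:
        holds = False    # two tuples agree on C1,C2 but disagree on C3
    seen[key] = c3
print(holds)   # True: C3 is predicted by C1 and C2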
31.3 Subqueries
A subquery is a query nested within another query. Much like nested for loops in Python,
the improper use of subqueries can lead to processes that consume far too much time and
resources. Subqueries should be used with care and only when necessary.
Code 31.19 The movie listed with each actor.
1 mysql> SELECT m.mid, m.title, i.aid
2 FROM movies AS m
3 RIGHT JOIN isin AS i ON m.mid=i.mid
4 WHERE m.title LIKE ’%Under%’;
5 +------+----------------------+------+
6 | mid | name | aid |
7 +------+----------------------+------+
8 | 160 | Tropic Thunder | 88 |
9 | 160 | Tropic Thunder | 94 |
10 | 160 | Tropic Thunder | 196 |
11 | 160 | Tropic Thunder | 197 |
12 | 160 | Tropic Thunder | 27 |
13 | 160 | Tropic Thunder | 164 |
14 | 324 | Under the Tuscan Sun | 466 |
15 | 324 | Under the Tuscan Sun | 479 |
16 | 160 | Tropic Thunder | 734 |
17 | 776 | Undertaking Betty | 270 |
18 | 776 | Undertaking Betty | 748 |
19 +------+----------------------+------+
Query Q19 sought the name of a movie for an actor with a given aid. Code 31.20 shows the same query but with the use of a subquery. Line 2 contains the subquery within parentheses, which returns the mid values for a given actor. These values are then used in the primary query.
When the results from a subquery are being used in a condition it is necessary to assign an alias to the subquery. This subquery is in line 2 and renamed t, which is then used in line 3 in a condition.
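A sketch of a subquery used in a condition, with the movies schema and hypothetical connection values, is shown below; MySQL requires an alias (here t) for any such derived table.

import pymysql

# Hypothetical connection values; t names the derived table so it can be reused.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='movies')
c = db.cursor()
c.execute("SELECT t.year, t.g "
          "FROM (SELECT year, AVG(grade) AS g FROM movies GROUP BY year) AS t "
          "WHERE t.g > 6.5")
for row in c.fetchall():
    print(row)
db.close()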
Efficient use of subqueries is a bit tricky to accomplish in complicated queries. The
user is highly encouraged to test each subquery to ensure that the response that they
expect is indeed the response that is returned.
31.4 Combinations
Queries Q27 and Q28 use mulitple devices to retrieve the correct results. Q27 seeks the 5
actors that have been in the most movies. This is a simple enough query to understand
but complicated to achieve. It is necessary to count the number of movies that all actors
have been in before it is possible to find the top 5.
The query is shown in Code 31.22, where line 1 shows the items to be retrieved, which come from multiple tables. The last item is the count of movies, which is assigned an alias because this count is used later in the query.
Code 31.20 The use of a subquery.
1 mysql> SELECT title FROM movies WHERE mid IN
2 (SELECT mid FROM isin WHERE aid=12);
3 +-------------------------------------+
4 | name |
5 +-------------------------------------+
6 | Back to the Future |
7 | Interstate 60: Episodes of the Road |
8 | Twenty Bucks |
9 | Who Framed Roger Rabbit |
10 | Addam’s Family Values |
11 | The Addams Family |
12 | The Dream Team |
13 | To Be or Not to Be |
14 | My Favorite Martian |
15 +-------------------------------------+
Line 2 lists the two tables and connects them. Line 3 uses the GROUP BY function to collect the counts for each actor. Line 4 then orders the returned data and uses LIMIT to print out just the top five.
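A sketch of what a Q27-style query might look like, with hypothetical connection values, is shown below. Query Q28, described next, follows the same pattern with AVG(grade) in the return and a HAVING condition on the count.

import pymysql

# Hypothetical connection values; n counts the movies for each actor.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='movies')
c = db.cursor()
c.execute("SELECT actors.firstname, actors.lastname, COUNT(isin.mid) AS n "
          "FROM actors, isin WHERE actors.aid=isin.aid "
          "GROUP BY actors.aid ORDER BY n DESC LIMIT 5")
for row in c.fetchall():
    print(row)
db.close()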
Query Q28 seeks the actors with the best average score, with the condition that the actors have been in at least five movies. Code 31.23 shows the query, where once again the return shows data from multiple tables, including one with the average function. Lines 2 and 3 declare which tables are used and how they are connected. Line 4 groups the data but adds the condition that the actor must have at least five movies. Line 5 orders the return and limits the results.
31.5 Summary
The real power of database searches is the ability to combine information from multiple
tables. This may be a simple trace through a schema or a query that involves multiple
instances of tables or subqueries. This chapter displayed single queries that retrieved data
that was difficult to retrieve using a spreadsheet.
Problems
1. Write a single MySQL command that returns an alphabetical list of all of the movies
from 1985.
2. Retrieve the years of the movies starring Cary Grant.
3. Write a single MySQL command that returns the number of movies that are in
English.
4. Write a single MySQL command to return the average grade of movies in Spanish.
5. What is the average grade for movies with Elijah Wood?
6. In a single command, return the averages for movies with either Elijah Wood or
John Goodman.
7. How many movies was Dan Aykroyd in?
8. Write a single MySQL command that returns the first and last names of the actors
that are in the Harry Potter movies. This list should be alphabetically ordered by
last name and have no duplicates.
9. Write a single MySQL command to determine the name of the actor that was in the most movies.
10. Write a single command to determine if Peter Falk was in a movie with a language
other than English.
11. In a single MySQL command return the names of the countries of the movies that
starred Pete Postlethwaite. The answer should have no duplicates.
12. Write a single MySQL command that displays the year and average grade for the
year with the highest average grade and at least 7 movies.
13. Write a single MySQL command that returns the names of the movies that have
both Steve Martin and Humphrey Bogart.
14. List the names of the actors that were in movies with a grade of 9 or better. This list should be alphabetical according to the last name of the actor, and no actor should be listed more than once.
Chapter 32
MySQL has the ability to search data and even has functions that iteratively consider the data. However, languages such as Python are far more powerful in data analysis than MySQL. Thus, it is prudent to connect the two systems together. In this manner Python scripts can use MySQL to sift through the data stored in a database and then perform complicated analysis on that data. It also allows a program to send several queries to the database in an effort to obtain the desired information.
There are three basic steps in the process. The first is to connect to the database, the
second is to deliver a query statement to the database, and the third is to receive any data
that the database is returning. This section will review all three processes.
There are several third party tools that can be used to connect Python to MySQL. This is
the case with any language actually. Programmers in Java and C++ also need to import
a tool that makes this connection.
The popular tool for Python 2.7 users has been mysqldb, which (at the time of writing this chapter) is not available for Python 3.x. The popular tool for users of Python 3.x is pymysql, which is included in packages such as Anaconda. The import statement is shown in line 1 of Code 32.1.
There are four possible pieces of information that are needed to connect to the database. These are the name of the host machine (if different from the user's machine), the name of the database, the name of the MySQL user, and the user's MySQL password. These are established as strings, of which the first is shown in line 2; the creation of the others is assumed in line 3. Line 4 makes the connection to the database using these four variables. Note that the variable for the password is passwd. The final step is to define the cursor, which is the avenue by which Python will communicate with MySQL. This is performed in line 5. Finally, line 6 can be used to close the connection.
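A minimal sketch of these steps, with hypothetical host, database, user and password strings, is:

import pymysql

# Hypothetical connection values; replace them with your own settings.
host = 'localhost'
dbase = 'movies'
user = 'user'
passwd = 'secret'
db = pymysql.connect(host=host, db=dbase, user=user, passwd=passwd)
cursor = db.cursor()
# ... queries are sent through cursor here ...
db.close()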
Now, the connection is made and the two systems are ready to communicate. The
next step is to send a MySQL command and receive the data.
Sending a query to the database and receiving the responses is quite easy. The pymysql
module has functions other than the ones that will be shown in this section, but the ones
shown here are sufficient for many applications.
The process of sending a query is to create a string in Python that is the desired MySQL command without the semicolon, and then to send that string via the cursor that was created in line 5 of Code 32.1. Line 2 in Code 32.2 creates a string, and the next line sends it to the database using the execute command. The value of n is the number of lines returned by the query. There is a similar command named executemany which will be shown in Code 32.3.
There are three common methods by which the data can be retrieved by Python. In all cases each line of data is stored as a tuple, even if the data returned contains only a single value. It will be a single value inside of a tuple. Line 4 uses the fetchone command to retrieve one line of the MySQL return. Repeated uses of fetchone will retrieve consecutive lines in the return. In a sense this command is all that is required; however, there are two other commands that can provide convenience. Line 5 shows the fetchall command that retrieves all of the lines from the MySQL query into a tuple. Each line is also a tuple, and so the return from this command is a tuple that contains tuples. The fetchmany function is similar except that the user specifies that only n lines are returned.
The variable answ from line 4 is a tuple. The number of items in the tuple is the
number of items that are returned from the query. From this point forward, the user
employs Python scripts to extract the data from the tuple and to further process the
information.
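A sketch of this send-and-fetch cycle, with a hypothetical query and connection values, is:

import pymysql

# Hypothetical connection values and an illustrative query.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='movies')
cursor = db.cursor()
n = cursor.execute('SELECT mid, title FROM movies WHERE year=1968')
print(n)                    # the number of lines returned by the query
print(cursor.fetchone())    # the first line, a tuple
print(cursor.fetchmany(2))  # the next two lines, a tuple of tuples
print(cursor.fetchall())    # all remaining lines
db.close()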
There are MySQL commands to alter the content or tables of a database. These, too, can be managed through the Python interface. However, there is a small commitment that the user must make in order for the changes to become permanent. Consider Code 32.3, which uses the executemany command to upload three changes to the database in lines 1 through 5. If the user were to query the database they would see these changes. However, if the user were to log out then the changes would be lost. Line 6 shows the commit function that uses the connection created in Code 32.1. This will make the changes permanent.
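A sketch of an executemany upload followed by commit, with hypothetical connection values and invented updates, is:

import pymysql

# Hypothetical connection values; the aid values and names are invented.
db = pymysql.connect(host='localhost', user='user', passwd='secret', db='movies')
cursor = db.cursor()
cmd = 'UPDATE actors SET lastname=%s WHERE aid=%s'
cursor.executemany(cmd, [('Smith', 901), ('Jones', 902), ('Kelly', 903)])
db.commit()   # without this call the changes are lost at disconnect
db.close()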
Once the cursor is created, several queries can be sent to the database, as shown in Code 32.4. The cursor does not have to be reestablished after every query. It is important to note that care should be exercised with multiple queries. It is possible that users will send a large number of small queries to the database. If the database is on a server that is far from the user, then there is a time cost to receiving the data. Thus, a large number of small queries can be a recipe for a slow running program. Likewise, a full table dump can also be expensive, as a large amount of data must travel across the network.
The rule of thumb is to minimize the number of queries as well as the amount of data to be retrieved. So, the user should attempt to perform as much pruning as possible with the MySQL command. If the DBMS is on the same computer as the Python scripts then this issue slackens, and the time required to retrieve the data is significantly less.
Code 32.4 Sending multiple queries.
Query Q29 is to compute the average grade for each decade. Creating a MySQL
command to compute the average grade of the movies in a single decade is not difficult.
The plan then is to create a Python loop that creates this command for each decade.
The command for one decade is shown in Code 32.5 where act is the string to be sent to
MySQL. The condition that grade>0 excludes those movies with an invalid grade.
Code 32.5 The query for one decade.
1 >>> act = 'SELECT AVG(grade) FROM movies WHERE year BETWEEN 1920 AND 1929 AND grade>0'
2 >>> cursor.execute(act)
3 1
4 >>> float(cursor.fetchone()[0])
5 7.5
Code 32.6 shows a solution to Q29. The for loop iterates through each decade. The
string act is similar to the previous except that the years change. Each query is then sent
to the database and the answer is received and printed.
The Kevin Bacon effect was discussed in Section 28.3. Basically, actors are in movies with other actors, and the purpose is to find the connection from a given actor to Kevin Bacon. For example, the path from Johnny Depp to Kevin Bacon is as follows. Johnny Depp and Dianne Wiest were in Edward Scissorhands, Wiest and Steve Martin were in Parenthood, and Martin and Kevin Bacon were in Planes, Trains & Automobiles. In this database, this is the shortest path from Depp to Bacon.
Computing the shortest path is performed by the Floyd-Warshall algorithm presented in Section 26.3.2.
Code 32.6 Sending multiple queries.
1 >>> # a sketch of the decade loop; these code lines are assumed
2 >>> # from the printed output below
3 >>> for d in range(1920, 2010, 10):
4 ...     act = 'SELECT AVG(grade) FROM movies WHERE year BETWEEN '
5 ...     act += str(d) + ' AND ' + str(d+9) + ' AND grade>0'
6 ...     cursor.execute(act)
7 ...     print(d, float(cursor.fetchone()[0]))
8 ...
9 1920 7.5
10 1930 7.8
11 1940 6.64
12 1950 6.6667
13 1960 6.0333
14 1970 5.9138
15 1980 5.5575
16 1990 5.9506
17 2000 6.0314
In this case all of the actor data will be needed, so the entire
table is downloaded and parsed. The process begins in Code 32.7 with two functions. The first is Connect, which receives the MySQL host computer URL, the name of the database, and the MySQL user name and password. It returns the connection to the database, db, and the cursor, c. The second function is DumpActors, which returns all of the actors' names in a dictionary where the key is the actor's aid. This dictionary is returned by the function in line 15.
The second step is to create the connected graph with the function MakeG, shown in Code 32.8. The result is a binary valued matrix G. The i-th row and the i-th column correspond to the i-th actor. It should be noted that the first index in the matrix is 0 and the first aid is 1; thus the row index and aid currently differ by a value of 1. This will change. The item G[i, j] is set to 1 if the actor corresponding to row i and the actor corresponding to column j were in the same movie.
The third step is to apply the Floyd-Warshall algorithm as seen in Code 32.9. The
function RunFloyd calls the FastFloydP function which returns two matrices that are
used to define the shortest path between any two entities and the distance of that path.
There are actors in this database that cannot be connected to Kevin Bacon. Basically, there are several disconnected graphs. There is one large graph that contains most of the actors, and then a few small graphs that tend to be actors in movies outside of the USA that just have not been connected to the big graph. These spurious actors need to be removed from the matrices in order to proceed.
Code 32.7 The DumpActors function.
1 # bacon.py
2 def Connect(h, d, u, p):
3     db = pymysql.connect(host=h, user=u, db=d, passwd=p)
4     c = db.cursor()
5     return db, c
6
7 def DumpActors(c):
8     act = 'SELECT * FROM actors'
9     c.execute(act)
10     dump = c.fetchall()
11     actors = {}
12     for i in dump:
13         aid, fname, lname = i
14         actors[aid] = fname, lname
15     return actors
Identifying these actors is quite easy. The first actor in the database is Leonardo DiCaprio, who belongs to the large graph. The values of the matrix f from the RunFloyd function indicate the distance of the shortest path. The first row of this matrix has several values of 7 or less, which indicates that DiCaprio is connected to those actors. Since most of the values are small it is easy to conclude that DiCaprio belongs to the one large graph. The few cells that have the superficially large value of 9999999 correspond to actors that are not connected to DiCaprio and therefore do not belong to the large graph. These are the actors that need to be removed from further consideration. This is performed by the RemoveBad function shown in Code 32.10. The inputs are the matrices f, G and p. The akeys is a list of the keys from the actors dictionary (see line 24) and row is the index of the row to be used as the anchor. In this case DiCaprio was the anchor, and as he is the first actor, row = 0.
This function creates a new G matrix, named G1, which is the graph without the
unconnected actors. So, this matrix is a bit smaller than the original. The function also
returns the matrix p.
Since some of the actors have been removed it is necessary to execute the RunFloyd function again, as shown in Code 32.11. Line 2 finds the shortest path between entities 8 and 10. These are the rows in the matrices and are not the aids of the actors. Originally, the row index and actor aid were offset by a value of 1. However, since actors have been removed, even this guide is no longer valid. As seen in line 5 of Code 32.11, row 412 corresponds to the actor with aid = 421. The values returned by FindPath show that the path starts with entity 8, passes through entity 412, and ends with entity 10.
Code 32.8 The MakeG function.
1 # bacon.py
2 def MakeG(c, actors):
3     NA = len(actors)
4     G = np.zeros((NA, NA))
5     keys = list(actors.keys())
6     for i in range(NA):
7         fname, lname = actors[keys[1]]
8         act = 'SELECT DISTINCT a2.aid FROM isin AS i1, isin AS i2, '
9         act += 'actors AS a2, actors AS a1 WHERE i1.aid=a2.aid '
10         act += "AND i2.mid=i1.mid AND a1.aid=i2.aid AND a1.aid="
11         act += str(i+1)
12         c.execute(act)
13         ans = c.fetchall()
14         N = len(ans)
15         for j in range(N):
16             col = int(ans[j][0])
17             G[i, col-1] = 1
18     return G
Code 32.9 The RunFloyd function.
1 # bacon.py
2 def RunFloyd(G):
3     GG = np.zeros(G.shape)
4     GG = G + (1-G)*9999999
5     ndx = np.indices(GG.shape)
6     pp = G * ndx[0]
7     f, p = floyd.FastFloydP(GG, pp)
8     return f, p
Code 32.10 The RemoveBad function.
1 # bacon.py
2 def RemoveBad(f, G, p, akeys, row):
3     hits = (f[row] > 999999).nonzero()[0]  # columns of those to remove
4     hits.sort()
5     hits = hits[::-1]
6     for i in hits:
7         print(i)
8         N = len(G)
9         newG = np.zeros((N-1, N-1))
10         newp = np.zeros((N-1, N-1))
11         newG[:i, :i] = G[:i, :i] + 0
12         newG[:i, i:] = G[:i, i+1:] + 0
13         newG[i:, :i] = G[i+1:, :i] + 0
14         newG[i:, i:] = G[i+1:, i+1:] + 0
15         newp[:i, :i] = p[:i, :i] + 0
16         newp[:i, i:] = p[:i, i+1:] + 0
17         newp[i:, :i] = p[i+1:, :i] + 0
18         newp[i:, i:] = p[i+1:, i+1:] + 0
19         a = akeys.pop(i)  # remove this actor
20         G = newG + 0
21         p = newp + 0
22     return G, p
Using akeys it is determined that the corresponding aid values are 9, 421, and 11. These correspond to the actors Tom Hanks, Martin Sheen and Michael J. Fox. This is the shortest path between Hanks and Fox.
Using this method it is possible to discover the shortest path between any two actors. The shortest path between two actors is their geodesic distance. The goal of Query Q30 is to find the longest geodesic distance, that is, a pair of actors who have a very long shortest path. Again the information is readily available since the geodesic distances are in matrix f1. The location of the maximum value indicates which two actors are at each end of this path.
There may be several pairs of actors that have the same geodesic distance. The function Trace in Code 32.12 finds one of those pairs and prints out the actors' names and movies. In order to get this information it is necessary to send several commands to the database. This is the string act, which is inside of the for loop. The result allows the user to find the path that is the longest geodesic distance between two actors. In this case the actor path is: Arliss Howard, Debra Winger, Nick Nolte, Jack Black, Pierce Brosnan, Robbie Coltrane, Shirley Henderson and Mads Mikkelsen. The longest geodesic path is between the actors Arliss Howard and Mads Mikkelsen. There are other pairs of actors with the same geodesic length.
32.3 Problems
2. Repeat problem 1 for any of the queries in the list in Section 28.2.
3. The path in Code 32.11 indicated which actors were in the trace but not their com-
mon movies. Write a Python script that accesses the database to find the common
movies for the actors in this list.
4. How many unique pairs of actors have the longest geodesic distance?
Code 32.12 The Trace function.
1 # bacon.py
2 def Trace(f1, p1, akeys, actors, c):
3     N = len(f1)
4     V, H = divmod(f1.argmax(), N)
5     print('Max', V, H, f1[V, H])
6     tpath = floyd.FindPath(p1, V, H)
7     aid = np.array(tpath).astype(int)
8     for i in aid:
9         ii = akeys[i]
10         print('Actor:', ii, actors[ii])
11         act = 'SELECT m.name FROM movies AS m, isin WHERE isin.mid=m.mid '
12         act += 'AND isin.aid=' + str(ii)
13         c.execute(act)
14         ans = c.fetchall()
15         for k in ans:
16             print('\t', k[0])
Bibliography
[MyS, ] MySQL Reference Manual, 12.9.1 Natural Language Full-Text Searches. Accessed June 2015.
[Cormen et al., 2000] Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (2000). Introduction to Algorithms. MIT Press.
[Kanaya et al., 2001] Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H., and Ikemura, T. (2001). Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): Characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene, 276:89–99.
[Moffatt, 2009] Moffatt, C. L. (2009). Visual Representation of SQL Joins. CodeProject.
[Porter, 2011] Porter, M. (2011, accessed 27 Jan 2011). Porter Stemming Algorithm. http://www.tartarus.org/~martin/PorterStemmer.
[wikipedia, 2016] wikipedia (2016, accessed 25 Aug 2016). Paris Japonica. https://en.wikipedia.org/wiki/Paris_japonica.