Python for Bioinformatics, Chapter 1
Introduction
CONTENTS
1.1 Who Should Read This Book
1.1.1 What the Reader Should Already Know
1.2 Using this Book
1.2.1 Typographical Conventions
1.2.2 Python Versions
1.2.3 Code Style
1.2.4 Get the Most from This Book without Reading It All
1.2.5 Online Resources Related to This Book
1.3 Why Learn to Program?
1.4 Basic Programming Concepts
1.4.1 What Is a Program?
1.5 Why Python?
1.5.1 Main Features of Python
1.5.2 Comparing Python with Other Languages
Readability
Speed
1.5.3 How Is It Used?
1.5.4 Who Uses Python?
1.5.5 Flavors of Python
1.5.6 Special Python Distributions
1.6 Additional Resources
• The reader should be working (or at least planning to work) with bioinformatics
tools. Even small-scale manual tasks, such as using NCBI BLAST to identify a
sequence, aligning proteins, searching for primers, or estimating a phylogenetic
tree, will be useful for following the examples. The more familiar the reader is
with bioinformatics, the better they will be able to apply the concepts learned in
this book.
text, it marks a new word or concept. For example “One such fundamental data
structure is a dictionary.”
The content of lines starting with $ (dollar sign) is meant to be typed in your
operating system console (also called command prompt in Windows or terminal
in macOS).
↩ : Line break. Some lines are longer than the available space on a printed
page, so this symbol is inserted to mean that what is on the next line of the page
belongs to the same line on the computer screen. Inside code, the symbol used is
<=.
def GetAverage(X):
    avG=sum(X)/len(X)
    " Calculate the average "
    return avG

def get_average(items):
    """ Calculate the average
    """
    average = sum(items) / len(items)
    return average
The latter code sample follows most accepted coding styles for Python.[2]
Throughout the book you will find mostly code formatted like the second sample.
Some code in the book will not follow accepted coding styles for the following
reasons:
• There are some instances where the most didactic way to show a particular
piece of code conflicts with the style guide. On those few occasions, I choose
to deviate from the style guide in favor of clarity.
• Due to the size limitations of a printed book, some names were shortened and
other minor deviations from the coding style were introduced.
• To show that there is more than one way to write the same code. Coding
style is a guideline, and enforcement is not made at a language level, so some
programmers don’t follow it thoroughly. You should be able to read “bad”
code, since sooner or later you will have to read other people’s code.
1.2.4 Get the Most from This Book without Reading It All
• If you want to learn how to program, read the first section, from Chapter
1 to Chapter 8. The Regular Expressions (REGEX) chapter (Chapter 13) can
be skipped if you don’t need to deal with REGEX.
• If you know Python and just want to know about Biopython, first read
Chapter 9 (from page 158 to page 209), which covers Biopython modules and
functions. Then try to follow the programs found in Section III (from page 315
to page 369).
• There are three appendices that can be read independently. Appendix
A (Collaborative Development: Version Control with GitHub) reproduces a
paper called “A Quick Introduction to Version Control with Git and GitHub.”
Appendix B shows how to install a web application using PythonAnywhere.
Appendix C is reference material that can be used as a cheat sheet when
you need a quick answer without having to read a chapter.
[2] The official Python style guide is located at https://www.python.org/dev/peps/pep-0008,
and an easier-to-read style guide is located at
http://docs.python-guide.org/en/latest/writing/style.
Not only can it automate the procedures that we do manually, but it will also
be able to do things that would otherwise not be possible.
Sometimes it is not very clear if a particular task can be done by a program.
Reading a book such as this one (including the examples) will help you identify
which tasks are feasible to automate with software and which ones are better done
manually.
Materials
[4] There are declarative languages that state what the program should accomplish, rather than
describing how to accomplish it. Most computer languages (Python included) are imperative instead
of declarative.
Procedure
1 seq_1 = 'Hello,'
2 seq_2 = ' you!'
3 total = seq_1 + seq_2
4 seq_size = len(total)
5 print(seq_size)
Note: The numbers at the beginning of each line are for reference only;
they are not meant to be typed.
This small program can be read as “Name the string Hello, as seq_1. Name
the string you! as seq_2. Join the strings named seq_1 and seq_2 and call the
result total. Get the length of the string called total and name this value
seq_size. Print the value of seq_size.” This program prints the number 11: six
characters in the first string plus five in the second, counting its leading space.
As shown, there are different types of data (often called “data types” or just
“types”). Numbers (integers or floats), text strings, and other data types are covered
in Chapter 3. In print(seq_size), the instruction is print and seq_size is the
name of the data. Data is often represented as variables. A variable is a name
that stands for a value that may vary during program execution. With variables,
a programmer can represent a generic command like “round n” instead of “round
2.9.” This way the programmer can take into account a non-fixed (hence variable)
value. When the program is executed, “n” must take a specific value, since there is
no way to “round n.” This can be done by assigning a value to a variable or by binding
a name to a value.[5] The difference between “assign a value to a variable” and “bind
a name to a value” is explained in detail in Chapter 3 (from page 64). In any case, it is
expressed as:
variable_name = value
seq_1 = 'Hello'
print(seq_1)
This is read as “print the value named seq_1”. This command prints
Hello because that is the value of this variable.
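The “round n” idea can be sketched in a few lines. This is only an illustration: the values 2.9 and 7.2 are arbitrary choices, and round is Python's built-in rounding function.

```python
# The same instruction, round(n), works for whatever value n is bound to.
n = 2.9
first = round(n)

n = 7.2  # the name n now refers to a different value
second = round(n)

print(first, second)  # prints: 3 7
```

The instruction itself never changes; only the value that the name n refers to does.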
• Built-in features: Python comes with “batteries included.” It has a rich and
versatile standard library that is immediately available, without the user having
to download separate packages. With Python you can, in a few lines, read
and write XML and JSON files, parse and generate email messages, extract
files from a zip archive, open a URL as if it were a file, and do many other
things that in other languages would require a third-party library.
• High-level built-in data structures: Dictionaries, sets, lists, tuples, and others.
These are very useful to model real-world data. Third-party modules such as
NumPy and SciPy can also extend the structures to kd-trees, n-dimensional
arrays, matrix operations, time-series, image objects, and more.
• Open source: Python has a liberal open source license that makes it freely
usable and distributable, even for commercial use.
• Cross platform: A program made in Python can run on any computer
that has a Python interpreter. This way, a program made under Windows 10
can run unmodified on Linux or macOS. Python interpreters are available for
most computers and operating systems, and even for some devices with embedded
computers like the Raspberry Pi.
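Two of the features listed above, the built-in data structures and the “batteries included” standard library, can be illustrated with a short sketch. The record below is made-up data and the file name record.json is an arbitrary choice; only modules shipped with Python (io, json, zipfile) are used.

```python
import io
import json
import zipfile

# Built-in data structures modelling a small, made-up sequence record.
record = {                                # dictionary: key -> value
    "name": "seq_1",
    "bases": ["A", "T", "G", "C", "A"],   # list: ordered, mutable
    "coords": (10, 15),                   # tuple: ordered, immutable
}
unique_bases = set(record["bases"])       # set: no duplicates

# Standard library: JSON round trip without third-party packages.
as_text = json.dumps({"name": record["name"], "bases": record["bases"]})
restored = json.loads(as_text)

# Standard library: build a zip archive in memory and read a file from it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("record.json", as_text)
with zipfile.ZipFile(buffer) as archive:
    extracted = archive.read("record.json").decode("utf-8")

print(sorted(unique_bases))
print(restored["name"], extracted == as_text)
```

Using an in-memory buffer keeps the example from writing anything to disk; a real script would more likely pass a file path to zipfile.ZipFile.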
can be regarded as a tool, and choosing the best tool for the job makes a lot of
sense.
Readability
Nonprofessional programmers tend to value the learning curve as much as the
legibility of the code (both aspects are tightly related).
A simple “hello world” program in Python looks like this:
print("Hello world!")
Let’s see a code sample in C language. The following program reads a file
(input.txt) and copies its contents into another file (output.txt):
#include <stdio.h>
int main(int argc, char **argv) {
    FILE *in, *out;
    int c;
    in = fopen("input.txt", "r");
    out = fopen("output.txt", "w");
    while ((c = fgetc(in)) != EOF) {
        fputc(c, out);
    }
    fclose(out);
    fclose(in);
}
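For comparison, a Python version of the same file copy might look like the sketch below. It is not taken from the original text: the function name copy_file is hypothetical, and the temporary directory is used only so the example runs without touching real files.

```python
import os
import tempfile


def copy_file(src_path, dst_path):
    """Copy the contents of src_path into dst_path, byte for byte."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())


# Demonstration in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as workdir:
    in_path = os.path.join(workdir, "input.txt")
    out_path = os.path.join(workdir, "output.txt")
    with open(in_path, "w") as handle:
        handle.write("ATGCATGC\n")
    copy_file(in_path, out_path)
    with open(out_path) as handle:
        copied = handle.read()

print(copied == "ATGCATGC\n")
```

The with statement closes both files automatically, which is the job done by the two fclose calls in the C program.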
Let’s see a Perl program that calculates the average of a series of numbers:
sub avg(@_) {
    $sum += $_ foreach @_;
    return $sum / @_ unless @_ == 0;
    return 0;
}
print avg((1..5))."\n";
An equivalent program in Python:
def avg(data):
    if len(data) == 0:
        return 0
    else:
        return sum(data) / len(data)
print(avg([1, 2, 3, 4, 5]))
The purpose of this Python program could be almost fully understood by just
knowing English.
Python is designed to be a highly readable language.[13] The use of English keywords
and the use of whitespace (indentation) to delimit code blocks and their internal
logic contribute to this end. It’s possible to write hard-to-read code in Python, but it
requires a deliberate effort to obfuscate the code.[14]
Speed
Another criterion to consider when choosing a programming language is execution
speed. In the early days of computer programming, computers were so slow that
differences due to language implementation were very significant. A program could
take a week to run in an interpreted language, while the same code in a compiled
language could run in a day. The performance difference between interpreted and
compiled languages still has roughly the same proportion, but it is less relevant: a
program that took a week to run now takes less than ten seconds, while the compiled
one takes about one second. Although the difference seems important (about one
order of magnitude), it is not so relevant if we consider the time it takes to develop
the program.
[13] Other languages are regarded as “write only,” since code written in them is very difficult
to understand afterwards.
[14] A simple print('Hello World') program could be written, if you are so inclined, as
print(''.join([chr((L>=65 and L<=122) and (((((L>=97) and (L-96) or (L-64))-1)+13)%26+((L>=97) and 97 or 65)) or L) for L in [ord(C) for C in 'Uryyb Jbeyq!']]))
(https://goo.gl/r5sm9j).
This does not mean that execution speed does not need to be considered. A 10X
speed difference can be crucial in some high-performance computing operations.
Sometimes large improvements can be achieved by writing optimized code. If the
code is written with speed optimization in mind, it is possible to obtain results quite
similar to the ones that could be obtained in a compiled language. In the cases where
the programmer is not satisfied with the speed obtained by Python, it is possible
to link to an external library written in another language (like C or Fortran). This
way, we can get the best of both worlds: the ease of Python programming with the
speed of a compiled language.
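As a rough illustration of the point about optimized code in the Speed discussion, the sketch below times two ways of adding the same numbers: an explicit Python loop and the built-in sum, which in CPython is implemented in C. The function name total_loop and the workload size are arbitrary choices for this example, and the timings depend on the machine, so no expected output is shown.

```python
import timeit


def total_loop(values):
    """Add the numbers one by one in pure Python."""
    total = 0
    for value in values:
        total += value
    return total


numbers = list(range(100_000))

# Both approaches must agree before their speed is compared.
assert total_loop(numbers) == sum(numbers)

# Run each version 20 times and report the total elapsed time.
loop_time = timeit.timeit(lambda: total_loop(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)
print(f"explicit loop: {loop_time:.4f} s")
print(f"built-in sum:  {builtin_time:.4f} s")
```

Relying on a built-in that is already compiled is the simplest form of the “best of both worlds” approach described above; linking to your own C or Fortran library follows the same idea with more setup.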
Python is gaining momentum as the default computer language for the scientific
community. There are several libraries and distributions oriented toward scientific
users, such as SciPy[18] and Anaconda.[19] Both integrate modules for linear algebra,
[15] https://www.python.org/about/success/ilm/
[16] https://www.python.org/about/success/dlink/
[17] http://wiki.services.openoffice.org/wiki/Python
[18] https://www.scipy.org
[19] https://www.continuum.io/anaconda-overview
[20] https://cloud.google.com/appengine/
[21] https://www.visualstudio.com/vs/python/
[22] http://codespeak.net/pypy/dist/pypy/doc/home.html
[23] http://www.stackless.com
[24] http://www.jython.org/Project
[25] http://ironpython.net
• CPython: The most used Python version, so the terms CPython and Python
are used interchangeably. It is written mostly in C (with some modules written
in Python) and is the version that is available from the official Python Web
site (http://www.python.org).
• PyPy: A Python version made in Python. It was conceived to allow program-
mers to experiment with the language in a flexible way (to change Python
code without knowing C). It is mostly an experimental platform.
• Stackless: Another experimental Python implementation. Unlike PyPy, it does
not focus on flexibility; instead, it provides advanced features not available in
the “standard” Python version. This is done to overcome some design decisions
taken early in Python’s development history. Stackless allows custom-designed
Python applications to scale better than their CPython counterparts. This
implementation is used in the massively multiplayer online game EVE Online,
Civilization IV, Second Life, and Twisted.
• Jython: A Python version written in Java. It runs on a JVM (Java Virtual
Machine). One application of Jython is for developers to add the Jython libraries
to their Java system, allowing users to add functionality to the application. A
well-known 3D environment for learning programming (Alice[26]) uses Jython to
let users program their own scripts.
• IronPython: A Python version adapted by Microsoft to run on the .NET and
Mono platforms. .NET is a technology that competes with Java regarding
“write once, run anywhere.”
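To check which of these implementations is running a given script, the standard library offers platform.python_implementation() and sys.implementation. A minimal sketch follows; the output depends on the interpreter used, so none is shown.

```python
import platform
import sys

# Name of the running implementation, for example 'CPython' or 'PyPy'.
implementation = platform.python_implementation()

# Version of the Python language that this interpreter implements.
version = platform.python_version()

print(f"Running {implementation} {version}")
print(f"sys.implementation.name: {sys.implementation.name}")
```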
You may be wondering which one to use (or whether to just use the standard “plain
vanilla” Python). There is no single correct answer to this question, since it depends
on your needs, work habits, budget, and personal preferences. Personally, I tend
to use the standard Python on servers and Anaconda on the computers I use for
software development.
stand and reproduce. Helen Shen. Nature 515, 151–152 (06 November 2014)
doi:10.1038/515151a
https://goo.gl/HfBJ12