Python For Bioinformatics Cap 1


CHAPTER 1

Introduction
CONTENTS
1.1 Who Should Read This Book
1.1.1 What the Reader Should Already Know
1.2 Using this Book
1.2.1 Typographical Conventions
1.2.2 Python Versions
1.2.3 Code Style
1.2.4 Get the Most from This Book without Reading It All
1.2.5 Online Resources Related to This Book
1.3 Why Learn to Program?
1.4 Basic Programming Concepts
1.4.1 What Is a Program?
1.5 Why Python?
1.5.1 Main Features of Python
1.5.2 Comparing Python with Other Languages
Readability
Speed
1.5.3 How Is It Used?
1.5.4 Who Uses Python?
1.5.5 Flavors of Python
1.5.6 Special Python Distributions
1.6 Additional Resources

The most effective way to do it, is to do it.

Amelia Earhart

1.1 WHO SHOULD READ THIS BOOK


This book is for the life science researcher who wants to learn how to program.
Previous exposure to computer programming helps, but it is not necessary to
understand this book.
This book is designed to be useful to several separate but related audiences:
students, graduates, postdocs, and staff scientists, since all of them can benefit
from knowing how to program.


Exposing students to programming at early stages in their career helps to boost
their creativity and logical thinking, and both skills can be applied in research. In
order to ease the learning process for students, all subjects are introduced with the
minimal prerequisites. There are also questions at the end of each chapter. They
can be used for self-assessing how much you’ve learned. The answers are available
to teachers in a separate guide.
Graduates and staff scientists having actual programming needs should find its
several real-world examples and abundant reference material extremely valuable.

1.1.1 What the Reader Should Already Know


Since this book is called Python for Bioinformatics, it has been written with the
following assumptions in mind:

• No programming knowledge is assumed, but the reader is required to have
minimum computer proficiency: enough to use a text editor and handle basic
tasks in the operating system (OS). Since Python is multi-platform, most
instructions in this book apply to the most common operating systems
(Windows, macOS and Linux); when a command or a procedure
applies only to a specific OS, it will be clearly noted.

• The reader should be working (or at least planning to work) with bioinformatics
tools. Even small-scale manual jobs, such as using NCBI BLAST
to identify a sequence, aligning proteins, searching for primers, or estimating a
phylogenetic tree, make it easier to follow the examples. The more familiar the
reader is with bioinformatics, the better they will be able to apply the concepts
learned in this book.

1.2 USING THIS BOOK


1.2.1 Typographical Conventions
There are some typographical conventions I have tried to use in a uniform way
throughout the book. They should aid readability and were chosen to tell apart
user-made names (or variables) from language keywords. This comes in handy when
learning a new computer language.
Bold: Objects provided by Python and by third-party modules. With this no-
tation it should be clear that round is part of the language and not a user-defined
name. Bold is also used to highlight parts of the text. There is no way to confuse
one bold usage with the other.
Mono-spaced font: User declared variables, code, and filenames. For example:
sequence = 'MRVLLVALALLALAASATS'.
Italics: In commands, it is used to denote a variable that can take different
values. For example, in len(iterable), “iterable” can take different values. Used in
text, it marks a new word or concept. For example “One such fundamental data
structure is a dictionary.”
The content of lines starting with $ (dollar sign) is meant to be typed in your
operating system console (also called command prompt in Windows or terminal
in macOS).
↩ : Line break. Some lines are longer than the available space on a printed
page, so this symbol is inserted to mean that what is on the next line on the page
belongs to the same line on the computer screen. Inside code, the symbol used is
<=.

1.2.2 Python Versions


At the time of writing, the current version of Python is 3.6.1. Version 2.7.12 is
still maintained¹ because a sizable number of applications in production use
the 2.7 branch. Versions 3.x and 2.x differ enough to be incompatible with
each other. Python 3 is more efficient than Python 2 in many respects:
large websites such as Instagram migrated from Python 2.7 to Python 3.6 and
reduced CPU and memory consumption by up to 30%. This book uses Python 3.6.
The only scenario where you may need Python 2.7, apart from maintaining
old code, is when a specific library is not available for Python 3.
In that case, before starting a project in Python 2.7, search for a replacement
library. For example, suppose you want to connect to a MySQL database and you
are told to use MySQLdb. Since this package is not compatible with Python 3,
use mysqlclient or mysql-connector-python instead of falling back to
Python 2.7; both work with Python 3.
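
To make the incompatibility between the two branches more concrete, here is a
minimal sketch of two well-known differences. It is only an illustration and is
not part of the book repository; the helper name running_python3 is made up
for this example.

```python
import sys


def running_python3():
    """Return True when the interpreter is Python 3.x (hypothetical helper)."""
    return sys.version_info.major == 3


# In Python 3, print is a function and "/" is true division.
# In Python 2, print was a statement and 7 / 2 evaluated to 3.
half = 7 / 2          # true division in Python 3
floor_half = 7 // 2   # floor division, written the same way in both branches

if running_python3():
    print("Running Python", sys.version_info.major, "-> 7 / 2 =", half)
```

Code that relies on the old integer division, or that uses print as a
statement, has to be adapted before it runs under Python 3.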

1.2.3 Code Style


Python source code that appears in this book is presented as listings. Each line of
these listings is numbered. These numbers are not intended to be typed; they are
used to reference each line in the text. You don’t need to copy the code from the
book, since it can be downloaded from the GitHub repository at
https://github.com/Serulab/Py4Bio.
Code can be formatted in several ways and still be valid to the Python interpreter.
The following code is syntactically correct:

def GetAverage(X):
    avG=sum(X)/len(X)
    " Calculate the average "
    return avG

Also this one:


¹ Python 2.7.x has an end-of-life date in 2020. There will be no Python 2.8. For more information
see https://www.python.org/dev/peps/pep-0373/.

def get_average(items):
    """ Calculate the average
    """
    average = sum(items) / len(items)
    return average

The latter code sample follows the most widely accepted coding style for Python.²
Throughout the book you will find mostly code formatted as the second sample.
Some code in the book will not follow accepted coding styles for the following
reasons:

• There are some instances where the most didactic way to show a particular
piece of code conflicts with the style guide. On those few occasions, I choose
to deviate from the style guide in favor of clarity.

• Due to size limitation in a printed book, some names were shortened and
other minor drifts from the coding styles have been introduced.

• To show that there is more than one way to write the same code. Coding
style is a guideline, and enforcement is not made at a language level, so some
programmers don’t follow it thoroughly. You should be able to read “bad”
code, since sooner or later you will have to read other people’s code.
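
As an illustration of why style matters beyond looks, the sketch below puts the
two listings from this section side by side. Both return the same value, but
only in the second one is the string picked up as documentation, because a
docstring must be the first statement of the function. The list named values
is an assumption made for this example.

```python
def GetAverage(X):
    avG=sum(X)/len(X)
    " Calculate the average "
    return avG


def get_average(items):
    """ Calculate the average
    """
    average = sum(items) / len(items)
    return average


values = [2, 4, 6]  # made-up data for the comparison

# Both functions compute the same result.
print(GetAverage(values) == get_average(values))

# The misplaced string in GetAverage is not a docstring.
print(GetAverage.__doc__)
print(get_average.__doc__.strip())
```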

1.2.4 Get the Most from This Book without Reading It All
• If you want to learn how to program, read the first section, from Chapter
1 to Chapter 8. The Regular Expressions (REGEX) chapter (Chapter 13) can
be skipped if you don’t need to deal with REGEX.

• If you know Python and just want to know about Biopython, first read
Chapter 9 (from page 158 to page 209). It is about Biopython modules and
functions. Then try to follow the programs found in Section III (from page 315
to page 369).

• There are three appendixes that can be read independently. Appendix
A (Collaborative Development: Version Control with GitHub) reproduces a
paper called “A Quick Introduction to Version Control with Git and GitHub.”
Appendix B shows how to install a web application using Python Anywhere.
Appendix C is reference material that can be used as a cheat sheet when
you need a quick answer without having to read a chapter.

² The official Python style guide is located at https://www.python.org/dev/peps/pep-0008,
and an easier-to-read style guide is located at http://docs.python-guide.org/en/latest/writing/style.

1.2.5 Online Resources Related to This Book


The book website is at http://py3.us. On this site you will find errata, a mailing
list to keep you updated about Python, and links to source code repositories.
The official source code repository of this book is on GitHub
(https://github.com/Serulab/Py4Bio). From there you can inspect online
or download all the code used in this book. To download all the scripts, press
the green “Clone or download” button. If you have Git installed on
your machine (and know how to use it³), clone the repository using this
address: git@github.com:Serulab/Py4Bio.git. Another alternative is to click on
“Download ZIP”. Once you have the repository on your machine, go to the code
folder, which contains a set of folders, one per chapter, each holding the scripts
for that chapter. Each script in the book has a name that corresponds to its
filename. Another folder, called notebooks, contains Jupyter notebooks
that can be run locally. For more information on how to run a Jupyter
notebook, please see
http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html.
Another online resource is the set of Jupyter Notebooks available on the Microsoft
Azure Notebooks website (https://notebooks.azure.com/py4bio/libraries/py3.us).
The same notebooks that are in the book repository can be used online on this site.

1.3 WHY LEARN TO PROGRAM?


Many of the tasks that a researcher performs with his or her computer are repetitive:
collecting data from a Web page, converting files from one format to another, running
or interpreting hundreds of BLAST results, designing primers, looking for restriction
enzymes, etc. In many cases it is evident that these are tasks that can be performed
by a computer, with less effort on our part and without the possibility of errors caused
by tiredness or distractions.
An important consideration when you’re evaluating whether or not to create a
program is the apparent time lost in the definition and formulation of the problem,
implementing it with code, and then debugging it (correcting errors). It is incorrect
to consider problem definition and evaluation as a waste of time. It is generally at
this precise point in the process where we understand thoroughly the problem that
we face. It is common that during the attempt to formulate a problem, we realize
that many of our initial assumptions were mistaken. It also helps us to detect when
it is necessary to restart the planning process. When this happens, it is better that
it happens at the planning stage than when we are in the middle of the project. In
these cases, the planning of the program represents time saved. Another advantage
to take into account is that the time that is invested to create a program once is
compensated by the speed with which the tasks are performed every time we run
it.
³ Appendix A contains a tutorial on how to use GitHub.

Not only can a program automate the procedures that we do manually, but it can
also do things that would otherwise not be possible.
Sometimes it is not very clear if a particular task can be done by a program.
Reading a book such as this one (including the examples) will help you identify
which tasks are feasible to automate with software and which ones are better done
manually.

1.4 BASIC PROGRAMMING CONCEPTS


Before installing Python, let’s review some programming fundamentals. If you have
some previous programming experience, you may want to skip this section and jump
straight to Chapter 2 “Installing Python.” This section introduces basic concepts
such as instructions, data types, variables, and some other related terminology that
is used throughout this book.

1.4.1 What Is a Program?


Computers only know what you tell them. The way to tell them to do something
is through a program. A program is a set of ordered instructions designed to command
the computer to do something. The word “ordered” is there because it is not enough
to declare what to do; the order in which the instructions are carried out must also
be stated.⁴
A program is often characterized as a recipe. A typical recipe consists of a
list of ingredients followed by step-by-step instructions on how to prepare a dish.
This analogy is reflected in several programming websites and tutorials with the
words “recipe” and “cookbook.” A laboratory protocol is another useful analogy. A
protocol is defined as a “predefined written procedural method in the design and
implementation of experiments.”
Here is a typical protocol, followed almost every day in several molecular labo-
ratories:

Listing 1.1: Protocol for Lambda DNA digestion

Restriction Digestion of Lambda DNA

Materials

5.0 mcL Lambda DNA (0.1 g/L)
2.5 mcL 10x buffer
16.5 mcL H2O
1.0 mcL EcoRI

⁴ There are declarative languages that state what the program should accomplish, rather than
describing how to accomplish it. Most computer languages (Python included) are imperative instead
of declarative.

Procedure

Incubate the reagents at 37°C for 1 hr.
Add 2.5 mcL loading dye and incubate for another 15 minutes.
Load 20 mcL of the digestion mixture onto a minigel

There are at least two components of a protocol: materials or ingredients, and
procedures. A procedure gives specific orders such as incubate, add, mix, load, and
many others. The same goes for a computer program. The programmer gives specific
orders to the computer: print, read, write, add, multiply, round, and others.
While protocol procedures correlate with program instructions, materials are
the data. In protocols, procedures are applied to materials: Mix 2.5 µL of buffer
with 5 µL of Lambda DNA and 16.5 µL of H2O, load 20 µL onto a minigel. In a
program, instructions are applied to data: print the text string “Hello”, add two
integer numbers, round a float number.
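
The analogy can be sketched in Python. In this minimal example, which is only
an illustration and not part of the protocol itself, the materials of the
digestion are stored as data and one instruction (adding) is applied to them.
The names materials and total_volume are assumptions made for this sketch.

```python
# Materials from the Lambda DNA digestion protocol: name -> volume in microliters.
materials = {
    "Lambda DNA": 5.0,
    "10x buffer": 2.5,
    "H2O": 16.5,
    "EcoRI": 1.0,
}


def total_volume(volumes):
    """Add up the volumes: an instruction applied to data."""
    return sum(volumes.values())


digestion_volume = total_volume(materials)
final_volume = digestion_volume + 2.5  # after adding the loading dye

print("Digestion mixture:", digestion_volume, "microliters")
print("After loading dye:", final_volume, "microliters")
```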
Just as a protocol can be written in different languages (like English, Spanish, or
French), there are different languages to program a computer. In science, English is
the de facto language. Due to historical, commercial and practical reasons, there is
no such equivalent in computer science. There are several languages, each with its
own strengths and weaknesses. For reasons that will make sense shortly, Python
was the computer language chosen for this book.
Let’s see a simple Python program:

Listing 1.2: sample.py: Sample Python Program

1 seq_1 = 'Hello,'
2 seq_2 = ' you!'
3 total = seq_1 + seq_2
4 seq_size = len(total)
5 print(seq_size)

Note: The numbers at the beginning of each line are for reference only;
they are not meant to be typed.
This small program can be read as “Name the string Hello, as seq_1. Name
the string you! as seq_2. Add the strings named seq_1 and seq_2 and call the
result total. Get the length of the string called total and name this value
seq_size. Print the value of seq_size.” This program prints the number 11.
As shown, there are different types of data (often called “data types” or just
“types”). Numbers (integers or floats), text strings, and other data types are covered
in Chapter 3. In print(seq_size), the instruction is print and seq_size is the
name of the data. Data is often represented as variables. A variable is a name
that stands for a value that may vary during program execution. With variables,
a programmer can represent a generic command like “round n” instead of “round
2.9.” This way the programmer can take into account a non-fixed (hence variable)
value. When
the program is executed, “n” should take a specific value since there is no way to
“round n.” This can be done by assigning a value to a variable or by binding a name
to a value.⁵ The difference between “assign a value to a variable” and “bind a name
to a value” is explained in detail in Chapter 3 (from page 64). In any case, it is
expressed as:

variable_name = value

Note that this is not an equality as seen in mathematics. In an equality,
terms can be interchanged, but in programming, the term on the right (value)
takes the name of the term on the left (variable_name). For example,

seq_1 = 'Hello'

After this assignment, the variable seq_1 can be used, like,

print(seq_1)

This is translated as “print out the value called seq_1”. This command prints
“Hello” because this is the value of this variable.
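
The “round n” idea from this section can be tried directly. The following is a
small sketch, with values chosen only for illustration, that binds the name n
to a value, rounds it, and then binds the same name to a different value.

```python
n = 2.9              # bind the name n to a float value
rounded = round(n)   # the generic command "round n" now has a concrete value
print(rounded)

n = 7.2              # the same name can be bound to another value later
print(round(n))

# Each value has a data type, whatever name it is bound to.
seq_1 = 'Hello'
print(type(seq_1).__name__, type(rounded).__name__, type(n).__name__)
```

The last line shows the type names of a text string, an integer, and a float,
the data types mentioned above.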

1.5 WHY PYTHON?


Let’s have a look at some Python features worth pointing out.

1.5.1 Main Features of Python


• Readability: When we talk about readability, we refer as much to the original
programmer as any other person interested in understanding the code. It is
not an uncommon occurrence for someone to write some code then return
to it a month later and find it difficult to understand. Sometimes Python is
called a “human-readable language.”

• Built-in features: Python comes with “batteries included.” It has a rich and
versatile standard library that is immediately available, without the user having
to download separate packages. With Python you can, in a few lines, read
and write XML and JSON files, parse and generate email messages, extract
files from a zip archive, open a URL as if it were a file, and do many other
things that in other languages would require a third-party library.

• Availability of third-party modules for a broad spectrum of activities: data
visualization⁶ and plotting, PDF generation, bioinformatics analysis,⁷ image
⁵ In Python the latter form is used.
⁶ MatPlotLib (http://matplotlib.org/) and Bokeh (http://bokeh.pydata.org/en/latest/)
are the most used.
⁷ Biopython library to make your own bioinformatics applications (http://biopython.org/).

processing,⁸ machine learning,⁹ game development, interface with popular
databases,¹⁰ and application software are only a handful of examples of modules
that can be installed to extend Python functionality.

• High-level built-in data structures: Dictionaries, sets, lists, tuples, and others.
These are very useful to model real-world data. Third-party modules such as
NumPy and SciPy can also extend the structures to kd-trees, n-dimensional
arrays, matrix operations, time-series, image objects, and more.

• Multiparadigm: Python can be used as a “classic” procedural language or as
a “modern” object-oriented programming (OOP) language. Most programmers
start writing code in a procedural way and move to OOP when they need to.
Python doesn’t force programmers to write OOP code when they
just want to write a simple script.

• Extensibility: If the built-in methods and available third-party modules are
not enough for your needs, you can easily extend Python, even in other programming
languages. Some applications are written mostly in Python
but have a processor-demanding routine in C or FORTRAN. Python can also
be extended by connecting it to specialized high-level languages like R or
MATLAB.¹¹

• Open source: Python has a liberal open source license that makes it freely
usable and distributable, even for commercial use.

• Cross platform: A program made in Python can run on any computer
that has a Python interpreter. This way, a program made under Windows 10
can run unmodified on Linux or macOS. Python interpreters are available for
most computers and operating systems, and even for some devices with embedded
computers like the Raspberry Pi.

• Thriving community: Python is nowadays the programming language to use
for scientists and researchers.¹² This translates into more libraries for your
projects and people you can go to for support.
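
Two of the features listed above, the built-in data structures and the
“batteries included” standard library, can be seen together in a short sketch.
The DNA string and the file name codons.json are made up for this example;
everything imported comes with Python, so nothing has to be installed.

```python
import io
import json
import zipfile
from collections import Counter

sequence = "ATGGCGTAAATG"  # made-up DNA string for illustration

# Built-in data structures: a list of codons, a set of distinct codons,
# and a dictionary-like Counter with the number of times each one occurs.
codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
distinct_codons = set(codons)
codon_counts = Counter(codons)

# Standard library: serialize the counts as JSON, store them in a zip
# archive held in memory, then read them back.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("codons.json", json.dumps(codon_counts))
with zipfile.ZipFile(buffer) as archive:
    restored = json.loads(archive.read("codons.json").decode("utf-8"))

print(codons)
print(restored)
```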

1.5.2 Comparing Python with Other Languages


You may be wondering why you should use Python, and not more well-known
languages like Java, PHP, or C++. It is a good question. A programming language
⁸ Scikit-image paper: http://peerj.com/articles/453
⁹ scikit-learn website: http://scikit-learn.org/stable/
¹⁰ https://wiki.python.org/moin/DatabaseProgramming
¹¹ MATLAB® is a registered trademark of The MathWorks, Inc. For product information please
contact: The MathWorks, Inc. 3 Apple Hill Drive Natick, MA, 01760-2098 USA. Tel: 508-647-7000.
Fax: 508-647-7001. E-mail: info@mathworks.com. Web: www.mathworks.com.
¹² http://www.nature.com/news/programming-pick-up-python-1.16833

can be regarded as a tool, and choosing the best tool for the job makes a lot of
sense.

Readability
Nonprofessional programmers tend to value the learning curve as much as the leg-
ibility of the code (both aspects are tightly related).
A simple “hello world” program in Python looks like this:

print("Hello world!")

Compare it with the equivalent code in Java:

public class Hello
{
    public static void main(String[] args) {
        System.out.printf("Hello world!");
    }
}

Let’s see a code sample in C language. The following program reads a file
(input.txt) and copies its contents into another file (output.txt):

#include <stdio.h>
int main(int argc, char **argv) {
    FILE *in, *out;
    int c;
    in = fopen("input.txt", "r");
    out = fopen("output.txt", "w");
    while ((c = fgetc(in)) != EOF) {
        fputc(c, out);
    }
    fclose(out);
    fclose(in);
}

The same program in Python is shorter and easier to read:

with open("input.txt") as input_file:
    with open("output.txt", "w") as output_file:
        output_file.writelines(input_file)
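
For a plain copy like this one, the standard library already offers a ready-made
function, shutil.copyfile. The sketch below is an alternative to the listing
above rather than the book's own version; it works inside a temporary directory
and writes a made-up input file so that it can be run anywhere.

```python
import os
import shutil
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "input.txt")
    target = os.path.join(workdir, "output.txt")
    with open(source, "w") as handle:
        handle.write(">seq1\nATGGCG\n")  # made-up content for the example

    shutil.copyfile(source, target)  # one call does the whole copy

    with open(target) as handle:
        copied = handle.read()

print(copied)
```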

Let’s see a Perl program that calculates the average of a series of numbers:

sub avg(@_) {
    $sum += $_ foreach @_;
    return $sum / @_ unless @_ == 0;
    return 0;
}
print avg((1..5))."\n";

The equivalent program in Python is:

def avg(data):
    if len(data)==0:
        return 0
    else:
        return sum(data)/len(data)
print(avg([1,2,3,4,5]))

The purpose of this Python program could be almost fully understood by just
knowing English.
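
The average can also be delegated to the statistics module of the standard
library. This is a sketch of an alternative, not a replacement for the listing
above; it keeps the same convention of returning 0 for an empty sequence.

```python
from statistics import mean


def avg(data):
    """Return the mean of data, or 0 for an empty sequence."""
    return mean(data) if data else 0


print(avg([1, 2, 3, 4, 5]))
print(avg([]))
```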
Python is designed to be a highly readable language.¹³ The use of English keywords,
and the use of spaces to limit code blocks and its internal logic (indentation),
contribute to this end. It’s possible to write hard-to-read code in Python, but it
requires a deliberate effort to obfuscate the code.¹⁴

Speed
Another criterion to consider when choosing a programming language is execution
speed. In the early days of computer programming, computers were so slow that
some differences due to language implementation were very significant. It could take
a week for a program to be executed in an interpreted language, while the same
code in a compiled language could be executed in a day. This performance difference
between interpreted and compiled languages still has the same proportion, but it
is less relevant. This is because a program that took a week to run, now takes less
than ten seconds, while the compiled one takes about one second. Although the
difference seems important (at least one order of magnitude), it is not so relevant
if we consider the time it takes to develop it.
This does not mean that execution speed does not need to be considered. A 10X
speed difference can be crucial in some high-performance computing operations.
Sometimes a lot of improvements can be achieved by writing optimized code. If the
code is written with speed optimization in mind, it is possible to obtain results quite
¹³ Other languages are regarded as “write only,” since code written in them is very difficult
to understand afterwards.
¹⁴ A simple print('Hello World') program could be written, if you are so inclined, as
print(''.join([chr((L>=65 and L<=122) and (((((L>=97) and (L-96) or (L-64))-1)+13)%26+((L>=97) and 97 or 65)) or L) for L in [ord(C) for C in 'Uryyb Jbeyq!']]))
(https://goo.gl/r5sm9j).

similar to the ones that could be obtained in a compiled language. In the cases where
the programmer is not satisfied with the speed obtained by Python, it is possible
to link to an external library written in another language (like C or Fortran). This
way, we can get the best of both worlds: the ease of Python programming with the
speed of a compiled language.
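
A small experiment illustrates the point about relying on compiled code. In
CPython the built-in sum function is implemented in C, so it can be compared
with an explicit loop written in Python. This is only a sketch: the function
name slow_sum and the size of the list are arbitrary choices, and the timings
depend on the machine, so no figures are quoted here.

```python
import timeit


def slow_sum(numbers):
    """Add numbers one by one in an explicit Python loop."""
    total = 0
    for number in numbers:
        total += number
    return total


numbers = list(range(100000))

# Run each version 20 times and report the total elapsed time in seconds.
loop_time = timeit.timeit(lambda: slow_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print("explicit loop:", loop_time)
print("built-in sum: ", builtin_time)
```

Both versions return the same result; only the time needed to get it differs.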

1.5.3 How Is It Used?


Python has a wide range of applications. From cell phones to web servers, there
are thousands of Python applications in the most diverse fields. There is Python
code powering Wikipedia robots, helping design next generation special effects at
Industrial Light & Magic,¹⁵ embedded in D-Link modems and routers,¹⁶ and it is
the scripting language of the OpenOffice suite.¹⁷
Some languages are strong in one niche (like PHP for web applications, Java for
desktop programs), but Python can’t be typecast easily.
With a single code base, Python desktop applications run with a native look
and feel on multiple platforms. Well-known examples in this category include the
BitTorrent p2p client/server, Calibre (an ebook manager), Sage Math (a mathematics
software system), the Dropbox client, and more.
As a language for building web applications, Python can be found in high-traffic
sites like Reddit, National Geographic, Instagram, and NASA. There is specialized
software for building web sites in Python (called web frameworks), such as Django,
Web2Py, Pyramid, Flask, and Bottle.
From system administration to data analysis, Python provides a broad range of
tools to this end:

• Generic Operating System Services (os, io, time, curses)

• File and Directory Access (os.path, glob, tempfile, shutil)

• Data Compression and Archiving (zipfile, gzip, bz2)

• Interprocess Communication and Networking (subprocess, socket, ssl)

• Internet (email, cgi, urllib)

• String Services (string, re, codecs, unicodedata)
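
Several of the modules in this list can be combined in a few lines. The sketch
below, which uses a made-up FASTA text and a made-up file name, writes a
compressed file in a temporary directory, reads it back, and extracts the
sequence names with a regular expression.

```python
import gzip
import os
import re
import tempfile

records = ">seq1\nATGGCGTAA\n>seq2\nTTGACA\n"  # made-up FASTA text

with tempfile.TemporaryDirectory() as workdir:       # tempfile
    path = os.path.join(workdir, "sample.fasta.gz")  # os.path
    with gzip.open(path, "wt") as handle:            # gzip, text mode
        handle.write(records)
    with gzip.open(path, "rt") as handle:
        text = handle.read()
    size = os.path.getsize(path)

# re: sequence names are the words that follow ">" at the start of a line.
names = re.findall(r"^>(\S+)", text, flags=re.MULTILINE)
print(names, size)
```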

Python is gaining momentum as the default computer language for the scientific
community. There are several software collections oriented toward scientific users,
such as SciPy¹⁸ and Anaconda.¹⁹ Both integrate modules for linear algebra,
¹⁵ https://www.python.org/about/success/ilm/
¹⁶ https://www.python.org/about/success/dlink/
¹⁷ http://wiki.services.openoffice.org/wiki/Python
¹⁸ https://www.scipy.org
¹⁹ https://www.continuum.io/anaconda-overview

signal processing, optimization, statistics, genetic algorithms, interpolation, ODE
solvers, special functions, etc.
Python has support for parallel programming with pyMPI and 2D/3D scientific
data plotting.
Python is known to be used in wide and diverse fields like engineering, electron-
ics, astronomy, biology, paleomagnetism, geography, and many more.

1.5.4 Who Uses Python?


Python is used by several companies, from small and unknown shops up to big
players in their fields like Google, National Geographic, Disney, NASA, NYSE, and
many more.
It is one of the four “official languages” of Google, along with Java, C++ and Go.
They have web sites made in Python, stand-alone programs and even hosting
solutions.²⁰ As a confirmation that Google is taking Python seriously, in December
2005 they hired Guido van Rossum, the creator of Python. It may not be Google’s
main language, but this shows that they are a strong supporter of it.
Even Microsoft, a company not known for its support of open source programs,
developed a version of Python that runs on its “.Net” platform (IronPython)
and also developed the Python Tools for Visual Studio,²¹ a free, open source
plugin that turns Visual Studio into a Python IDE.
Many well-known Linux distributions already use Python in their key tools.
Ubuntu Linux “prefers the community to contribute work in Python.” Python is so
tightly integrated into Linux that some distributions won’t run without a working
copy of Python.

1.5.5 Flavors of Python


Although in this book I refer to Python as a programming language, Python is
actually a language definition. What we use most of the time is a specific imple-
mentation, CPython, that is the Python language definition implemented in C.
Since this implementation is the most used, we simply refer to the CPython
implementation as Python.
The most relevant Python implementations are: CPython, PyPy,²² Stackless,²³
Jython²⁴ and IronPython.²⁵ This book will focus on the standard Python version
(CPython), but it is worth knowing about the different versions.

• CPython: The most used Python version, so the terms CPython and Python
are used interchangeably. It is made mostly in C (with some modules made
²⁰ https://cloud.google.com/appengine/
²¹ https://www.visualstudio.com/vs/python/
²² http://codespeak.net/pypy/dist/pypy/doc/home.html
²³ http://www.stackless.com
²⁴ http://www.jython.org/Project
²⁵ http://ironpython.net

in Python) and is the version that is available from the official Python Web
site (http://www.python.org).
• PyPy: A Python version made in Python. It was conceived to allow program-
mers to experiment with the language in a flexible way (to change Python
code without knowing C). It is mostly an experimental platform.
• Stackless: Another experimental Python implementation. Rather than focusing
on flexibility like PyPy, it provides advanced features not available in the
“standard” Python version. This is done in
order to overcome some design decisions taken early in Python development
history. Stackless allows custom-designed Python applications to scale better
than their CPython counterparts. This implementation is being used in the EVE
Online massively multi-player online game, Civilization IV, Second Life, and
Twisted.
• Jython: A Python implementation written in Java that runs on a JVM (Java Virtual
Machine). One application of Jython is to embed its libraries in a Java system so
that users can add functionality to the application. Alice,26 a well-known
environment for learning 3D programming, uses Jython to let users write their own
scripts.
• IronPython: A Python implementation adapted by Microsoft to run on the .NET and
Mono platforms. .NET is a technology that competes with Java on the promise of
“write once, run anywhere.”
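
Because all of these are implementations of the same language definition, a
program can ask at run time which one it is running on. The following is a minimal
sketch that uses only the standard library; the function name describe_interpreter
is made up for this illustration and is not part of any library.

```python
import platform
import sys


def describe_interpreter():
    """Return basic facts about the Python implementation running this code."""
    return {
        # For example "CPython", "PyPy", "Jython" or "IronPython"
        "implementation": platform.python_implementation(),
        # Version of the Python language this interpreter implements
        "language_version": platform.python_version(),
        # Lowercase identifier supplied by the interpreter itself
        "identifier": sys.implementation.name,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

On the standard interpreter downloaded from the official Python Web site, the
implementation field is expected to read CPython; other implementations report
their own names.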

1.5.6 Special Python Distributions


Apart from Python implementations, there are some special adaptations of the
original CPython that are packaged for specific purposes. They are called Python
bundles or distributions. Most of them bring third-party software to the table,
such as editors, visualization modules, and the Jupyter Notebook, a web application
that allows you to create and share documents that contain live code, equations,
visualizations, and explanatory text. Here is a list of the most useful
distributions:27

• ActivePython:28 Aimed at enterprise users. ActiveState provides a precompiled,
supported, quality-assured Python distribution that makes it easy for corporations
to comply with policy requirements to have supported open source products. From a
technical standpoint, it offers all modern Python versions with the most used
external modules preinstalled. It also has its own package manager and external
module repository (PyPM29).
26 Alice is available for free at http://www.alice.org.
27 For a complete list of Python implementations and distributions see
https://www.python.org/download/alternatives
28 http://www.activestate.com/activepython
29 https://code.activestate.com/pypm/

• Enthought Canopy:30 Another all-in-one Python solution. It includes over 450
core scientific, analytic, and Python packages, such as NumPy, SciPy, IPython, 2D
and 3D visualization, database adapters, and others. It also includes a code editor
with Jupyter Notebook support. Its add-ons include a graphical package manager
that notifies you of updates, installs with one click, and helps you roll back
package versions. Everything is available as a single-click installer for the
three major operating systems. This bundle is suitable for scientific users, and
it is made by the same people who made NumPy and SciPy. Several licenses are
available, including a free academic one and various paid commercial enterprise
licenses.

• WinPython:31 It defines itself as a free, open-source, portable distribution of
the Python programming language for Windows 7/8/10, aimed at scientific and
educational usage. It includes packages suitable for scientists, data scientists,
and education (NumPy, SciPy, SymPy, Matplotlib, pandas, pyqtgraph, etc.). It uses
Spyder (Scientific PYthon Development EnviRonment) as the default editor, and it is
portable in the sense that the user can move the WinPython directory and all
settings are kept. You can have multiple copies of isolated and self-consistent
WinPython installations.

• Anaconda:32 A Python and R distribution for scientific computing. It includes
over 720 packages for data preparation, data analysis, data visualization, machine
learning, and interactive data science. It shares its objective and target users
with Enthought Canopy, and it also comes with Spyder as the default code editor.
Several products differentiate it from other Python distributions, such as
Repository, Accelerate, Scale, Mosaic, Notebooks, and Fusion. Most of these
services are available only with the more expensive subscriptions, but even without
them you still get an excellent all-in-one Python distribution. Continuum, the
company behind Anaconda, is an institutional partner of Project Jupyter, which
means that it supports the development of Jupyter Notebook, a web application for
running Python code in a browser.

You may be wondering which one to use (or whether to just use the standard “plain
vanilla” Python). There is no single correct answer to this question, since it
depends on your needs, work habits, budget, and personal preferences. Personally, I
tend to use the standard Python on servers and Anaconda on the computers I use for
software development.
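
Whichever option you choose, you can check which of the scientific packages named
above are actually present in a given installation. The sketch below is only an
illustration: the helper installed_packages and the list SCIENTIFIC_PACKAGES are
hypothetical names, and the list is a sample of the packages mentioned in this
section, not the full contents of any distribution.

```python
import importlib.util

# Import names of some packages mentioned in this section (sample only).
SCIENTIFIC_PACKAGES = ["numpy", "scipy", "sympy", "matplotlib", "pandas", "IPython"]


def installed_packages(names):
    """Map each top-level package name to True if it can be imported here."""
    # find_spec() locates a package without importing it and returns None
    # when a top-level name cannot be found.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in installed_packages(SCIENTIFIC_PACKAGES).items():
        print(f"{name:12} {'installed' if present else 'missing'}")
```

A fresh installation of the standard Python will typically report these packages
as missing until you install them yourself, whereas a scientific distribution is
likely to report most of them as installed.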

30 https://www.enthought.com/products/canopy/
31 http://winpython.github.io/
32 https://www.continuum.io/anaconda-overview

1.6 ADDITIONAL RESOURCES


• Interactive notebooks: Sharing the code. The free IPython notebook makes data
analysis easier to record, understand and reproduce. Helen Shen. Nature 515,
151–152 (06 November 2014) doi:10.1038/515151a
https://goo.gl/HfBJ12

• Python for feature film:
http://dgovil.com/blog/2016/11/30/python-for-feature-film/

• Alternative Python implementations:
https://www.python.org/download/alternatives/

• IPython: an interactive computing environment.
http://ipython.org/

• bpython: A fancy interface to the Python interpreter for Unix-like operating
systems:
https://www.bpython-interpreter.org

• Python history, a blog by Guido van Rossum:
http://python-history.blogspot.com
