Python Course in Bioinformatics
All content following this page was uploaded by Catherine Letondal on 13 October 2014.
The picture above represents the 3D structure of the Human Ferroxidase [http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-id+4SU6q1IomZ3+-e+[SWALL:'CERU_HUMAN']] protein, which we use in some of the exercises
in this course.
This course is designed for biologists who already have some programming knowledge in other languages
such as Perl or C. For this reason, while it presents a substantial introduction to the Python language,
it is not an introduction to programming itself (unlike [Tis2001] or our course in informatics
for biology [http://www.pasteur.fr/formation/infobio/infobio-en.html], which has an online programming course
[http://www.pasteur.fr/formation/infobio/python/] in Python). What distinguishes this course from general
introductory Python courses is its focus on biological examples, which are used throughout the course,
and its suggested exercises drawn from the field of biology. The second half of the course describes the
Biopython (http://www.biopython.org/) set of modules. This course can be considered a complement to the
Biopython tutorial, to which it often refers, and adds practical exercises using these components.
2. from a file:
If file mydna.py contains:
#! /local/bin/python
dna = 'gcatgacgttattacgactctgtcacgccgcggtgcgactgaggcgtggcgtctgctggg'
print dna
caroline:~> ./mydna.py
gcatgacgttattacgactctgtcacgccgcggtgcgactgaggcgtggcgtctgctggg
caroline:~> python
Python 2.2.1c1 (#1, Mar 27 2002, 13:20:02)
[GCC 2.95.4 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> execfile('mydna.py')
gcatgacgttattacgactctgtcacgccgcggtgcgactgaggcgtggcgtctgctggg
or to load a file from the command line before entering Python in interactive mode (-i):
Chapter 1. General introduction
this is very convenient when your Python file contains definitions (functions, classes,...) that you want to test
interactively.
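#include <Python.h>
/* the function header is missing from the extracted listing;
   a standard main is assumed here */
int main(void)
{
    Py_Initialize();
    PyRun_SimpleString("dna = 'atgagag' + 'tagagga'");
    PyRun_SimpleString("print 'Dna is:', dna");
    return 0;
}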
1.2. Documentation
1.2.1. General information
General information about Python and Biopython can be found:
• in the Python tutorial [http://www.python.org/doc/2.2.1/tut/tut.html] written by Guido van Rossum, the author
of the Python language.
• in “The Python - Essential Reference” book ([Beaz2001]) - a compact but understandable reference guide
The pydoc command and the help() function, when given a string argument, search the PYTHONPATH for an
object of this name. The help() function can also be applied directly to an object.
>>> help(ambiguous_dna_alphabet)
Help on function ambiguous_dna_alphabet in module __main__:
ambiguous_dna_alphabet()
returns a string containing all ambiguous dna bases
• by the function dir(obj) which displays the names defined in the local namespace (see Section 4.3.1) of
the object obj. If no argument is specified dir shows the definitions of the current module.
>>> dir()
['__builtins__', '__doc__', '__name__']
>>> dir()
['__builtins__', '__doc__', '__name__', 'dna']
>>> ambiguous_dna_alphabet.__doc__
' returns a string containing all ambiguous dna bases '
>>> help(ambiguous_dna_alphabet)
Help on function ambiguous_dna_alphabet in module __main__:
ambiguous_dna_alphabet()
returns a string containing all ambiguous dna bases
If a string is enclosed by triple quotes or triple double-quotes it can span several lines and the line-feed characters
are retained in the string.
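A minimal sketch of this behaviour. It is written in current Python 3 syntax rather than the Python 2.2 of the course sessions, and the sequence fragment is made up for illustration:

```python
# A triple-quoted string may span several lines; the line feeds are kept.
dna = """gcatgacgtt
attacgactc"""

n_linefeeds = dna.count("\n")        # line feeds stored inside the string
one_line = dna.replace("\n", "")     # strip them to get the bare sequence

print(repr(dna))
print(one_line, len(one_line))
```

Removing the line feeds is usually the first step before a multi-line sequence is analysed.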
Within this Emacs mode, from the "Python" menu, you can start an interactive interpreter session or (re)execute
the Python buffer, or function and class definitions.
Important
The python-mode is very useful because it resolves indentation problems occurring if tab and space
characters are mixed (see Section 3.1.2).
Caution
You can copy and paste a block of correctly indented code into an interactive interpreter session, but take care
that the block does not contain empty lines.
Chapter 2. Introduction to basic types in Python
>>> len(dna)
103
>>> 'n' in dna
0
>>> count(dna, 'a')
10
>>> replace(dna, 'a', 'A')
'gcAtgAcgttAttAcgActctgtcAcgccgcggtgcgActgAggcgtggcgtctgctgggcctttActt
cgcctccgcgccctgcAttccgttcctggcctcg'
Go to
See Section 6.2 and work on Section 6.3 before you continue here.
>>> EcoRI[0]
'g'
>>> EcoRI[-1] ❶
'c'
>>> EcoRI[1:3]
'aa'
>>> EcoRI[3:]
'ttc'
>>> EcoRI[:]
'gaattc'
>>> EcoRI[:-1]
'gaatt'
>>> EcoRI[1:100] ❷
'aattc'
>>> EcoRI[3:1] ❷
''
>>> EcoRI[100:101] ❷
''
❶ Negative indices access strings from the end.
❷ Caution
If the start or the end of a slice is out of range, it is clipped to the bounds of the string. The result is empty if
both lie beyond the end of the string or if they are incompatible with each other.
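The slicing rules above can be checked with a short sketch, written in current Python 3 syntax. The variable names other than EcoRI are chosen here for illustration:

```python
EcoRI = "gaattc"

last_base = EcoRI[-1]      # negative index: counted from the end
clipped = EcoRI[1:100]     # end beyond the string: clipped to the last character
empty_1 = EcoRI[3:1]       # start after end: empty string
empty_2 = EcoRI[100:101]   # both bounds beyond the end: empty string

print(last_base, clipped, repr(empty_1), repr(empty_2))
```

Unlike indexing with a single out-of-range position, slicing never raises an error.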
Caution
Positive numbering starts with 0 but negative numbering with -1.
874
❶ If no match is found, find returns -1 whereas index raises an exception (for more explanations on exceptions,
see Chapter 8).
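The difference between the two methods can be sketched as follows, in current Python 3 syntax. The sequence and the variable names are made up for illustration:

```python
dna = "gcatgacgttattacgactc"

pos_found = dna.find("gtt")      # position of the first match
pos_missing = dna.find("nnn")    # -1 signals "no match"

try:
    dna.index("nnn")             # index raises instead of returning -1
    index_failed = False
except ValueError:
    index_failed = True

print(pos_found, pos_missing, index_failed)
```

Because -1 is a true value, the result of find has to be compared with -1 explicitly rather than tested directly in an if statement.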
Go to
Work on the exercises in Section 5.2 to answer this question.
2.2. Lists
Lists are arbitrary collections of objects that can be nested. They are created by enclosing the comma-separated
items in square brackets. Like strings, they can be indexed and sliced, but unlike strings, they can also be
modified.
>>> digest ❸
['gaattc']
>>> digest.append(EcoRI)
>>> digest
['gaattc', 'ggatcc', 'aagctt', 'gaattc']
>>> digest.pop()
'gaattc'
>>> digest
['gaattc', 'ggatcc', 'aagctt']
>>> digest.reverse()
>>> digest
['aagctt', 'ttcgaa', 'ggatcc', 'gaattc']
❶ list creation
❷ replace an element or a slice
❸ deletion of an element
Caution
This merges the two lists, whereas the method append() includes its argument in the list as a single item.
❺ insertion of an element
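The caution about merging versus appending can be illustrated with a small sketch in current Python 3 syntax. It assumes that the merging operation is the extend() method (the + operator merges in the same way but builds a new list), and the variable names merged and nested are chosen here for illustration:

```python
EcoRI, BamHI, HindIII = "gaattc", "ggatcc", "aagctt"

merged = [EcoRI]
merged.extend([BamHI, HindIII])   # merges the items of the second list

nested = [EcoRI]
nested.append([BamHI, HindIII])   # includes the second list as ONE item

print(merged)
print(nested)
```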
>>> range(3)
[0, 1, 2]
>>> range(10,20,2)
[10, 12, 14, 16, 18]
>>> range(5,2,-1)
[5, 4, 3]
>>> l = list('atgatgcgcccacgtacga')
>>> l
['a', 't', 'g', 'a', 't', 'g', 'c', 'g', 'c', 'c', 'c', 'a', 'c', 'g',
't', 'a', 'c', 'g', 'a']
The next example generates all possible digests using two enzymes from a list of enzymes. It is more
complex and uses a nested list and the range function introduced above (Example 2.5).
    digests = []
    for i in range(len(enzymes)):
        for k in range(i+1, len(enzymes)):
            digests.append( [enzymes[i], enzymes[k]] )
    return digests
❶ If the first statement of a function definition is a string, this string is used as documentation (see Section
1.2.3).
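A complete, runnable version of the idea might look like the sketch below, in current Python 3 syntax. The function name all_2_digests and its documentation string are assumptions, since the function header is not part of the listing above:

```python
def all_2_digests(enzymes):
    """ generate all digests that use two different enzymes """
    digests = []
    for i in range(len(enzymes)):
        # start at i + 1 so that each pair is produced only once
        for k in range(i + 1, len(enzymes)):
            digests.append([enzymes[i], enzymes[k]])
    return digests

pairs = all_2_digests(["EcoRI", "BamHI", "HindIII"])
print(pairs)
print(all_2_digests.__doc__)
```

Starting the inner loop at i + 1 avoids both self-pairs and the same pair in reverse order.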
Tip
For each enzyme you need two pieces of information: the restriction pattern and the position where the enzyme
cuts its pattern. You can model an enzyme as a list containing these two items, for example:
EcoRI = [ 'gaattc', 1 ]
Tip
If you need to do something with a list, first check whether a method of list objects already implements
your task. You can use the dir function to get all methods of a list object (Section 1.2.2).
Go to
Before you continue, see Section 4.2 to get a deeper insight into variable assignments, and also read Section
6.4, which explains how arguments can be passed to the function parameters.
2.3. Tuples
Tuples are like lists, but they cannot be modified. To create a tuple instead of a list, enclose the items in
parentheses instead of square brackets. In general, everything that can be done with tuples can also be done
with lists, but it is sometimes safer to prevent internal changes.
An appropriate use of tuples in a biological example could be the 3D-coordinates of an atom in a structure.
The example calculates distances between atoms in protein structures. Atoms are represented as tuples of their
coordinates x,y,z in space.
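Such a distance calculation could be sketched as follows, in current Python 3 syntax. The function name distance and the coordinates are made up for illustration and are not taken from a real structure:

```python
from math import sqrt

def distance(atom1, atom2):
    """ Euclidean distance between two atoms given as (x, y, z) tuples """
    dx = atom1[0] - atom2[0]
    dy = atom1[1] - atom2[1]
    dz = atom1[2] - atom2[2]
    return sqrt(dx * dx + dy * dy + dz * dz)

ca1 = (0.0, 0.0, 0.0)   # hypothetical coordinates
ca2 = (1.0, 2.0, 2.0)
d = distance(ca1, ca2)
print(d)
```

Since tuples cannot be modified, the function cannot accidentally change the coordinates it receives.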
but:
Caution
When you create a tuple with only one value, a comma has to follow the value. This is necessary to
distinguish the tuple from parentheses that merely group an expression. Look at the following example:
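A short sketch of this rule, in current Python 3 syntax, with variable names chosen here for illustration:

```python
grouped = ("EcoRI")     # parentheses only group the expression: a string
single = ("EcoRI",)     # the trailing comma makes a tuple with one item

print(type(grouped).__name__, type(single).__name__, len(single))
```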
Note
Tuples are used internally to pass arguments to the string format operator % (Section 6.2.4) and to pass a
variable number of arguments to a function (Section 6.6).
Go to
Follow the last links in the note above to learn how you can pass a variable list of arguments to a function.
You can also look at Section 4.1.1 which describes a special syntax of assignments using tuples.
a Important: shallow copy (see Example 4.5)
a equal to the + operator
b in place operation
Xrange objects have only one method, tolist(), which returns a list containing all values.
They have a special operator % (modulo) to format them (remember Section 6.2.4).
2.4.4. Buffers
Buffers are sequence interfaces to a memory region that treat each byte as an 8-bit character. They can be created by
the buffer(obj [, offset] [, size]) function and share the same memory as the underlying object
obj. This is a type for advanced use, so we will not say more about it.
2.5. Dictionaries
Dictionaries are collections of objects that are accessed by a key. They are created using a comma-separated
list of key-value pairs enclosed in braces, where each key is separated from its value by a colon. Example 2.8
shows some examples of dictionary manipulation and Table 2.5 provides an overview of dictionary methods.
>>> code = {"GLY" : "G", "ALA" : "A", "LEU" : "L", "ILE" : "I",
... "ARG" : "R", "LYS" : "K", "MET" : "M", "CYS" : "C",
... "TYR" : "Y", "THR" : "T", "PRO" : "P", "SER" : "S",
... "TRP" : "W", "ASP" : "D", "GLU" : "E", "ASN" : "N",
... "GLN" : "Q", "PHE" : "F", "HIS" : "H", "VAL" : "V"}
>>> code['VAL']
'V'
>>> code.has_key('NNN')
0
>>> code.keys()
['CYS', 'ILE', 'SER', 'GLN', 'LYS', 'ASN', 'PRO', 'THR', 'PHE', 'ALA',
'HIS', 'GLY', 'ASP', 'LEU', 'ARG', 'TRP', 'VAL', 'GLU', 'TYR', 'MET']
>>> code.values()
['C', 'I', 'S', 'Q', 'K', 'N', 'P', 'T', 'F', 'A', 'H', 'G', 'D', 'L',
'R', 'W', 'V', 'E', 'Y', 'M']
>>> code.items()
[('CYS', 'C'), ('ILE', 'I'), ('SER', 'S'), ('GLN', 'Q'), ('LYS', 'K'),
('ASN', 'N'), ('PRO', 'P'), ('THR', 'T'), ('PHE', 'F'), ('ALA', 'A'),
('HIS', 'H'), ('GLY', 'G'), ('ASP', 'D'), ('LEU', 'L'), ('ARG', 'R'),
('TRP', 'W'), ('VAL', 'V'), ('GLU', 'E'), ('TYR', 'Y'), ('MET', 'M')]
>>> one2three = {}
>>> for key,val in code.items():
...     one2three[val]= key
...
>>> one2three
{'A': 'ALA', 'C': 'CYS', 'E': 'GLU', 'D': 'ASP', 'G': 'GLY', 'F': 'PHE',
'I': 'ILE', 'H': 'HIS', 'K': 'LYS', 'M': 'MET', 'L': 'LEU', 'N': 'ASN',
'Q': 'GLN', 'P': 'PRO', 'S': 'SER', 'R': 'ARG', 'T': 'THR', 'W': 'TRP',
'V': 'VAL', 'Y': 'TYR', '?': '?'}
d.setdefault(key [, val]) same as d.get(key), but if key does not exist, sets
d[key] to val
d.popitem() removes an arbitrary item and returns it as a tuple
a Important: shallow copy (see Example 4.5)
    newprot = ""
    for aa in prot.split(sep):
        newprot += code.get(aa, "?")
    return newprot
>>> prot ="""GLN ALA GLN ILE THR GLY ARG PRO GLU TRP ILE TRP LEU
... ALA LEU GLY THR ALA LEU MET GLY LEU GLY THR LEU TYR
... PHE LEU VAL LYS GLY MET GLY VAL SER ASP PRO ASP ALA
... LYS LYS PHE TYR ALA ILE THR THR LEU VAL PRO ALA ILE"""
>>> three2one(prot)
'QAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPAI'
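A self-contained sketch of this translation, in current Python 3 syntax. The function header is an assumption (the listing above starts after it), in particular the default value None for sep, which makes split cut at any whitespace. The dictionary is only an excerpt of the full code table:

```python
# excerpt of the three-letter to one-letter code table
code = {"GLY": "G", "ALA": "A", "LEU": "L", "GLN": "Q"}

def three2one(prot, sep=None):
    """ translate a protein from three-letter into one-letter code """
    newprot = ""
    for aa in prot.split(sep):
        # get() returns "?" instead of raising KeyError for unknown residues
        newprot += code.get(aa, "?")
    return newprot

short = three2one("GLN ALA GLY XXX LEU")
print(short)
```

The second argument of get() is what keeps the function from failing on residues missing from the table.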
Go to
See Section 6.5 to learn more about default values of function parameters.
Note
The local namespaces of objects, which contain their method and attribute definitions, are implemented as
dictionaries (Section 4.3.1). Another internal use of dictionaries is the possibility to pass a variable list
of parameters using keywords (Example 6.7).
Go to
Remember how to pass a variable number of arguments to a function (Section 6.6) and look how to do
the same using keywords (Example 6.7).
2.6. Numbers
This section provides a short introduction to numbers in Python. Table 2.6 shows all built-in number types
of Python and Example 2.10 shows an example of complex numbers, which have a built-in type in Python.
Arithmetic in Python works as you would expect from a pocket calculator.
>>> (3+4j)
(3+4j)
>>> (3+4j) + (4+2j)
(7+6j)
>>> (3+4j).real
3.0
>>> (3+4j).imag
4.0
• a + b
• a * b
• a % b
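[Figure: overview of the built-in types, divided into modifiable and non-modifiable types. Labels recovered from
the figure: numeric types (integers, e.g. 12; long integers, e.g. 1212L; floats, e.g. 1.007 and 1e-10; complex
numbers, e.g. 3+4j), string, tuples, e.g. ( 14.001, 15.678, 1.999 ), list, e.g. [ 'BACR_HALHA', 'BACR_HALHA' ],
dictionaries (mapping), class instances, seq1, structure1.]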
It is sometimes necessary to convert variables from one type into another. For example, if you need to change some
of the characters of a string, you will have to transform the string into a mutable list. Likewise, see Solution A.1
where it was necessary to convert integers into floating point numbers. Table 2.7 provides the list of all possible
type conversions.
Go to
Read Section 4.3 to get a deeper insight into Python namespaces.
2.8. Files
The open(<filename>, [<mode>]) function opens a file with the specified access rights (see Table 2.9)
and returns a FileType object. Table 2.8 lists some of the methods available for FileType objects.
a See Section 2.4.2 for more information
f = open("seq.fasta") ❶
entry = get_fasta(f) ❷
while entry:
    # ... do what you have to do
    entry = get_fasta(f)
f.close() ❸
The second part shows the code of the function get_fasta that reads one sequence from a fasta file.
Reading fasta files is not as simple as reading files in other sequence formats, because there is no explicit end of
a sequence entry. You have to read the start of the following entry to know that your sequence is finished. The
following shows two possibilities to handle this problem while reading the file line by line:
_header = None
def get_fasta(fh):
    """ read a fasta entry from a file handle """
    global _header ❶
    if _header:
        header, _header = _header, None
    else:
        header = fh.readline()
    # end of file detection
    if not header:
        return header
    if header[0] != '>':
        return None
    seq = ""
    line = fh.readline()
    while line and line[0] != '>':
        seq += line[:-1]
        line = fh.readline()
    _header = line
❶ Go to
By default all variables are defined in the local namespace. Before looking at the second solution of the
problem, read Section 4.1 for how to differentiate between local and global variables.
The second possibility moves the current file position back to the start of the new entry before returning the
sequence. So all header lines but the first are read twice:
def get_fasta(fh):
    """ read a fasta entry from a file handle """
    header = fh.readline()
    # eof detection
    if not header:
        return header
    # no fasta format
    if header[0] != '>':
        return None
    seq = ""
    line = fh.readline()
    while line:
        if line[0] == '>':
            # go back to the start of the header line
            fh.seek(-len(line), 1)
            break
        seq += line[:-1]
        line = fh.readline()
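The seek-based approach can be sketched as a complete function in current Python 3 syntax. Several points are assumptions made for this sketch and are not taken from the listing above: the return value (a tuple of header and sequence, or None at the end of the file), the ValueError for input that is not in FASTA format, and the use of tell() with an absolute seek(), because Python 3 text files do not accept negative relative offsets. The test data are made up:

```python
import io

def get_fasta(fh):
    """ read one fasta entry; return (header, sequence) or None at end of file """
    header = fh.readline()
    if not header:                      # end of file
        return None
    if not header.startswith(">"):      # not in fasta format
        raise ValueError("not in fasta format")
    seq = []
    while True:
        pos = fh.tell()                 # remember where this line starts
        line = fh.readline()
        if not line:
            break
        if line.startswith(">"):
            fh.seek(pos)                # go back to the start of the header line
            break
        seq.append(line.strip())
    return header.strip(), "".join(seq)

fh = io.StringIO(">seq1\nacgt\nttga\n>seq2\ngg\n")
entries = []
entry = get_fasta(fh)
while entry:
    entries.append(entry)
    entry = get_fasta(fh)
print(entries)
```

Unlike the first solution, this version needs no global variable, at the price of reading each header line but the first twice.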
By default print writes the given string to the standard output and adds a line-feed. If a comma separated list
of strings is given, then all strings will be joined by a single whitespace before printing. The addition of a trailing
comma prevents the line-feed, but in this case a final whitespace is added.
The default destination can be redirected using the special >>file operator where file is the destination
FileType object.
Tip
It is better to keep the open and close calls outside the function, so that more than one sequence can be
written to a file.
Chapter 3. Syntax rules
Caution
Do not mix tab and space characters. The indentation length is not the length you see in the buffer, but
equal to the number of separation characters.
The python-mode of emacs deals with this issue: if you use tab characters, emacs will replace them by
space characters.
A block of code is initiated by a colon character followed by the indented instructions of the block. A one-line
block can also be given on the same line as the colon character.
>>> if dna.find(primer):
...     found = 1
...     'found'
...
'found'
but:
>>> if dna.find(primer):
...     found = 1
...    'found'
  File "<string>", line 3
    'found'
          ^
SyntaxError: invalid syntax
Statements such as: if, while and def require a block of code containing at least one instruction. If there is
nothing to do in the block, just use the pass statement.
>>> if found:
...     pass
... else:
...     'not found'
...
'not found'
>>> if found:
... else:
  File "<string>", line 2
    else:
       ^
IndentationError: expected an indented block
Go back
Return to the function definition section (Section 6.3).
Chapter 4. Variables and namespaces
Caution
The first assignment of a value stands for the variable declaration. If a value is assigned to a variable in a
function body, the variable will be local, even if there is a global variable with the same name, and this
global variable has been used before the assignment.
>>> enz = []
>>> def add_enz(*new):
...     enz = enz + list(new)
...
>>> add_enz('EcoRI')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 2, in add_enz
UnboundLocalError: local variable 'enz' referenced before assignment
This rule does not apply in the case of method calls. In the following example, the variable enz is only
used, not assigned, even if enz is actually modified internally.
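The contrast between assigning to a name and calling a method on it can be sketched as follows, in current Python 3 syntax. The two function names are chosen here for illustration, and extend() is only one possible choice of method:

```python
enz = []

def add_enz_assign(*new):
    enz = enz + list(new)      # assignment makes enz local: this fails

def add_enz_method(*new):
    enz.extend(new)            # method call on the global list: this works

try:
    add_enz_assign("EcoRI")
    assign_failed = False
except UnboundLocalError:
    assign_failed = True

add_enz_method("EcoRI", "BamHI")
print(assign_failed, enz)
```

The method call changes the content of the global list without ever rebinding the name enz, so no local variable is created.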
Go back
Return to the Fasta example (Example 2.11) and go on with the second solution.
Example 4.3.
>>> EcoRI
'gaattc'
>>> BamHI
'ggatcc'
Go back
Return to the end of the introduction to tuples (Section 2.3).
[Figure: the name digest2 and the strings 'EcoRI', 'HindIII' and 'BamHI']
The same strategy is used for the copy of composed objects. A target object is created and populated by new
references to the items of the source object. Figure 4.2 illustrates what happens in Example 4.5.
>>> newserie[1][0]='SarI'
>>> newserie
[['EcoRI', 'BamHI'], ['SarI', 'BamHI']]
>>> firstserie
[['EcoRI', 'HindIII'], ['EcoRI', 'BamHI'], ['SarI', 'BamHI']]
[Figure 4.2: newserie, newserie[1] and newserie[1][0], the latter referencing 'SarI']
If an independent copy is needed, the deepcopy function of the copy module should be used.
>>> newserie[1][0]='SarI'
>>> newserie
[['EcoRI', 'BamHI'], ['SarI', 'BamHI']]
>>> firstserie
[['EcoRI', 'HindIII'], ['EcoRI', 'BamHI'], ['HindIII', 'BamHI']]
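Both behaviours can be compared side by side in a short sketch, in current Python 3 syntax. It assumes that the shallow copy is made with a slice, and the names shallow and deep are chosen here for illustration:

```python
import copy

firstserie = [["EcoRI", "HindIII"], ["EcoRI", "BamHI"], ["HindIII", "BamHI"]]

shallow = firstserie[1:]               # new outer list, same inner lists
shallow[1][0] = "SarI"                 # also visible through firstserie
after_shallow = firstserie[2][0]

firstserie[2][0] = "HindIII"           # restore the original value
deep = copy.deepcopy(firstserie[1:])   # the inner lists are copied as well
deep[1][0] = "SarI"
after_deep = firstserie[2][0]

print(after_shallow, after_deep)
```

Only the deep copy leaves the source list untouched, because it duplicates the nested lists instead of creating new references to them.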
Go back
Return to the end of the introduction to the list type (Section 2.2).
4.3. Namespaces
There are three different namespaces in Python: a local namespace, a module namespace and a global namespace.
The latter contains all built-in functions. The module namespace contains all the function definitions and variables
of a module. It can be accessed using the . (dot) operator. A local environment is created at function calls. It
includes all the parameters and local variables of the function. Function definitions can be nested, and nested
functions have their own local namespace.
>>> enz = []
>>> add_enz('EcoRI')
enz: []
new: ('EcoRI',)
>>> enz
[ 'EcoRI' ]
Caution
This behaviour only exists since Python version 2.2. Previous versions have only one function execution
namespace. In this case, the new variable in Example 4.7 is not accessible within the verif function.
When object methods or attributes are addressed using the . (dot) operator, the namespace lookup is different.
Each object has its own local namespace implemented as a dictionary named __dict__. This dictionary is
searched for the name following the . (dot) operator. If the name is not found, the local namespace of the object's
class, accessible via the __class__ attribute, is searched. If it is not found there, a lookup on the parent classes
is performed. Since modules are objects, accessing the namespace of a module uses the same mechanism.
>>> enz.__dict__
Go back
Return to Section 2.7.
Chapter 5. Control flow
More complex tests can be written with the and, or and not operators.
>>> base.isalpha() ❶
1
❶ Important
Here we ask for the isalpha method of the string object base (see Section 6.2.3).
❷ The object None is the special “empty” object. It is always false.
❸ Some expressions that are false.
❹ A logical expression returns the value of the last operand evaluated: a false value such as 0 if the expression
is false, a true value otherwise.
❺ Important
The components of the logical expression are evaluated until the value of the entire expression is known.
Here the expression 1/0 is not executed because 1 is true and so the entire expression is true.
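This evaluation rule can be verified with a few expressions, in current Python 3 syntax. The variable names are chosen here for illustration:

```python
a = 1 or 1/0        # 1/0 is never evaluated: 1 is already true
b = 0 and 1/0       # 1/0 is never evaluated: 0 is already false
c = "" or "atg"     # the value of the last evaluated operand is returned

print(a, b, c)
```

None of the three lines raises ZeroDivisionError, which shows that the right-hand operand is skipped as soon as the result is known.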
Go back
Return to Section 6.3 or go directly to Section 3.1.2.
5.2. Loops
The two statements while and for are used to write loops in Python.
5.2.1. while
The while construct executes a block of code while a condition is true.
5.2.2. for
The loop construct for iterates over all members of a sequence.
Caution
This is equivalent to the foreach statement in some other programming languages. It is not the same
as the for statement in most other programming languages.
>>> whitespace
'\t\n\x0b\x0c\r '
>>> dna = """
... aaattcctga gccctgggtg caaagtctca gttctctgaa atcctgacct aattcacaag
... ggttactgaa gatttttctt gtttccagga cctctacagt ggattaattg gccccctgat
... tgtttgtcga agaccttact tgaaagtatt caatcccaga aggaagctgg aatttgccct
... tctgtttcta gtttttgatg agaatgaatc ttggtactta gatgacaaca tcaaaacata
... ctctgatcac cccgagaaag taaacaaaga tgatgaggaa ttcatagaaa gcaataaaat
... gcatggtatg tcacattatt ctaaaacaa """
• to execute code only if the loop was not interrupted with break by using the else statement following the
while clause
Caution
The else statement is also executed if the loop is not entered.
found = None
site = dna.find(enz)
else:
if found is not None: ❶
return found
❶ The test ensures that a restriction site occurrence at position 0 is also true.
start = -1
while 1:
    start = cds.find("atg", start+1)
    if start == -1:
        break
    if start % 3:
        continue ❶
❶ The continue statement is used to skip all atg codons that are out of frame.
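A complete version of this loop might look like the sketch below, in current Python 3 syntax. The function name inframe_starts, the collection of the positions in a list and the test sequence are assumptions made for illustration:

```python
def inframe_starts(cds):
    """ positions of all atg codons that are in frame with position 0 """
    starts = []
    start = -1
    while True:
        start = cds.find("atg", start + 1)
        if start == -1:
            break              # no further atg in the sequence
        if start % 3:
            continue           # out of frame: skip this occurrence
        starts.append(start)
    return starts

found = inframe_starts("atgcatgaaatgtga")
print(found)
```

In the test sequence, the atg at position 4 is skipped by continue because 4 is not a multiple of 3.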
Go back
Return to the end of Section 2.1.
Chapter 6. Functions
6.1. Some definitions
Function A function is a piece of code that performs a specific sub-task. It takes arguments that are
passed to parameters (special place holders to customise the task) and returns a result.
Operator An operator is a function that takes one or two arguments and that is invoked by the following
syntax: arg1 op arg2.
Note
Operators are defined by special methods in Python:
>>> "atgacta".__add__("atgataga")
'atgactaatgataga'
Procedure The terms "function" and "procedure" are often used as if they were interchangeable.
However, the role of a procedure is not to return a value, but to perform an action, such
as printing something on the terminal or modifying data (i.e. something which is sometimes
called "doing side-effects" in functional programming parlance).
Strictly speaking, the definition of a function is the same as the mathematical definition: given
the same arguments, the result will be identical, whereas the behaviour of a procedure can
vary, even if the task is invoked with the same arguments.
>>> enznames.index('BamHI')
1
>>> enznames.reverse() ❶
>>> enznames
['HindIII', 'BamHI', 'EcoRI']
❶ The reverse() method reverses the list enznames. It does so in place,
and does not construct a new list.
Method A method is a function or procedure that is associated with an object. It performs a task
that can be requested of the object. In Python it is called via the . (dot) operator.
>>> dna='atgctcgctgc'
>>> dna.upper()
'ATGCTCGCTGC'
6.2. Operators
6.2.1. Order of evaluation
Table 6.1 provides the precedence of Python operators. They are listed from the highest to the lowest priority.
Operators listed on the same row have equal priority.
Go back
Return to Section 2.1 to continue the introduction to strings.
❶ The Python interpreter displays two different kinds of prompts. The first >>> is the normal one. The second
... indicates the continuation of a block.
❷ Caution
Although the name of the argument (dna) is the same as the name of the parameter, their values are not the
same.
Go to
Read also Section 3.1.2 to learn more about Python syntax. You might need to read Section 5.1 as well
to understand the examples given in the syntax section.
Go back
Return to Section 2.1 to carry on with the introduction to strings.
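def remove_ambiguous_renz(Lenz):
    """ remove enzymes with ambiguous restriction patterns """
    # iterate from the end so that deleting an item does not shift
    # the positions of the items that are still to be checked
    for i in range(len(Lenz) - 1, -1, -1):
        if not check_dna(Lenz[i]):
            del Lenz[i]
>>> remove_ambiguous_renz(renz)
>>> renz
['gaattc', 'ggatcc', 'aagctt']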
[Figure 6.1: during function execution, the name Lenz in the local namespace references the same object as renz]
During the execution of remove_ambiguous_renz(renz) the content of Lenz is modified. Figure 6.1
shows that renz and Lenz refer to the same object and explains why renz is also modified.
or by explicit naming:
One advantage is that you do not have to know in what order parameters are declared in the function.
Go back
Return to the end of the introduction to the list type (Section 2.2).
>>> blast2('seq.fasta')
'blastall -p blastp -d swissprot -i seq.fasta'
Caution
Be careful if you pass mutable objects as default values. The content of the default value can be modified
after the function definition if there is also a global reference to it.
    if params:
        for para,value in params.items():
            command += " -%s '%s'" % (para, value)
    return command
>>> blast2('seq.fasta')
"blastall -p blastp -d swissprot -i seq.fasta -m '8' -e '1.0' -F 'S 10 1.0 1.5'"
>>> params['q']=-6
>>> blast2('seq.fasta')
"blastall -p blastp -d swissprot -i seq.fasta -q '-6' -m '8' -e '1.0' -F 'S 10 1.0 1.5'"
It is risky to keep global references to default values: when using global variables, make a deep
copy of the object instead (see Example 4.6).
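The risk and the deep copy remedy can be sketched as follows, in current Python 3 syntax. The function headers, the name defaults and the option values are assumptions made for illustration. Only the body of the loop follows the listing above:

```python
import copy

defaults = {"m": 8, "e": 1.0}      # global reference to the default options

def blast2(query, params=defaults):
    """ build a blastall command line; shares the global dictionary """
    command = "blastall -p blastp -d swissprot -i " + query
    for para, value in params.items():
        command += " -%s '%s'" % (para, value)
    return command

def blast2_indep(query, params=copy.deepcopy(defaults)):
    """ same, but the default is an independent copy made at definition time """
    return blast2(query, params)

before = blast2("seq.fasta")
defaults["q"] = -6                 # modifies the default of blast2 as well
after_shared = blast2("seq.fasta")
after_indep = blast2_indep("seq.fasta")
print(after_shared)
print(after_indep)
```

The default value is evaluated once, when the def statement is executed, which is why the copy made at that moment is not affected by later changes to the global dictionary.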
❶ (query,) is a tuple of one element. The comma is necessary because (query) is the syntax to indicate
precedence.
instead of:
Go back
Return to the end of the introduction to tuples (Section 2.3).
Optional variables can also be passed as keywords, if the last parameter is preceded by **. In this case, the
optional variables are placed in a dictionary.
    if params:
        for para,value in params.items():
            command += " -%s '%s'" % (para, value)
    return command
>>> blast2('seq.fasta')
'blastall -p blastp -d swissprot -i seq.fasta'
    if params:
        for para,value in params.items():
            command += " -%s '%s'" % (para, value)
    return command
As for required arguments, you can mix positional and keyword based assignment for optional arguments.
Invoked as:
Go back
Return to the end of the introduction to dictionaries and carry on with the next section (Section 2.6).
Chapter 7. Functional programming or more about lists
Caution
This chapter is under construction.
Chapter 8. Exceptions
8.1. General Mechanism
Exceptions are a mechanism to handle errors during the execution of a program. An exception is raised whenever
an error occurs:
>>> f = open('my_fil')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IOError: [Errno 2] No such file or directory: 'my_fil'
try:
    f = open('my_fil')
except IOError, e:
    print e
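The same mechanism in current Python 3 syntax, where the clause is spelled "except IOError as e", can be sketched as follows. The temporary directory only serves to guarantee that the file does not exist, and the names opened and errno_seen are chosen here for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    missing = os.path.join(tmpdir, "my_fil")   # certainly absent
    try:
        f = open(missing)
    except IOError as e:        # raised because the file does not exist
        opened = False
        errno_seen = e.errno    # 2 means "No such file or directory"
        print(e)
    else:
        f.close()               # only reached if open() succeeded
        opened = True
```

The else clause runs only when no exception was raised, which keeps the normal path separate from the error handling.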
[Figure: excerpt of the exception class hierarchy: SyntaxError (with IndentationError and its subclass TabError),
SystemError, TypeError, ValueError (with UnicodeError)]
• AttributeError: when you attempt to access a non-existing attribute (method or variable) of an object.
• IndexError, KeyError: occurs when attempting to access either an out-of-range index in a list or a
non-existing key in a dictionary
if something_wrong:
    raise Exception
if something_wrong:
    raise Exception, " something went wrong"
return 1
Go to
Since exceptions are defined as classes and by inheritance, you will need some knowledge about classes
in order to fully understand this section (see Chapter 12).
Example 8.3. Raising your own exception in case of a wrong DNA character
In the following code, you define an exception AlphabetError that can be used when the sequence passed to
the function does not correspond to the alphabet.
class AlphabetError(ValueError): ❶
pass
return 1
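A self-contained sketch of such a user-defined exception, in current Python 3 syntax. The function name check_dna, its alphabet parameter and the error message are assumptions, since the body of the example is not part of the listing above:

```python
class AlphabetError(ValueError):
    """ raised when a sequence contains a character outside the alphabet """
    pass

def check_dna(dna, alphabet="atgc"):
    """ return 1 if dna only contains characters of alphabet """
    for base in dna:
        if base not in alphabet:
            raise AlphabetError("%r is not in the alphabet %r" % (base, alphabet))
    return 1

try:
    check_dna("atgnca")
    caught = None
except AlphabetError as e:
    caught = str(e)

print(caught)
```

Because AlphabetError inherits from ValueError, code that already catches ValueError will catch it too.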
Chapter 9. Modules and packages
# file ValSeq.py
def find_valid_key(e):
    for key,value in valid_sequence_dict.items():
        if value == e:
            return key
import ValSeq
where ValSeq is the module name, and then access its components, which may be variables, functions, classes,
etc.:
• in PYTHONHOME, where Python has been installed (at Pasteur it is currently /local/lib/python2.2/),
• in a path, i.e. a colon (':') separated list of file paths, stored in the environment variable PYTHONPATH.
• Python files, suffixed by .py (when loaded for the first time, compiled version of the file is stored in the
corresponding .pyc file),
• defined as C extensions,
9.1.2. Loading
When importing a module, for example the dna module you have just created (Exercise 9.2), you "open" its
namespace, which becomes available to your program:
You may also select specific components from the module as "opened" (Figure 9.1):
In such cases, other components stay hidden, and the namespace is not the one of the module, e.g.:
[Figure 9.1: the ValSeq module namespace and its components valid_sequence_dict and find_valid_key]
You can also load "all" the components from a module, which makes them available directly into your code:
Caution
You can restrict the components being imported by an import * statement. The __all__ variable,
also used for packages (Section 9.2), can explicitly list the components to be directly accessible (see
Exercise 9.8).
A module is loaded only once, i.e. a second import statement will not re-execute the code inside the module (see
the Python reload function in the reference guides).
% python dna.py
the module is executed within the __main__ module (i.e not the dna module):
% python -i dna.py
>>> dna.complement('aattttt')
NameError: name 'dna' is not defined
>>> complement('aattttt')
'ttaaaaa'
>>> revcomp('aattttt')
'aaaaatt'
>>> dna_translate('atggacaatttttccgggacgtag')
'MASPNFSGT*'
For this reason, the code executed at module loading time can be made dependent on the current module name:
if __name__ == '__main__':
    # statements that you want to be executed only when the
    # module is executed from the command line
    # (not when importing the code by an import statement)
>>> seq=Seq("actttgccatatg") ❶
❶ Seq() is a function call that creates an instance of the class Seq, so you need to be able to access this
component of the Bio.Seq module.
Go to
This is not required, but you can see Chapter 10 for more explanations.
Solution A.16
9.2. Packages
A package is a set of modules or sub-packages. A package is actually a directory containing either .py files or
sub-directories defining other packages.
The dot (.) operator is used to describe a hierarchy of packages and modules. For instance, the module
Bio.WWW.ExPASy is located in the file PYTHONHOME/site-packages/Bio/WWW/ExPASy.py. This
module belongs to the Bio.WWW package located in the PYTHONHOME/site-packages/Bio/WWW/
directory.
9.2.1. Loading
When loading a package, the __init__.py file is executed. If __init__.py defines classes, functions,
etc., they become available at once, as shown in the following example:
However, loading a package does not automatically load the inner modules. For instance, even though the
Bio.Fasta package directory contains the following files:
% ls Bio/Fasta
FastaAlign.py FastaAlign.pyc __init__.py __init__.pyc
this does not imply that importing the Bio.Fasta package loads the Bio.Fasta.FastaAlign module:
Issuing:
will however load the Bio.Fasta.FastaAlign module, because this module is mentioned in the __all__ attribute
in the Bio/Fasta/__init__.py file:
__all__ = [
    'FastaAlign',
]
• __name__
• __path__
• __file__
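The behaviour of packages described in this section can be reproduced with a tiny package built on the fly. This is a sketch only: the names demo_bio and fasta_tools are invented, they merely mimic the layout of Bio/Fasta, and the code is written for Python 3.

```python
import os
import sys
import tempfile

# Build a small package directory: demo_bio/__init__.py and demo_bio/fasta_tools.py
root = tempfile.mkdtemp()
package_dir = os.path.join(root, "demo_bio")
os.mkdir(package_dir)
with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
    handle.write("__all__ = ['fasta_tools']\n"
                 "def hello():\n"
                 "    return 'defined in __init__.py'\n")
with open(os.path.join(package_dir, "fasta_tools.py"), "w") as handle:
    handle.write("def title(line):\n"
                 "    return line[1:].strip()\n")
sys.path.insert(0, root)

import demo_bio                          # executes demo_bio/__init__.py only
print(demo_bio.hello())
loaded_before = hasattr(demo_bio, "fasta_tools")
print(loaded_before)                     # the inner module is not loaded yet

namespace = {}
exec("from demo_bio import *", namespace)   # loads the modules named in __all__
print(namespace["fasta_tools"].title(">CERU_HUMAN ferroxidase"))

# Attributes carried by the package object
print(demo_bio.__name__)
print(demo_bio.__path__)
print(demo_bio.__file__)
```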
expasy = ExPASy.get_sprot_raw('CERU_HUMAN')
sp = SProt.Iterator(expasy, SProt.RecordParser())
record = sp.next()
print record.keywords
Solution A.17
Solution A.18
Solution A.19
Chapter 10. Classes: Using classes
The direct syntax to instantiate a class, i.e. to create an instance of the class, is to call a function with the
same name as the class. For instance, the random Python module defines a class Random. In order to create an
instance of this class, i.e. an object which generates random numbers, you do:
This creates the object and makes it available from the generator variable. You can now use this object to call
Random class methods:
>>> generator.randrange(100)
75
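A minimal sketch of the instantiation and method call described here, using only the standard random module (Python 3 syntax). The seed value 42 is arbitrary and only serves to make two generators comparable.

```python
import random

generator = random.Random(42)      # the seed argument is optional
value = generator.randrange(100)   # an integer between 0 and 99
print(value)

# A second instance created with the same seed produces the same first number.
other = random.Random(42)
print(other.randrange(100) == value)
```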
Sometimes, the instantiation needs arguments, depending on the definition of the class. For instance, there is a
class Netscape defined in the webbrowser Python module. In order to create an instance of this class, i.e.
a browser that is able to browse Web documents, you need to pass the path of the netscape program on your
computer:
Now, you can use the browser and open a Web document by:
browser.open('http://www.biopython.org/')
Or, if we want to directly create an instance of class Seq, one of the classes defined in Biopython to create sequence
objects (see Section 11.3), we do:
seq = Seq('gcatgacgttattacgactctgtcacgccgcggtgcgacgcgtctgctggg')
seq = Seq('MKILILGIFLFLCSTPAWAKEKHYYIGIIETTWDYASDHGEKKLISVDTE',
          alphabet=Alphabet.ProteinAlphabet()
          )
pydoc webbrowser.Netscape
See also the embedding module, which might bring additional documentation about related components:
pydoc webbrowser
Attributes of a class also include variables. The variables defined for the instances can however not be listed
by pydoc, since they belong to the instances, not to the class. That is why they should be described in the
documentation string of the class. If they are not, which sometimes happens, run the Python interpreter and
create an instance, then ask for its dictionary:
>>> seq.data
'gcatgacgttattacgactctgtcacgccgcggtgcgacgcgtctgctggg'
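The instance dictionary mentioned above can be inspected on any object. The sketch below uses a toy class, DemoSeq, invented for the illustration (it is not the Biopython Seq class), in Python 3 syntax.

```python
class DemoSeq:
    """Toy sequence class (hypothetical, not the Biopython Seq class)."""

    def __init__(self, data, alphabet="dna"):
        self.data = data
        self.alphabet = alphabet


seq = DemoSeq("gcatgacgtt")
print(seq.__dict__)        # the instance variables and their values
print(sorted(vars(seq)))   # vars(obj) returns the same dictionary
print(seq.data)
```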
When you consult the documentation of a class with the pydoc command, most of the time you get a list of strange
method names, such as __add__ or __getitem__. These methods are special methods to redefine operators,
and will be explained in the next chapter on classes (Chapter 12, Section 12.2).
Chapter 11. Biopython: Introduction
The course about Biopython is divided into two parts, separated by a chapter explaining how to define new classes
in Python (Chapter 12). The first part (the present chapter) attempts to cover the use of central components such
as components for sequences (Seq, SeqRecord, SeqFeature), alignments (Blast, Clustalw) and database access
(SwissProt, GenBank). Then, the second part (Chapter 13) presents the main concepts of parsing in Biopython,
associated with exercises to build parsing classes for Enzyme entries. The last part of the Biopython presentation
summarizes several of the exercises provided in this course by the study of disulfide bonds in the Human Ferroxidase
3D structure and alignments (see Section 13.2).
11.2. Documentation
• http://www.biopython.org/
See also:
1 Mirror copy of PDF [http://bioweb.pasteur.fr/docs/doc-gensoft/biopython/Doc/Tutorial.pdf].
[Figure: overview of the Biopython exercises of this course. Labels recovered from the diagram: Seq (reading, mutate, random); GenBank (fetch GenBank SwissProt ref., find complete CDS); Clustalw (running); PSSM (computed conserved Cys sites); Enzyme (build parsing classes, fetch Enzyme SwissProt ref.); PDB (define a PDBStructure class, defining a class, compute disulfide bonds according to coordinates); compare: annotated disulfide bonds (PDB), computed disulfide bonds (from coordinates), predicted Cys conserved sites (PSSM).]
Most of the string manipulations seen in Section 2.1 are available on Seq objects.
import Bio.Fasta
import sys
handle = open(sys.argv[1])
it = Bio.Fasta.Iterator(handle, Bio.Fasta.SequenceParser())
seq = it.next()
while seq:
    print seq.name
    print seq.seq
    seq = it.next()
handle.close()
import sys
from Bio.SeqIO import FASTA
handle = open(sys.argv[1])
it = FASTA.FastaReader(handle)
seq = it.next()
while seq:
    print seq.name
    print seq.seq
    seq = it.next()
handle.close()
Figure 11.2 describes the classes to handle sequences. A SeqRecord is composed of Seq and SeqFeatures.
Figure 11.2. Seq, SeqRecord and SeqFeatures modules and classes hierarchies
• SequenceParser class in Bio.Fasta module (used as argument for creating an Iterator in Example
11.2).
• Bio.GenBank parser (not tested and thus not represented in Figure 11.3) (see Section 11.5)
[Figure 11.3: parser classes and the objects they create. Modules shown: Bio.SwissProt.SProt, Bio.SeqIO, Bio.Fasta (Iterator with its next method, SequenceParser with its parse method), Bio.SeqRecord and Bio.Align.Generic (Alignment with its get_all_seqs method). Record attributes: sequence, entry_name, accessions, annotation_update, description, features, organism. SeqRecord attributes: seq, id = "<unknown id>", name = "<unknown name>", description = "<unknown description>", annotations = {}, features = []. The arrows are labelled "creates an instance of", "creates instances of" and "takes an instance of".]
mutateseq(seq,span=1000,p=0.01)
Solution A.23
#--------------------------------------------------------
# bar charts of codon frequencies
# - for legibility, 2 charts are built
def codon_sort(a,b):
    # comparison function: orders codons alphabetically
    if a < b:
        return -1
    elif a > b:
        return 1
    else:
        return 0
labels=count.keys()
labels.sort(codon_sort)
# take the counts in the same order as the sorted labels
values=[count[codon] for codon in labels]
half=len(labels)/2
w1=window(plot_title='Count codons',width=1000)
y=array(values[:half])
x=arange(len(y)+1)
w1.bar(y,x,label=labels[:half])
w2=window(plot_title='Count codons(2)',width=1000)
# the second chart starts at half (not half+1), so that no codon is skipped
y=array(values[half:])
x=arange(len(y)+1)
w2.bar(y,x,label=labels[half:])
In the following example, the entry is fetched from a local file, provided on the command line:
handle = open(argv[1])
sp = SProt.Iterator(handle, SProt.RecordParser())
record = sp.next()
print record.entry_name
print record.sequence
Exercise 11.8. Code reading: connecting with ExPASy and parsing SwissProt records
A second example is given by the script swissprot.py [http://bioweb.pasteur.fr/docs/doc-gensoft/biopython/Doc/examples/swissprot.py]
provided within the Biopython distribution.
convert_sp_fasta('data/ceru_human.sp', stdout)
Tip
Look again at Example 11.3 which uses the Bio.SeqIO.FASTA module.
Tip
To read the SwissProt entry, you will need to use another parser than in Example 11.5, since the record
you need must be compatible with the FastaWriter class (see Figure 11.3).
Look at Figure 11.3 to understand how to use the FastaWriter class in conjunction with a SwissProt parser.
Solution A.26
Tip
Use the Python os.popen function.
>>> import re
You then issue a search, for instance in the small sequence seq, by:
To get the occurrences, you can ask for the start and end of the match in the searched text:
Example 11.6. Searching for the occurrence of PS00079 and PS00080 Prosite patterns in
the Human Ferroxidase protein
import sys
import re
from Bio.SwissProt import SProt
sp = open(sys.argv[1])
iterator = SProt.Iterator(sp, SProt.SequenceParser())
seq = iterator.next().seq
sp.close()
PS00079 = 'G.[FYW].[LIVMFYW].[CST].{8,8}G[LM]...[LIVMFYW]' ❶
p = re.compile(PS00079) ❷
result = p.search(seq.tostring()) ❸
print PS00079
print result.start(), result.end(), seq[result.start():result.end()] ❹
import sys
import re
from Bio.SwissProt import SProt
sp = open(sys.argv[1])
iterator = SProt.Iterator(sp, SProt.SequenceParser())
seq = iterator.next().seq
sp.close()
PS00080 = '(?P<copper3>H)CH...H...[AG](?P<copper1>[LM])' ❶
p = re.compile(PS00080)
result = p.search(seq.tostring())
print PS00080
print result.start(), result.end(), seq[result.start():result.end()]
To get information about the re module, see pydoc, but also the sre module (Support for regular expressions),
for which re is a wrapper.
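The search, start and end calls used in these scripts, as well as the named groups of the PS00080 pattern, can be tried without Biopython. In the sketch below, the protein string is a toy fragment invented so that the pattern matches once; it is not the Human Ferroxidase sequence. Python 3 syntax.

```python
import re

# Toy protein fragment, invented for the example.
protein = "MKHCHVTDHIHAGMQT"

PS00080 = "(?P<copper3>H)CH...H...[AG](?P<copper1>[LM])"
pattern = re.compile(PS00080)

result = pattern.search(protein)
if result is None:
    print("no match")
else:
    print(result.start(), result.end(), protein[result.start():result.end()])
    # Named groups give the position of individual residues inside the match.
    print("copper3 at", result.start("copper3"))
    print("copper1 at", result.start("copper1"))

# finditer reports every non-overlapping occurrence, not only the first one.
positions = [match.start() for match in re.finditer("H", protein)]
print(positions)
```

Testing the result against None before calling start avoids an AttributeError when the pattern does not occur in the sequence.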
11.4.3. Prosite
entry = prosite['PS00079']
prosite = Bio.Prosite.ExPASyDictionary()
As you can guess from the name of the module, you actually fetch the Prosite entry on the Web. You could also fetch
the Prosite entry from a local database with the golden program (see Exercise 11.10). The entry fetched above is
actually a string. In order to have the dictionary return a record, you must instead create it like this:
prosite = Bio.Prosite.ExPASyDictionary(parser=Bio.Prosite.RecordParser())
get_prosite_pattern('PS00079')
Solution A.29
Write a function get_prosite_refs that extracts the references to Prosite from a SwissProt entry (provided
as a handle) (data [data/ceru_human.sp]).
The function get_prosite_pattern, defined in Solution A.29, and the function get_prosite_refs can
be combined to display the patterns of the Prosite references given in a SwissProt entry. Write the statements
to achieve this task.
You can also add these functions to the sprot module (sprot.py).
Solution A.30
The Bio.Prosite package defines a Pattern class that lets you create patterns which may be searched
for in sequence objects, as in the re Python module for regular expressions. The result of a search is a
PrositeMatch, which behaves in a way similar to a regular expression match.
Exercise 11.14. Search for occurrences of a protein's PROSITE patterns in the sequence
Now, you know how to fetch a Prosite entry, how to extract a Prosite reference from a SwissProt entry, and how to
search for pattern occurrences in a sequence. Search for the occurrences of the prosite_refs patterns in the
sequence seq. Display:
Solution A.31
11.5. Bio.GenBank
First, look at the section on GenBank in the Biopython tutorial: http://www.biopython.org/docs/tutorial/Tutorial.html.
The GenBank.NCBIDictionary may also be combined with a parser (there is a bug in the
current Biopython release that disables this feature), producing either:
• a GenBank.Record instance:
• or a SeqRecord instance:
11.5.1.2. Iterator
gb_file = argv[1]
gb_handle = open(gb_file, 'r')
feature_parser = GenBank.FeatureParser()
gb_iterator = GenBank.Iterator(gb_handle, feature_parser)
while 1:
    cur_record = gb_iterator.next()
    if cur_record is None:
        break
    print cur_record.seq
rec = get_gbrec(id)
cds = get_complete_cds(rec)
that returns the DNA sequence of the complete CDS. The get_gbrec function returns a SeqRecord (see examples in
gb_refs.py [exemples/gb_refs.py]).
Solution A.32
Tip
There is a class which stores the description of the hit, including the Expect value (which is not
necessarily the same as the Expect value of each HSP).
Start from get_prosite_refs and get_prosite_pattern (Exercise 11.13) to get the patterns (these
functions should have been saved in module sprot.py [exercises/sprot.py]). You will then have to provide the
patterns to PHI-Blast in a "hitfile" (-k parameter) to run a PHI-Blast (see Solution A.36).
Solution A.37
result=run_fasta(query_file, 'gbmam')
Write the run_fasta function, knowing that the appropriate fasta command line to search a protein database is for
instance: fasta_t -q data/ceru_human.fasta /local/databases/fasta/gpmam
The result can be provided as text (i.e. not as Python classes). Solution A.38
Tip
Use the Python os.popen function.
11.6.2. Clustalw
import Bio.Clustalw
# the two names below are used without the module prefix
from Bio.Clustalw import MultipleAlignCL, do_alignment
from Bio.Alphabet import IUPAC
from sys import *
cline = MultipleAlignCL(argv[1]) ❶
cline.set_output('data/test.aln')
print "Command line: ", cline
align = do_alignment(cline) ❷
for seq in align.get_all_seqs(): ❸
    print seq.description
    print seq.seq
Solution A.39
import Bio.Clustalw
import Bio.Align.AlignInfo
from Bio.Alphabet import IUPAC
from sys import *
(alignment [data/ceru_human.blastp-edit.aln])
❶ A PSSM is related to one of the sequences in the alignment (here the first one)
Using this PSSM, then display the positions in the alignment that have a percent identity above a given threshold.
Notice the positions with conserved cysteines (Cys).
Solution A.41
Use this code and Exercise 11.24 to plot the Cys conserved positions of the alignment.
# ---------------------------------------------------
# plot of Cys positions
# (see also Chapter 14 on Graphics)
root = Tk()
frame = Frame(root)
frame.pack()
g = Pmw.Blt.Graph(frame)
g.pack( expand=1, fill='both' )
g.line_create( "percent of identity", xdata=vector_x, ydata=vector_y )
g.configure(width=1000)
g.configure(height=500)
g.element_configure('percent of identity', symbol='none')
g.axis_configure('x', stepsize=100)
❶ Fill this with the code necessary to get tuples containing the plot values (percent of Cys).
Solution A.42
handle = open(sys.argv[1])
it = FASTA.FastaReader(handle)
seq = it.next()
handle.close()
factory = PiseFactory()
cusp = factory.program(’cusp’)
cusp.sequence(seq)
job = cusp.run()
if job.error():
    print "Error: " + job.error_message()
else:
    print "Output:\n", job.content("outfile.out")
Chapter 12. Classes: Defining a new class
• tostring
• tomutable
• count
class Seq: ❶
    def tostring(self): ❷
        return self.data ❸
    def tomutable(self):
        return MutableSeq(self.data, self.alphabet)
❶ This method is always called when creating a new instance of class Seq. It’s the constructor.
❷ The first argument passed to a method call is always the object itself. Thus, the first parameter of a
method must be a variable pointing to the object itself, which gives access to its attributes from
inside the body of the method. self is just a naming convention.
❸ Use of the self variable and the ’.’ (dot) operator to access the data attribute.
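The three points above (constructor, self parameter, attribute access through the dot operator) can be seen together in a small class. DemoSeq is a toy class written for this illustration in Python 3; it only borrows the method names tostring and count from the list above and is not the Biopython implementation.

```python
class DemoSeq:
    def __init__(self, data, alphabet="dna"):
        # the constructor: runs each time DemoSeq(...) is called
        self.data = data
        self.alphabet = alphabet

    def tostring(self):
        # self gives access to the attributes of the current object
        return self.data

    def count(self, item):
        return self.data.count(item)


seq = DemoSeq("gcatgacgtt")
print(seq.tostring())
print(seq.count("g"))
# seq.count("g") is a shorthand for calling the function through the class,
# with the object passed explicitly as first argument:
print(DemoSeq.count(seq, "g"))
```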
struct = PDBStructure()
residue = struct.add_residue(name = "ILE", posseq = 1 )
struct.add_atom(residue, name = "N",
coord = (23.46800041, -8.01799965, -15.26200008))
struct.add_atom(residue, name = "CZ",
coord = (125.50499725, 4.50500011, -19.14800072))
residue = struct.add_residue(name = "LYS", posseq = 2 )
struct.add_atom(residue, name = "OE1",
coord = (126.12000275, -1.78199995, -15.04199982))
print struct.residues
You also might need an __init__ method to initialize the data structures.
Tip
The print statement should return:
• Add 2 variables, model_id and chain_id, to the residue, in order to store the model ID (a PDB entry may have
more than one model) and the chain ID.
• Add fields to the atom: tempfactor to store the temperature factor, occupancy, altloc for an
alternate location, and element, which is the chemical name, e.g. “C” (while name, e.g. “CG2”, is the
chemical name plus the position).
struct = PDBStructure()
residue = struct.add_residue(model_id="1", chain_id="A",
name = "ILE", posseq = 1 )
struct.add_atom(residue, name = "N",
coord = (23.46800041, -8.01799965, -15.26200008),
tempfactor=169.09, occupancy = 1.0,
element = "N")
(Solution A.44)
• to retrieve the model and chain a residue belongs to (methods residue_model and residue_chain)
(Solution A.45)
• __add__: defines +
• __sub__: defines -
• __str__: defines how to convert the instance to a string representation (e.g. for the print statement)
• etc...
• etc...
class Seq:
    def __repr__(self):
        return "%s(%s, %s)" % (self.__class__.__name__,
                               repr(self.data),
                               repr(self.alphabet))
    def __str__(self):
        if len(self.data) > 60:
            s = repr(self.data[:60] + " ...")
        else:
            s = repr(self.data)
        return "%s(%s, %s)" % (self.__class__.__name__, s,
                               repr(self.alphabet))
    def __len__(self): return len(self.data)
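To see which operator or built-in function triggers which special method, here is a short sketch with a toy class (DemoSeq, invented for the illustration, Python 3 syntax). Each method carries a comment naming the expression that calls it.

```python
class DemoSeq:
    def __init__(self, data):
        self.data = data

    def __len__(self):                 # len(seq)
        return len(self.data)

    def __getitem__(self, index):      # seq[i] and seq[i:j]
        return self.data[index]

    def __add__(self, other):          # seq1 + seq2
        return DemoSeq(self.data + other.data)

    def __str__(self):                 # str(seq) and print(seq)
        return self.data

    def __repr__(self):                # shown by the interactive interpreter
        return "%s(%r)" % (self.__class__.__name__, self.data)


a = DemoSeq("atg")
b = DemoSeq("gcc")
print(len(a), a[0], a[1:], a + b)
print(repr(a + b))
```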
12.3. Inheritance
Example 12.3. biopython FastaAlignment class
The following code shows the definition of the class FastaAlignment in module Bio.Fasta.FastaAlign.
This class inherits from class Bio.Align.Generic.Alignment, which defines generic methods for
alignments.
class FastaAlignment(Alignment): ❶
    def __init__(self, alphabet = Alphabet.Gapped(IUPAC.ambiguous_dna)):
        Alignment.__init__(self, alphabet) ❷
    def __str__(self):
        """Print out a fasta version of the alignment info."""
        return_string = ''
        for item in self._records:
            new_f_record = Fasta.Record()
            new_f_record.title = item.description
            new_f_record.sequence = item.seq.data
            # reconstructed line (missing from this listing): append each record
            return_string = return_string + str(new_f_record) + "\n\n"
        # have a extra newline, so strip two off and add one before returning
        return string.rstrip(return_string) + "\n"
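The same inheritance mechanism can be tried without Biopython. The sketch below uses two toy classes written in Python 3; their names echo the example above, but the methods add_sequence and get_all_seqs are simplified stand-ins and do not reproduce the Biopython implementation.

```python
class Alignment:
    """Generic container (toy version, not Bio.Align.Generic.Alignment)."""

    def __init__(self, alphabet):
        self.alphabet = alphabet
        self._records = []

    def add_sequence(self, description, sequence):
        self._records.append((description, sequence))

    def get_all_seqs(self):
        return self._records


class FastaAlignment(Alignment):
    def __init__(self, alphabet="dna"):
        # explicit call to the constructor of the parent class
        Alignment.__init__(self, alphabet)

    def __str__(self):
        lines = []
        for description, sequence in self._records:
            lines.append(">" + description)
            lines.append(sequence)
        return "\n".join(lines) + "\n"


align = FastaAlignment()
align.add_sequence("seq1", "atg-cc")     # method inherited from Alignment
align.add_sequence("seq2", "atgacc")
print(len(align.get_all_seqs()))
print(str(align), end="")
print(isinstance(align, Alignment))
```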
class CodonTable:
    nucleotide_alphabet = Alphabet.generic_nucleotide ❶
    protein_alphabet = Alphabet.generic_protein
❶ Class variables definition (notice that they are declared outside of the methods). Notice that these definitions
also use Alphabet generic_nucleotide and generic_protein class variables.
❷ Instance variables definition (initialized in __init__()). Instance variables are initialized with default
values that are provided either by the class variables, or by the parameters.
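The difference between class variables and instance variables can be observed directly. DemoCodonTable below is a toy class invented for this sketch (Python 3 syntax); it is not the Biopython CodonTable.

```python
class DemoCodonTable:
    # class variable: one value shared by all instances
    stop_codons = ["taa", "tag", "tga"]

    def __init__(self, name, start_codons=None):
        # instance variables: one value per object
        self.name = name
        if start_codons is None:
            start_codons = ["atg"]
        self.start_codons = start_codons


standard = DemoCodonTable("standard")
other = DemoCodonTable("other", start_codons=["atg", "gtg"])

shared = standard.stop_codons is other.stop_codons
print(shared)                                       # same list object
print(standard.start_codons, other.start_codons)    # separate values

# Assigning through an instance creates an instance variable that hides
# the class variable for this object only.
other.stop_codons = ["taa"]
print(DemoCodonTable.stop_codons, standard.stop_codons, other.stop_codons)
```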
struct = PDBStructure()
self.verbose(1)
You can then add some verbose code into the methods of the class, such as:
If you have some time left, you can try to use the Python Numeric [http://www.python.org/topics/scicomp/numpy.html]
module for computing with multi-dimensional arrays - see also "Python for Scientific Computing" (PPT
[http://www.python9.org/p9-jones.ppt]) for a presentation at the Python 9 Conference [http://www.python9.org/].
Solution A.47
Chapter 13. Biopython, continued
The aim of this section is not to describe all these tools, but rather to explain how they work, in order for
you to be able to build one for your own programs or databases, or for programs that do not have a parser
yet (see also Parser.txt [http://bioweb.pasteur.fr/docs/doc-gensoft/biopython/Doc/Parser.txt] from the Biopython
documentation for a design overview of parsers). For this purpose, we will build a simple parser for the Enzyme
[http://www.expasy.ch/enzyme/] database, that is only able to store the enzyme ID and the references of an entry
to other databases. There is already a Bio.Enzyme package in Biopython, which defines a _Scanner class,
that we are going to use, but no parser (yet).
In Biopython, parsing is often organized according to an event/callback model, one component, often called the
scanner, generating parsing events when encountering a tag, and another component, often called the consumer,
receiving and handling these events. Generally, you feed data to scan to the scanner through a handle, which can
be an open file or an http stream.
scanner.feed(handle)
The scanner has to know about which consumer to call, which can be achieved by having a standard consumer for
a given type of data. You can also provide a consumer as a parameter to the scanner:
scanner.feed(handle, consumer)
This architecture has the advantage of dividing the tasks of scanning the text and deciding what to do with the
recognized text elements. Writing your own consumer enables you to build your own data structures.
At a higher level, a parser component may wrap the two other components in one class, providing a simpler
component to the programmer, since he or she just has to call the parser:
parser.parse(handle)
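The division of work between scanner, consumer and parser can be sketched in a few lines, independently of Biopython. Everything in this example is an assumption made for the illustration: the class names, the event names (identification, reference) and the simplified "TAG value" line format are invented, and the code is written for Python 3.

```python
import io


class DemoScanner:
    """Reads 'TAG value' lines and turns each known tag into an event."""

    # which consumer method to call for which tag
    events = {"ID": "identification", "DR": "reference"}

    def feed(self, handle, consumer):
        consumer.start_record()
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("//"):        # end of the record
                break
            tag, _, value = line.partition(" ")
            event = self.events.get(tag)
            if event is not None:
                getattr(consumer, event)(value.strip())
        consumer.end_record()


class DemoConsumer:
    """Builds a dictionary from the events it receives."""

    def start_record(self):
        self.data = {"id": None, "references": []}

    def end_record(self):
        pass

    def identification(self, value):
        self.data["id"] = value

    def reference(self, value):
        self.data["references"].append(value)


class DemoParser:
    """Wraps a scanner and a consumer behind a single parse() call."""

    def __init__(self):
        self._scanner = DemoScanner()
        self._consumer = DemoConsumer()

    def parse(self, handle):
        self._scanner.feed(handle, self._consumer)
        return self._consumer.data


entry = io.StringIO("ID 1.16.3.1\nDE Ferroxidase.\nDR P00450, CERU_HUMAN;\n//\n")
record = DemoParser().parse(entry)
print(record)
```

Replacing DemoConsumer with another class that defines the same methods changes the data structure that is built, without touching the scanner.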
class Parser(AbstractParser): ❶
    def __init__(self):
        self._scanner = _Scanner() ❷
        self._consumer = _Consumer()
❶ AbstractParser: see below for explanation about Biopython classes to support parsing.
❷ Wrapped scanner and consumer.
Biopython provides support for defining new parser and consumer classes through the
Bio.ParserSupport module (AbstractParser and AbstractConsumer classes).
• 2 consumers:
• class Bio.SwissProt.SProt._RecordConsumer
• and Bio.SwissProt.SProt._SequenceConsumer
Using the SequenceParser, you get SeqRecord objects (whose seq attribute is a Seq), while using the RecordParser, you get
Record objects.
fh = open(argv[1])
sp = SProt.Iterator(fh, SProt.RecordParser()) ❶
record = sp.next()
for feat in record.features:
    if feat[0] == 'DOMAIN':
        print "domain:", record.sequence[feat[1]:feat[2]+1]
fh.close()
fh = open(argv[1])
sp = SProt.Iterator(fh, SProt.SequenceParser()) ❷
sequence = sp.next()
print "sequence: ", sequence.seq
fh.close()
❶ Reading the file with SProt.RecordParser. This gives access to the annotations and features.
❷ Re-reading the file with SProt.SequenceParser.
To build a consumer, you need to know which events the scanner will generate. The Biopython distribution contains
documentation on this topic. See for instance a copy at: http://bioweb.pasteur.fr/docs/doc-gensoft/biopython/Doc/.
handle = open(argv[1])
scanner = Enzyme._Scanner()
consumer = EnzymeConsumer()
scanner.feed(handle, consumer)
print "results: ", consumer._references
(Solution A.48)
for id in consumer._references.keys():
    print id, consumer._references[id]
(Solution A.49)
handle = open(argv[1])
parser = EnzymeParser()
references = parser.parse(handle)
for id in references.keys():
    print id, references[id]
(Solution A.50)
13.1.3. Iterator
An iterator is an object that sequentially returns successive records from a data input:
iterator = Iterator(handle)
record = iterator.next()
while record:
    print record
    record = iterator.next()
From this data input, the iterator next method provides a parser with the lines corresponding to a record,
from which the parser will build a record, e.g. (simplified):
def next(self):
    # 1) read the appropriate lines (until end of record)
    # 2) call the parser
    return self._parser.parse(lines)
You might need to convert the lines back into a handle before passing them to the parser, since the parser rather
takes a handle. You can use the Bio.File.StringHandle class for this purpose.
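The two steps of the next method, and the conversion of the lines back into a handle, can be sketched as follows. This is an illustration in Python 3 under stated assumptions: DemoIterator and parse_id are invented names, records are assumed to end with a "//" line, and io.StringIO plays the role that Bio.File.StringHandle has in Biopython.

```python
import io


class DemoIterator:
    """Returns one parsed record per call to next(), and None at the end."""

    def __init__(self, handle, parser):
        self._handle = handle
        self._parser = parser

    def next(self):
        # 1) read the lines of one record, up to the '//' terminator
        lines = []
        for line in self._handle:
            if line.startswith("//"):
                break
            lines.append(line)
        if not lines:
            return None
        # 2) call the parser, after wrapping the lines into a handle again
        return self._parser(io.StringIO("".join(lines)))


def parse_id(handle):
    # toy parser: returns the second field of the first line of the record
    return handle.readline().split()[1]


data = io.StringIO("ID 1.1.1.1\nDE first\n//\nID 1.1.1.2\nDE second\n//\n")
iterator = DemoIterator(data, parse_id)
ids = []
record = iterator.next()
while record:
    ids.append(record)
    record = iterator.next()
print(ids)
```

The explicit next method mirrors the listings of this course; in current Python, the same class would usually define __iter__ and __next__ so that it can be used in a for loop.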
handle = open(argv[1])
iterator = EnzymeIterator(handle)
record = iterator.next()
while record:
    print record['id'], record['references']
    record = iterator.next()
handle.close()
(Solution A.51)
database = open(db)
iterator = EnzymeIterator(database)
record = iterator.lookup(id)
if record:
    print record['id'], join(record['references'],"")
else:
    print "id not found in database ", db
database.close()
(Solution A.52)
13.1.5. Dictionary
In Section 12.2, we have seen how to define operators for a class. Namely, the __getitem__ method is a way
to define an operator to provide an indexed access to data.
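A minimal sketch of such an indexed access, written in Python 3. The class DemoDictionary and its tab-separated record format are invented for the illustration; a real dictionary class would read its records from a database file.

```python
class DemoDictionary:
    """Indexed access to records stored as 'id<TAB>text' lines (toy format)."""

    def __init__(self, lines):
        self._lines = lines

    def __getitem__(self, key):
        # called for the expression dictionary[key]
        for line in self._lines:
            identifier, text = line.split("\t", 1)
            if identifier == key:
                return text
        raise KeyError(key)


enzymes = DemoDictionary(["1.16.3.1\tFerroxidase",
                          "1.1.1.1\tAlcohol dehydrogenase"])
print(enzymes["1.16.3.1"])

try:
    enzymes["9.9.9.9"]
except KeyError as error:
    print("id not found:", error)
```

Raising KeyError for an unknown identifier makes the class behave like a built-in dictionary, so that callers can handle missing entries in the usual way.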
enzyme = EnzymeDictionary(db)
record = enzyme[id]
print record['references']
You may improve the code above by handling the KeyError exception.
(Solution A.53)
Exercise 13.9. Fetching enzymes referenced in a SwissProt entry and displaying related
proteins
Re-using Exercise 11.11 to find the enzyme number from the description text of a SwissProt entry, fetch the
corresponding enzyme data. Return the list of SwissProt records referenced by the enzyme entry, and display their
entry name and description. (Solution A.55)
class Consumer(AbstractConsumer):
    def end_tree(self):
        pass
    def begin_node(self):
        pass
    def end_node(self):
        pass
You will also need an is_rooted function, that takes a string as parameter, to test whether the tree to be parsed
is rooted or not. (Solution A.56)
very simple representation, such as a basic Node class for representing the tree, that would include some attributes
such as:
• name
• children
• length
You might also need a Stack class for handling internal nodes recursively. (Solution A.57).
treefile = sys.argv[1]
fh = open(treefile)
parser = Parser()
tree = parser.parse(fh)
print tree
(Solution A.58).
Exercise 13.13. Fetch a PDB entry from the RCSB Web server
Look at the code of the Bio.WWW.ExPASy module, e.g. function get_sprot_raw, and define a function
get_pdb_entry_remote that returns a handle (like the object returned by open()) on a given PDB entry.
The url that you need is: http://www.rcsb.org/pdb/cgi/export.cgi/%s.pdb?format=PDB&compression=None&pdbId=%s.
Try your code with the 1KCW [http://www.rcsb.org/pdb/cgi/export.cgi/1KCW.pdb?format=PDB&compression=None&pdbId=1
ident, which is the PDB entry corresponding to the CERU_HUMAN protein we are studying.
Solution A.59
• __init__
• __str__
• set_id
• set_pdb_ident
• add_dbref
• add_ssbond
• add_residue
• add_atom
and (selectors):
• get_residues
• get_atoms
• get_ssbonds
• get_residues_by_name
• get_residues_of_chain
• residue_model
• residue_chain
Solution A.60
We don’t really want to do this for real data. Instead, we now want to load the structure from a PDB file.
The PDBConsumer class we would like to build for this purpose roughly follows the scanner/consumer scheme
that we have seen previously (Section 13.1). Our consumer’s job is to build a PDBStructure object as it
receives parsing events. As a "scanner", you can actually use this PDBParser [modules/PDBParser.py] provided
by Thomas Hamelryck (thamelry@vub.ac.be).
Solution A.61
Tip
The methods of the PDBConsumer class should thus correspond to the "events" (or callbacks)
of the PDBParser class. For instance, the statement in PDBParser:
self.structure_builder.set_ssbond(_from,_to)
calls the set_ssbond method of the consumer (here structure_builder is an equivalent of our
"consumer").
Tip
When using this PDBParser class, the actual structure is returned to the program as follows:
parser=PDBParser(PDBConsumer()) ❶
struct = parser.get(id, file) ❷
❶ Parser instantiation: the consumer is passed as an argument to the __init__ of the parser.
❷ Structure creation: the get method in the parser takes an id and a filename as arguments.
Tip
You also have to know that the get method in the PDBParser needs to call a get method in the
consumer, whose only task is to actually return the PDBStructure just built.
Exercise 13.17. Compare 3D disulfide bonds with Cys positions in the alignment (take
#1).
Compare annotated and computed disulfide bonds in 1KCW with cysteine positions in the alignment. Take the code
written in Exercise 11.24 to get positions with a high level of cysteines and check if they correspond to the bonds in
the 3D structure.
Solution A.63
Exercise 13.18. Compare 3D disulfide bonds with Cys positions in the alignment (take
#2).
The positions in the alignment and in the structure are somewhat different. Write a method pdb2seq_pos which
uses the DBREF lines of the PDB entry (see method add_dbref). Then, you can use this pdb2seq_pos method
to display the actual positions in the alignment.
Solution A.64
Chapter 14. Graphics in Python
• GUI toolkits.
14.1. Tutorials
• Python and Tkinter Programming [http://www.manning.com/Grayson/], John E. Grayson (book). The chapter
"Graphs and charts" is available online [http://www.manning.com/grayson/chapt11.pdf], together with the
source code [http://www.manning.com/getpage.html?project=grayson&filename=Source.html] of the examples.
14.2. Software
• Gnuplot.py [http://gnuplot-py.sourceforge.net/]
• Vaults of Parnassus [http://www.vex.net/parnassus/]: see the "Graphics" section, which has a "GUI" (Graphical
User Interfaces) section.
• Piddle [http://piddle.sourceforge.net/]: module for creating two-dimensional graphics in a manner that is both
cross-platform and cross-media
• PyGist [http://w3.pppl.gov/~hammett/comp/python/koer.ioc.ee/man/pygraph/PyGist/PyGist_Title.mkr.html]
(PDF manual [http://w3.pppl.gov/~hammett/comp/python/PyGraphics/pygist.pdf] and presentation
[http://www.python.org/workshops/1996-06/papers/l.busby-gist.html]).
• Plotting codon frequency (Example 11.4). This example uses a Tkinter canvas to draw a bar chart. Documentation
on the Tkinter canvas can be found here [http://www.pythonware.com/library/tkinter/introduction/canvas.htm].
• Plotting Cys conserved positions (Exercise 11.25). This example uses the Pmw.Blt [http://pmw.sourceforge.net/doc/Blt.html]
package to draw a plot representing Cys conservation at each position of an alignment.
Appendix A. Solutions
A.1. Introduction to basic types in Python
Solution A.1. GC content ()
>>> translate(dna, t)
'cgtactgcaataatgctgagacagtgcggcgccacgctgactccgcaccgcagacgacccggaaatgaa
gcggaggcgcgggacgtaaggcaaggaccggagc'
    res = []
    site = dna.find(enz)
    while site != -1:
        res.append(site)
        site = dna.find(enz, site + 1)
    return res
    Lcuts = []
    # get all cut positions
    for enz,pcut in (enzlist):
        print enz, pcut
        start = 0
        stop = dna.find(enz)
        while stop != -1:
            Lcuts.append(stop + pcut)
            stop = dna.find(enz, stop+1)
    # sort
    Lcuts.sort()
    return Lcuts
def frag_len(Lcuts):
    """
    get fragment lengths from a list containing the cutting positions
    of a restriction digest sorted by order
    + start(=0) and end(=dna length) of the dna sequence
    """
    Lres = []
    start = Lcuts[0]
    for end in Lcuts[1:]:
        Lres.append(end-start)
        start = end
    return Lres
def revcomp(dna):
    """ reverse complement of a DNA sequence """
    prot = ""
    for i in xrange(0,len(cdna),3):
        prot += code.get(cdna[i:i+3], "?")
    return prot
❶ This is a special syntax named list comprehension. It creates a list and populates it with the results of the
first expression by replacing i with all values of the for loop (see also Chapter 7).
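As a reminder of how this syntax works, here is a short sketch in Python 3 that splits a coding sequence into codons, first with a list comprehension and then with the equivalent explicit loop. The sequence is the one used with dna_translate in Chapter 9.

```python
cdna = "atggacaatttttccgggacgtag"

# list comprehension: one codon for each value taken by i
codons = [cdna[i:i + 3] for i in range(0, len(cdna), 3)]
print(codons)

# equivalent explicit loop
codons_loop = []
for i in range(0, len(cdna), 3):
    codons_loop.append(cdna[i:i + 3])
print(codons == codons_loop)

# an optional condition filters the values
stops = [codon for codon in codons if codon in ("taa", "tag", "tga")]
print(stops)
```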
def header(title):
    """splits a fasta header in ID and description of a sequence
    if one of the two is not given None is returned instead"""
    id = desc = None
    res = title.split(None,1)
    if len(res) == 0:
        pass
    elif len(res) == 2:
        id, desc = res[0][1:], res[1]
    elif title[0] in string.whitespace:
        desc = res[0]
    else:
        id = res[0][1:]
nb = 0
for i in dna:
if i not in ’atgc’:
nb += 1
return nb
else:
return "dna ok"
# second version
    ok = 1
    for base in dna:
        if base in alphabet:
            pass ❶
        else:
            ok = 0
    if ok:
        return "dna ok"
A.3. Functions
Solution A.12. DNA complement function
from string import *
def complement(dna):
    "function to calculate the complement of a DNA sequence" ❶
❶ If the first statement of a function is a string, this string is the documentation of the function. It can be
accessed by func.func_doc (or, equivalently, func.__doc__).
>>> complement.func_doc
'function to calculate the complement of a DNA sequence'
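A small self-contained sketch of the documentation string mechanism follows. It is written for Python 3, where the attribute is named __doc__ (the func_doc alias exists only in Python 2, while __doc__ works in both). The body of complement shown here is an assumption made for the example, not the course's solution.

```python
def complement(dna):
    "function to calculate the complement of a DNA sequence"
    pairs = {"a": "t", "t": "a", "g": "c", "c": "g"}
    # bases that are not in the table are left unchanged
    return "".join([pairs.get(base, base) for base in dna])


print(complement.__doc__)   # function to calculate the complement of a DNA sequence
print(complement("gcat"))   # cgta
```

The same string is what pydoc and the interactive help() function display.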
    digests = []
    for i in range(len(enzymes)):
        for k in range(i+1, len(enzymes)):
            digests.append( [enzymes[i], enzymes[k]] )
    return digests
import sys
print sys.argv
In order to create a module called dna, just put your function definitions in a file called dna.py (dna.py
[exercises/dna.py]).
To display its documentation, run:
pydoc dna
seq=Seq("actttgccatatg")
Which import statements are necessary to make the following code work?
expasy = ExPASy.get_sprot_raw('CERU_HUMAN')
sp = SProt.Iterator(expasy, SProt.RecordParser())
record = sp.next()
print record.keywords
❶ This statement imports "all" components from the Bio.WWW package, including the ExPASy module (see:
pydoc Bio.WWW and pydoc Bio.WWW.ExPASy, and look at the __all__ and __path__ variables in the DATA
section).
The reason is that Bio.SubsMat.FreqTable is the module containing FreqTable, not the class (see pydoc
Bio.SubsMat.FreqTable). The FreqTable class is available as Bio.SubsMat.FreqTable.FreqTable.
The import statement should be:
a=ClustalAlignment()
NameError: name 'ClustalAlignment' is not defined
Look at the __all__ variable in the __init__.py module file (or with pydoc Bio.Clustalw). It is not empty,
but it does not contain ClustalAlignment. The import statement should be (see Example 9.2):
Display the length of a sequence, and count the number of occurrences of ’a’.
seq = Seq('gcatgacgttattacgactctgtcacgccgcggtgcgactgaggcgtggcgtctgctggg')
print len(seq)
print seq.count('a')
Display GC content.
seq = Seq('gcatgacgttattacgactctgtcacgccgcggtgcgactgaggcgtggcgtctgctggg')
# the parentheses are required: the division binds more tightly than the addition
gc = (seq.count('c') + seq.count('g')) / float(len(seq)) * 100
print gc
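The GC content computation can be checked on plain strings, without Biopython. The sketch below is written for Python 3; the function name gc_content is an assumption. It also shows why the sum of the two counts has to be parenthesized before dividing.

```python
def gc_content(seq):
    """Return the percentage of g and c in a sequence (0.0 for an empty one)."""
    seq = seq.lower()
    if not seq:
        return 0.0
    return (seq.count("g") + seq.count("c")) / float(len(seq)) * 100


print(gc_content("atgc"))   # 50.0

# without the parentheses, only the second count is divided and scaled:
wrong = "atgc".count("c") + "atgc".count("g") / float(len("atgc")) * 100
print(wrong)                # 26.0, i.e. 1 + 25.0
```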
dna = Seq('gcatgacgttattacgactctgtcacgccgcggtgcgactgaggcgtggcgtctgctggg') ❶
seq = SeqRecord(dna, id = 'my_seq', description= 'a random sequence') ❷
out = FASTA.FastaWriter(stdout)
out.write(seq)
❶ Pick one value (here 0) among the possible integer values (which should be equally distributed during the
span given).
❷ The sequence must be from class MutableSeq, of course.
import Bio.Fasta
from sys import *
from string import *
from dna import codons
from mutateseq import mutateseq
file = argv[1]
handle = open(file)
it = Bio.Fasta.Iterator(handle, Bio.Fasta.SequenceParser())
count = {} ❶
count_random = {}
seq = it.next()
while seq:
    for codon in codons(seq.seq.tostring()): ❷
        if count.has_key(codon):
            count[codon] += 1
        else:
            # first occurrence of this codon
            count[codon] = 1
    mutableseq = seq.seq.tomutable()
    mutateseq(mutableseq,span=1000,p=0.1)
    for codon in codons(mutableseq.tostring()):
        if count_random.has_key(codon):
            count_random[codon] += 1
        else:
            # first occurrence of this codon
            count_random[codon] = 1
    seq = it.next()
handle.close()
l=count.items()
l.sort()
print "count: ", l
l=count_random.items()
l.sort()
print "random: ", l
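The counting logic itself can be tried on a plain string. The following sketch (Python 3) uses dict.get so that the first occurrence of a codon is counted as 1; the function name count_codons and the test sequence are assumptions made for the example.

```python
def count_codons(cdna):
    """Count the occurrences of each complete codon of a sequence."""
    counts = {}
    # ignore a trailing incomplete codon, if any
    usable = len(cdna) - len(cdna) % 3
    for i in range(0, usable, 3):
        codon = cdna[i:i + 3]
        counts[codon] = counts.get(codon, 0) + 1
    return counts


counts = count_codons("atggctgctatgtaa")
for codon in sorted(counts):
    print(codon, counts[codon])
```

The total of all counts equals the number of complete codons in the sequence, which makes off-by-one errors in the initialization easy to spot.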
#--------------------------------------------------------
# bar charts of codons frequencies
# - for legibility, 2 charts are built
# - both random and normal frequencies are displayed
def codon_sort(a,b):
    if a < b:
        return -1
    elif a > b:
        return 1
    else:
        return 0
labels=count.keys()
labels.sort(codon_sort)
w1=window(plot_title='Count codons',width=1000)
y=array(count.values())[:len(count)/2]
x=arange(len(y)+1)
w1.bar(y,x,label=labels[:len(count)/2])
w2=window(plot_title='Count codons(2)',width=1000)
y=array(count.values())[(len(count)/2)+1:]
x=arange(len(y)+1)
w2.bar(y,x,label=labels[(len(count)/2)+1:])
y=array(count_random.values())[:len(count_random)/2]
x=arange(len(y)+1)
w1.bar(y,x,label=labels[:len(count_random)/2])
y=array(count_random.values())[(len(count_random)/2)+1:]
x=arange(len(y)+1)
w2.bar(y,x,label=labels[(len(count_random)/2)+1:])
def convert_sp_fasta(infile,outfile):
    """
    convert a SwissProt file into a Fasta formatted file
    """
    in_h = open(infile)
    sp = SProt.Iterator(in_h, SProt.SequenceParser())
    out_h = FASTA.FastaWriter(outfile)
    sequence = sp.next()
    out_h.write(sequence)
    in_h.close()
    out_h.close()
import re
def get_enzyme_ref(record):
    description = record.description
    enzyme_re = re.compile(r'\(EC\s+(?P<id>([\w\.]+))\).*')
    m=enzyme_re.search(record.description)
    return m.group('id')
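The named group used in this regular expression can be tried on a plain string. The sketch below (Python 3) additionally returns None when the description contains no EC number, a case in which calling group on the failed match would raise an exception. The function name find_ec_number and both description strings are made up for the illustration.

```python
import re

# same idea as the pattern of the listing: "(EC <number>)" with a named group
ENZYME_RE = re.compile(r"\(EC\s+(?P<id>[\w.]+)\)")


def find_ec_number(description):
    """Return the EC number found in a description line, or None."""
    m = ENZYME_RE.search(description)
    if m is None:
        return None
    return m.group("id")


print(find_ec_number("CERULOPLASMIN PRECURSOR (EC 1.16.3.1) (FERROXIDASE)."))
print(find_ec_number("HYPOTHETICAL PROTEIN."))
```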
import Bio.Prosite
prosite=Bio.Prosite.ExPASyDictionary(parser=Bio.Prosite.RecordParser())
def get_prosite_pattern(id):
    record=prosite[id]
    return record.pattern
You can also use a local Prosite database, and use the golden program to fetch entries:
def get_prosite_pattern_local(id):
    cmd="golden prosite:" + id
    handle=popen(cmd, 'r')
    iterator=Iterator(handle,RecordParser())
    record=iterator.next()
    handle.close()
    return record.pattern
def get_prosite_refs(handle):
    sp = SProt.Iterator(handle, SProt.RecordParser())
    refs=[]
    record = sp.next()
    for ref in record.cross_references:
        if ref[0] == 'PROSITE':
            refs.append(ref[1])
    return refs
To display the patterns of the Prosite references given in a SwissProt entry, you can proceed as follows:
import sys
from Bio.SwissProt import SProt
from sprot import get_prosite_refs, get_prosite_pattern
sp = open(sys.argv[1])
prosite_refs = get_prosite_refs(sp)
sp.close()
for id in prosite_refs:
    print id
    pattern=get_prosite_pattern(id)
    print pattern
Solution A.31. Search for occurrences of a protein's PROSITE patterns in its sequence
Exercise 11.14
import sys
from Bio.Prosite import Pattern
from Bio.SwissProt import SProt
from sprot import get_prosite_refs, get_prosite_pattern
# prosite refs
sp = open(sys.argv[1])
prosite_refs = get_prosite_refs(sp)
sp.close()
# sequence
sp = open(sys.argv[1])
iterator = SProt.Iterator(sp, SProt.SequenceParser())
seq = iterator.next().seq
sp.close()
for id in prosite_refs:
    print id
    pattern=get_prosite_pattern(id)
    print pattern
    p = Pattern.compile(pattern)
    m = p.search(seq)
    print "[", m.start(), ":", m.end(), "]", seq[m.start():m.end()]
A.5.3. GenBank
Solution A.32. Extracting the complete CDS from a GenBank entry
Exercise 11.15
import string
def get_complete_cds(record):
    """
    record should be an instance of Bio.SeqRecord.Record
    """
    if string.find(record.description, 'complete cds') == -1:
        return None
    for feature in record.features:
        if feature.type == 'CDS':
            seq = record.seq
            return seq[feature.location.start.position:feature.location.end.position]
    return ""
A.5.4. Blast
Solution A.33. Local Blast
Exercise 11.16
query_file = sys.argv[1]
# blast
if len(sys.argv) > 2:
    # command line arguments are strings: convert before comparing with hsp.expect
    E_VALUE_THRESH=float(sys.argv[2])
else:
    E_VALUE_THRESH=0
done={}
blast_parser = NCBIStandalone.BlastParser()
blastcmd='/local/gensoft/bin/scripts/blastall'
blast_out, error_info = NCBIStandalone.blastall(blastcmd=blastcmd,
                                                program='blastp',
                                                database='swissprot',
                                                infile=query_file,
                                                expectation=1,
                                                descriptions=10,
                                                alignments=10)
blast_record = blast_parser.parse(blast_out)
for (description,alignment) in zip(blast_record.descriptions,blast_record.alignments):
    hsp_nb = 0
    for hsp in alignment.hsps:
        hsp_nb = hsp_nb + 1
        if hsp.expect <= E_VALUE_THRESH:
            sbjct=hsp.sbjct.replace('-','')
            print "%s HSP %d " % (description.title, hsp_nb)
#
# run: script query_file [db [result_file]]
#
from Bio.Blast import NCBIWWW
from Bio import Fasta
from sys import *
query_file = open(argv[1])
if len(argv) >= 3:
    # e.g: nr
    db = argv[2]
else:
    db = 'swissprot'
if len(argv) >= 4:
    result_file = argv[3]
else:
    result_file = argv[1] + '.blast'
fasta = Fasta.Iterator(query_file)
query = fasta.next()
query_file.close()
blast_results = open(argv[1])
blast_parser = NCBIWWW.BlastParser()
record = blast_parser.parse(blast_results)
❶ The zip Python function merges two lists into a single list of pairs, taking one item from each list at every
position. So, in the code above, each iteration of the loop receives a description together with the alignment
found at the same position.
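The behaviour of zip is easy to check on two small lists. The sketch below is written for Python 3, where zip returns an iterator (hence the call to list); the strings merely stand in for Blast description and alignment objects.

```python
descriptions = ["hit 1", "hit 2", "hit 3"]
alignments = ["alignment 1", "alignment 2", "alignment 3"]

# zip groups the items found at the same position into tuples
pairs = list(zip(descriptions, alignments))
print(pairs)

# each tuple can be unpacked directly in the for statement
for description, alignment in pairs:
    print(description, "->", alignment)
```

If the two lists differ in length, zip stops at the end of the shorter one.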
file = argv[1]
E_VALUE_THRESH = 0.04
b_parser = NCBIStandalone.PSIBlastParser()
sp_id = argv[1]
if len(argv) > 2:
    db = argv[2]
else:
    db='swissprot'
# blast config
blast_parser = NCBIStandalone.PSIBlastParser()
E_VALUE_THRESH = 0.04
blastcmd='/local/gensoft/bin/scripts/blastpgp'
#----------------------------------------------------------------
# get SP entry and PROSITE references
print >>stderr, "Fetching ", sp_id, " from ExPASy..."
expasy = ExPASy.get_sprot_raw(sp_id)
prosite_refs = sprot.get_prosite_refs(expasy)
expasy.close()
queryfile = write_query(sp_id)
#----------------------------------------------------------------
# actual phi-blasting of each PROSITE pattern
for ref in prosite_refs:
    pattern = sprot.get_prosite_pattern(ref)
    print >>stderr, "Doing ", ref, " ...."
    print >>stderr, pattern
    patternfile = write_pattern(ref, pattern)
    blast_record = blast_parser.parse(blast_out)
    for round in blast_record.rounds:
        for alignment in round.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print '****Alignment****'
                    print 'sequence:', alignment.title
                    print 'length:', alignment.length
                    print 'e value:', hsp.expect
                    print hsp.query[0:75] + '...'
                    print hsp.match[0:75] + '...'
                    print hsp.sbjct[0:75] + '...'
    os.unlink(patternfile)
os.unlink(queryfile)
from os import *
import string
DB_ROOT = ’/local/databases/fasta’
result=run_fasta('data/ceru_human.fasta', 'gpmam')
A.5.5. Clustalw
Solution A.39. Doing a Clustalw alignment
Exercise 11.22
import os
from Bio.Clustalw import MultipleAlignCL
from Bio.Clustalw import do_alignment
from sys import *
cline = MultipleAlignCL(argv[1])
cline.set_output(argv[2])
print "Command line: ", cline
align = do_alignment(cline)
for seq in align.get_all_seqs():
    print seq.description
    print seq.seq
fasta_seqs = FastaAlignment(alphabet=IUPAC.protein)
# first, put the entire query sequence in the fasta set of sequences
fasta_handle = open(sys.argv[1])
seq = FASTA.FastaReader(fasta_handle).next()
fasta_handle.close()
fasta_seqs.add_sequence(descriptor=seq.description,
sequence=seq.seq.tostring())
# blast
E_VALUE_THRESH=0
done={}
blast_parser = NCBIStandalone.BlastParser()
blastcmd='/local/gensoft/bin/scripts/blastall'
blast_out, error_info = NCBIStandalone.blastall(blastcmd=blastcmd,
                                                program='blastp',
                                                database='swissprot',
                                                infile=sys.argv[1],
                                                expectation=1,
                                                descriptions=10,
                                                alignments=10)
blast_record = blast_parser.parse(blast_out)
for (description,alignment) in zip(blast_record.descriptions,blast_record.alignments):
    hsp_nb = 0
    for hsp in alignment.hsps:
        hsp_nb = hsp_nb + 1
        if hsp.expect <= E_VALUE_THRESH:
            sbjct=hsp.sbjct.replace('-','')
            title = description.title
            if done.has_key(title):
                continue
            else:
                done[title] = 1
                print "%s HSP %d " % (title, hsp_nb)
                fasta_seqs.add_sequence(descriptor="%s HSP %d " % (title, hsp_nb),
                                        sequence=sbjct)
# alignment
cline = MultipleAlignCL(alig_f)
cline.set_output(clustalw_out)
clustalw_align = do_alignment(cline)
os.unlink(alig_f)
import Bio.Clustalw
import Bio.Align.AlignInfo
from Bio.Alphabet import IUPAC
from sys import *
if len(argv) == 2:
    threshold=40.0
else:
    # command line arguments are strings: convert to a number
    threshold=float(argv[2])
# ----------------------
print "Conservation above %d: " % threshold
for pos in xrange(alig_len):
    for letter in pssm[pos].keys():
        percent = (pssm[pos][letter] / max) * 100.0
y = []
for pos in xrange(alig_len):
    max_percent = 0
    for letter in pssm[pos].keys():
        percent = (pssm[pos][letter] / max) * 100.0
        if letter == 'C' and percent > max_percent:
            max_percent = percent
    y.append(max_percent)
A.6. Classes
Solution A.43. Define a PDB structure class
Exercise 12.1
class PDBStructure:
    def __init__(self):
        self._residues=[]
❶ The residue is an anonymous dictionary returned as the result of the method call, so that the user of
the class can pass it as an argument to the next add_atom method call.
❷ It is the residue structure passed as argument that is actually changed, not a copy of it.
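The point made by the two notes, namely that the dictionary returned by one method is the very object stored in the structure, can be verified with a reduced sketch (Python 3). The method signatures are simplified assumptions: the add_residue and add_atom methods of the exercise take more arguments.

```python
class PDBStructure:
    def __init__(self):
        self._residues = []

    def add_residue(self, name, posseq):
        # the residue is a plain dictionary, stored and returned
        residue = {"name": name, "posseq": posseq, "atoms": []}
        self._residues.append(residue)
        return residue

    def add_atom(self, residue, name, coord):
        # modifies the dictionary passed as argument, not a copy of it
        residue["atoms"].append({"name": name, "coord": coord})


struct = PDBStructure()
res = struct.add_residue("CYS", 155)
struct.add_atom(res, "SG", (1.0, 2.0, 3.0))

# the atom is visible through the structure as well
print(struct._residues[0]["atoms"])
print(struct._residues[0] is res)
```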
class PDBStructure:
    def __init__(self):
        self.residues=[]
        self._ssbonds = []
        self.dbrefs = ""
    def get_residues(self):
        return self.residues
result.append(residue)
return result
if __name__ == '__main__':
    print "--------------testing my class--------------------"
    struct = PDBStructure()
    model_id = 0
    chain_id = "A"
    chain_id = "B"
[Figure: class diagram of the Biopython alphabet classes. Labels recovered from the figure:]
ExtendedIUPACDNA: id "extended_dna", letters "GATCBDSW"
NucleotideAlphabet
IUPACAmbiguousRNA: id "ambiguous_rna", letters "GAUCRYWSMKHBVDN"
RNAAlphabet
IUPACUnambiguousRNA: id "unambiguous_rna", letters "GAUC"
SecondaryStructure: size = 1, letters = "HSTC"
HasStopCodon: stop_symbol = "*"
AlphabetEncoder: alphabet, new_letters
Gapped: gap_char = "-"
class PDBStructure:
    def __init__(self):
        self.residues=[]
        self._verbose = 0
class PDBStructure:
    _verbose = 0
    def __init__(self):
        self.residues=[]
                   'posseq': posseq,
                   'atoms': []}
        self.residues.append(residue)
        return residue
if __name__ == '__main__':
    print "--------------testing my class--------------------"
    struct = PDBStructure()
    model_id = 0
    chain_id = "A"
    chain_id = "B"
class EnzymeConsumer(AbstractConsumer):
    def __init__(self):
        self._references = ""
handle = open(argv[1])
scanner = Enzyme._Scanner()
consumer = EnzymeConsumer()
scanner.feed(handle, consumer)
print "results: ", consumer._references
class EnzymeConsumer(AbstractConsumer):
    def __init__(self):
        self._references = {}
    def end_record(self):
        self._references[self._id] = self._refs
handle = open(argv[1])
scanner = Enzyme._Scanner()
consumer = EnzymeConsumer()
scanner.feed(handle, consumer)
for id in consumer._references.keys():
    print id, consumer._references[id]
class EnzymeParser(AbstractParser):
    def __init__(self):
        self._scanner = Enzyme._Scanner()
        self._consumer = EnzymeConsumer()
        return self._consumer._references
handle = open(argv[1])
parser = EnzymeParser()
references = parser.parse(handle)
for id in references.keys():
    print id, references[id]
import re
import string
class EnzymeIterator:
    blank = re.compile(r'^\s*$')
    def next(self):
        lines = ""
        while 1:
            line = self._uhandle.readline()
            if not line:
                break
            if EnzymeIterator.blank.match(line):
                break
            lines += line
            if line[:2] == '//':
                break
        if not lines:
            return None
handle = open(argv[1])
iterator = EnzymeIterator(handle)
record = iterator.next()
while record:
    print record['id'], record['references']
    record = iterator.next()
handle.close()
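The iteration protocol used here (read lines until the // terminator, return None when the input is exhausted) can be tried without Biopython. The sketch below is written for Python 3 and returns the raw text of each record instead of a parsed one; the class name RecordIterator and the two-record sample are assumptions made for the illustration.

```python
import io


class RecordIterator:
    """Return the records of a '//'-terminated flat file one at a time."""

    def __init__(self, handle):
        self._handle = handle

    def next(self):
        lines = []
        while True:
            line = self._handle.readline()
            if not line:            # end of file
                break
            lines.append(line)
            if line[:2] == "//":    # end of the current record
                break
        if not lines:
            return None
        return "".join(lines)


data = "ID   1.1.1.1\nDE   Alcohol dehydrogenase.\n//\nID   1.1.1.2\n//\n"
iterator = RecordIterator(io.StringIO(data))

records = []
record = iterator.next()
while record:
    records.append(record)
    record = iterator.next()

print(len(records))   # 2
```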
The consumer and the parsers may be the same as before (Exercise 13.3). However, in order to get this output:
The consumer and the parser are the same as in Exercise 13.5. The iterator defines an additional method,
lookup, to search in the database:
import re
import string
class EnzymeIterator:
    def lookup(self,id):
        ID = re.compile(r'ID\s*(?P<id>([\w\.]+))\s*')
        while 1:
            record = self.next()
            if not record:
                break
            m=ID.match(record['id'])
            if m.group('id') == id:
                return record
        return None
    def next(self):
        blank = re.compile(r'^\s*$')
        lines = []
        start=1
        while(1):
            line = self._uhandle.readline()
            if start:
                while line[:2] == 'CC':
                    line = self._uhandle.readline()
            if start:
                while line[:2] == '//':
                    line = self._uhandle.readline()
                start=0
            if not line:
                break
            if blank.match(line):
                break
            lines.append(line)
            if line[:2] == '//':
                break
        if not lines:
            return None
        data = string.join(lines, '')
        if self._parser is not None:
            return self._parser.parse(File.StringHandle(data))
        return data
ENZYMEDB = '/local/databases/release/Enzyme/enzyme.dat'
o, id = getopt.getopt(argv[1:], 'd:')
opts = {}
for k,v in o:
    opts[k] = v
if opts.has_key('-d'):
    db = opts['-d']
else:
    db = ENZYMEDB
if len(id) < 1:
    usage(); sys.exit("provide an id to search")
else:
    id=id[0]
database = open(db)
iterator = EnzymeIterator(database)
record = iterator.lookup(id)
if record:
    print record['id'], join(record['references'],"")
else:
    print "id: ", id, " not found in database '", db, "'"
database.close()
class EnzymeDictionary:
    #_ID = re.compile(r'ID\s*(?P<id>([\w\.]+))\s*')
ENZYMEDB = '/local/databases/release/Enzyme/enzyme.dat'
o, id = getopt.getopt(argv[1:], 'd:')
opts = {}
for k,v in o:
    opts[k] = v
if opts.has_key('-d'):
    db = opts['-d']
else:
    db = ENZYMEDB
if len(id) < 1:
    usage(); sys.exit("provide an id to search")
else:
    id=id[0]
enzyme = EnzymeDictionary(db)
try:
    record = enzyme[id]
    print record['id'], join(record['references'],"")
except KeyError, e:
    print "key not found: ", e
Create an EnzymeParsing.py file containing the required classes (the one used in Exercise 13.7 preferably).
Solution A.55. Fetching enzymes referenced in a SwissProt entry and related proteins
Exercise 13.9
sp_record = sprot.get_sprot_entry_local(argv[1])
enzyme_id = sprot.get_enzyme_ref(sp_record)
enzyme = EnzymeDictionary('/local/databases/release/Enzyme/enzyme.dat')
enzyme_record = enzyme[enzyme_id]
for ref in enzyme_record['references']:
    sp_id = ref['id']
    sp_r = sprot.get_sprot_entry_local(sp_id)
    print sp_r.entry_name, sp_r.description
class Scanner:
        tree_text = "".join(uhandle.readlines()).replace("\n","")
        tree_text = re.sub('\s', '', tree_text)
        print "text: ", tree_text
        pos = 0
        rooted = is_rooted(tree_text)
        consumer.start_tree(rooted)
        while 1:
            c = tree_text[pos]
            if c == '(':
                consumer.begin_node()
                pos += 1
            elif c == ')':
                consumer.end_node()
                pos += 1
            elif c == ',':
                pos += 1
            elif c == ':':
                # ready to process branch length
                pos += 1
                c = tree_text[pos]
                length = ''
                while re.match('[\.\d]',c):
                    length += c
                    pos += 1
                    c = tree_text[pos]
                consumer.branch_length(float(length))
            elif c == ';':
                consumer.end_tree()
                break
            elif c == "'":
                pos += 1
            else:
                name = ''
                while re.match('\w',c):
                    name += c
                    pos += 1
                    c = tree_text[pos]
                consumer.leaf(name)
def is_rooted(tree):
    pos = 0
    c = tree[pos]
    depth = 0
    comma = 0
    for pos in range(0,len(tree)):
        c = tree[pos]
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif c == ',':
            if depth == 1:
                comma += 1
    return comma == 1
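The rooted/unrooted test relies on counting the commas found at parenthesis depth 1: a single comma means that the top-level node has two children (rooted), while two commas mean three children (unrooted). The sketch below is a compact restatement for Python 3 that can be run on small Newick strings; the two example trees are made up, and the test assumes strictly bifurcating internal nodes.

```python
def is_rooted(tree):
    """Return True if the top-level node of a Newick string has two children."""
    depth = 0
    commas = 0
    for c in tree:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
        elif c == "," and depth == 1:
            # only separators between children of the top-level node count
            commas += 1
    return commas == 1


print(is_rooted("((A,B),(C,D));"))   # True
print(is_rooted("(A,B,(C,D));"))     # False
```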
class Consumer(AbstractConsumer):
    def __init__(self):
        self.stack = Stack()
        self.data = None
    def begin_node(self):
        node = Node()
        self.stack.push(node)
        self.depth = self.depth + 1
    def end_node(self):
        # 3 cases: depth 1 and unrooted tree (3 nodes)
        # or internal node (2 nodes)
        # or (depth 1 and rooted tree) (2 nodes)
        if self.depth == 1 and not self.rooted:
            right = self.stack.pop()
            middle = self.stack.pop()
            left = self.stack.pop()
            parent = self.stack.top()
            parent.left = left
            parent.middle = middle
            parent.right = right
        else:
            # internal node or rooted tree
            right = self.stack.pop()
            left = self.stack.pop()
            parent = self.stack.top()
            parent.left = left
            parent.right = right
        self.depth -= 1
    def end_tree(self):
        self.data = self.stack.pop()
class Node:
    def __str__(self):
        if self.name is None:
            if self.middle is not None:
                return "(" + str(self.left.__str__()) + "," + str(self.middle.__str__()) + "," + str(
            else:
                return "(" + str(self.left.__str__()) + "," + str(self.right.__str__()) + ':' + str(s
        else:
            return str(self.name) + ':' + str(self.length)
class Stack:
    def __init__(self):
        self._l = []
    def pop(self):
        last = self._l[-1]
        self._l = self._l[:-1]
        return last
    def top(self):
        return self._l[-1]
    def empty(self):
        self._l = []
import sys
from Bio.ParserSupport import *
class Parser(AbstractParser):
    def __init__(self):
        self._scanner = Scanner()
        self._consumer = Consumer()
A.7.3. PDB
import urllib
import string
from Bio import File
def get_pdb_entry_remote(id):
    #http://www.rcsb.org/pdb/cgi/export.cgi/1KCW.pdb?format=PDB&pdbId=1KCW&compression=None
    fullcgi = "http://www.rcsb.org/pdb/cgi/export.cgi/%s.pdb?format=PDB&compression=None&pdbId
    #print fullcgi
    handle = urllib.urlopen(fullcgi)
    uhandle = File.UndoHandle(handle)
    if not uhandle.peekline():
        raise IOError, "no results"
    return uhandle
def get_pdb_entry_local(id):
    id = string.lower(id)
    filename = "data/pdb" + id + ".pdb"
    try:
        print "trying to open " , filename
        handle = open(filename)
    except IOError, e:
        print e
        filename = "data/" + id + ".pdb"
        try:
            print "trying to open " , filename
            handle = open(filename)
        except IOError, e:
            print e
            filename = "data/pdb" + id + ".ent"
            try:
                print "trying to open " , filename
                handle = open(filename)
            except IOError, e:
                print e
                return None
    return handle
Add the following code to the code already written in Exercise 12.6:
    def __str__(self):
        for residue in self._residues:
            print residue
        return ""
#
# PDBConsumer creates instances of PDBStructure
#
class PDBConsumer(AbstractConsumer):
    _verbose = 0
    def __init__(self):
        self._current_struct = None
        if self._verbose:
            print "init_chain: ", chain_id
        self._current_chain_id = chain_id
        residue = self._current_struct.add_residue(self._current_model_id,
                                                   self._current_chain_id,
                                                   name, field, posseq, icode)
        self._current_residue = residue
        self._current_struct.add_atom(self._current_residue,
                                      name, coord, tempfactor,
                                      occupancy, altloc, element)
    def get(self):
        return self._current_struct
    BRIDGE_DIST=8.0
    def disulfid_bridges(self):
        sulfurs=[]
        for cys_residue in self.get_residues_by_name('CYS'):
            #print "cys: ",cys_residue['name'], cys_residue['posseq']
            for atom in cys_residue['atoms']:
                if atom['name'] == 'SG':
                    sulfurs.append({'posseq': cys_residue['posseq'],
                                    'atom': atom})
        result=[]
        nb = len(sulfurs)
        for i in xrange(nb):
            for j in xrange(i+1, nb):
                d = self.dist(sulfurs[i]['atom']['coord'], sulfurs[j]['atom']['coord'])
                if d < self.BRIDGE_DIST:
                    print "residue %d in contact with residue %d (distance:%.3f)." % (sulfurs[
                    result.append({'from': sulfurs[i]['posseq'],
                                   'to': sulfurs[j]['posseq'],
                                   'dist': d
                                   })
        return result
#
# Compute disulfide bonds.
#
# - search for sulfur (S) atoms in Cys residues of the structure
# - compute distance between all of them
# - displays residue pairs (position) where distance < BRIDGE_DIST
#
#
if __name__ == '__main__':
    p=PDBParser(PDBConsumer())
    struct = p.get("scratch", sys.argv[1])
    detected = struct.disulfid_bridges()
    for annot in struct._ssbonds:
        found=0
        for detect in detected:
            if annot['from'] == detect['from'] and annot['to'] == detect['to']:
                print annot, " also detected: ", detect['dist']
                found=1
                break
        if not found:
            print annot, " not found"
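The distance test at the heart of this solution can be isolated from the PDB classes. The sketch below (Python 3) applies it to a hand-made list of sulfur atoms; the coordinates and positions are invented for the illustration, the function names dist and close_pairs are assumptions, and the threshold is the BRIDGE_DIST value of the listing.

```python
import math

BRIDGE_DIST = 8.0   # same threshold as in the listing


def dist(a, b):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum([(x - y) ** 2 for x, y in zip(a, b)]))


def close_pairs(sulfurs, max_dist=BRIDGE_DIST):
    """Return every pair of sulfur atoms closer than max_dist."""
    result = []
    for i in range(len(sulfurs)):
        for j in range(i + 1, len(sulfurs)):
            d = dist(sulfurs[i]["coord"], sulfurs[j]["coord"])
            if d < max_dist:
                result.append({"from": sulfurs[i]["posseq"],
                               "to": sulfurs[j]["posseq"],
                               "dist": d})
    return result


# invented coordinates: the first two atoms are close, the third is far away
sulfurs = [
    {"posseq": 155, "coord": (0.0, 0.0, 0.0)},
    {"posseq": 181, "coord": (0.0, 0.0, 2.0)},
    {"posseq": 257, "coord": (30.0, 0.0, 0.0)},
]
bridges = close_pairs(sulfurs)
print(bridges)
```

Starting the inner loop at i + 1 ensures that each pair is examined only once and that an atom is never compared with itself.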
Solution A.63. Compare 3D disulfide bonds with Cys positions in the alignment (take #1).
Exercise 13.17
import Bio.Clustalw
from Bio.Seq import Seq
import Bio.Align.AlignInfo
from Bio.WWW import *
from Bio.Alphabet import IUPAC
from Bio.SwissProt import SProt
import sys
from os import *
import string
from WWWPDB import *
from PDBParser import PDBParser
from PDBConsumer import PDBConsumer
def get_pdb_entries(sprot):
    refs=[]
    for ref in sprot.cross_references:
        if ref[0] == 'PDB':
            refs.append(ref[1])
    return refs
def align2seqpos(seq,col):
    "returns the original sequence position from a gapped sequence position"
    s=list(seq.tostring())
    gaps = 0
    for i in xrange(len(s)):
        if i >= col:
            break
        if s[i] == '-':
            gaps = gaps + 1
    #print "gaps: ", gaps
    result = col - gaps
    return result
def get_seq_description(alignment,seq_nb):
    return alignment._records[seq_nb].description
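The conversion from an alignment column to a position in the ungapped sequence amounts to subtracting the number of gaps found before that column. The sketch below (Python 3) performs the same computation on a plain string instead of a Seq object; the gapped sequence is made up for the example.

```python
def align2seqpos(gapped, col):
    """Return the ungapped sequence position of alignment column col."""
    # count the gap characters located strictly before the column
    gaps = gapped[:col].count("-")
    return col - gaps


gapped = "ac--gt-a"
print(align2seqpos(gapped, 4))   # 2: 'g' is the third residue (index 2)
print(align2seqpos(gapped, 7))   # 4: the last 'a' is the fifth residue (index 4)
```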
"""
open alignment and create pssm
"""
"""
fetch PDB entry from swissprot references
"""
seq_id = get_seq_description(align,0)
refs = get_pdb_entries(seq_record)
print "PDB reference: ", refs
try:
    pdb_handle = get_pdb_entry_remote(refs[0])
except IOError, e:
    print e
    pdb_handle = get_pdb_entry_local(refs[0])
p=PDBParser(PDBConsumer())
struct = p.get_handle("scratch", pdb_handle)
# comparison
Solution A.64. Compare 3D disulfide bonds with Cys positions in the alignment (take #2).
First, add a method pdb2seq_pos to the class PDBStructure:
    def pdb2seq_pos(self):
        """
        DBREF 1KCW 1 338 SWS P00450 CERU_HUMAN 20 357
        DBREF 1KCW 347 474 SWS P00450 CERU_HUMAN 366 493
        DBREF 1KCW 483 884 SWS P00450 CERU_HUMAN 502 903
        DBREF 1KCW 892 1040 SWS P00450 CERU_HUMAN 904 1059
        """
        lines = self.dbrefs.split("\n")
        items = lines[0].split()
        # items[2] is the first residue number in the PDB entry,
        # items[7] is the corresponding position in the SwissProt sequence
        pos_pdb = string.atoi(items[2])
        pos_sws = string.atoi(items[7])
        if pos_sws < pos_pdb:
            return pos_pdb - pos_sws
        else:
            return pos_sws - pos_pdb
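The offset computation can be tested directly on the DBREF lines quoted in the docstring. The sketch below is written for Python 3 and uses a hypothetical stand-alone function, dbref_offset, instead of a method. It assumes, as in those lines, that the chain identifier column is blank, so that splitting on whitespace puts the PDB start position in field 2 and the SwissProt start position in field 7; a DBREF line with a chain identifier would shift these indices.

```python
def dbref_offset(dbref_line):
    """Return the shift between PDB and SwissProt numbering for one DBREF line."""
    items = dbref_line.split()
    pdb_start = int(items[2])   # first residue number in the PDB entry
    sws_start = int(items[7])   # matching position in the SwissProt sequence
    return abs(sws_start - pdb_start)


line = "DBREF 1KCW 1 338 SWS P00450 CERU_HUMAN 20 357"
print(dbref_offset(line))   # 19
```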
Exercise 13.18
#! /local/bin/python
import Bio.Clustalw
from Bio.Seq import Seq
import Bio.Align.AlignInfo
from Bio.WWW import *
from Bio.Alphabet import IUPAC
from Bio.SwissProt import SProt
import sys
from os import *
import string
from WWWPDB import *
from PDBParser import PDBParser
from PDBConsumer import PDBConsumer
def get_pdb_entries(sprot):
    refs=[]
    for ref in sprot.cross_references:
        if ref[0] == 'PDB':
            refs.append(ref[1])
    return refs
def align2seqpos(seq,col):
    "returns the original sequence position from a gapped sequence position"
    s=list(seq.tostring())
    gaps = 0
    for i in xrange(len(s)):
        if i >= col:
            break
        if s[i] == '-':
            gaps = gaps + 1
    #print "gaps: ", gaps
    result = col - gaps
    return result
def get_seq_description(alignment,seq_nb):
    return alignment._records[seq_nb].description
"""
open alignment and create pssm
"""
"""
fetch PDB entry from swissprot references
"""
seq_id = get_seq_description(align,0)
print "Swissprot ID: ", seq_id
try:
    seq_record = get_sprot_entry_remote(seq_id)
except IOError, e:
    #print "Remote access not available: ", e
    seq_record = get_sprot_entry_local(seq_id)
refs = get_pdb_entries(seq_record)
print "PDB reference: ", refs
try:
    pdb_handle = get_pdb_entry_remote(refs[0])
except IOError, e:
    print e
    pdb_handle = get_pdb_entry_local(refs[0])
p=PDBParser(PDBConsumer())
struct = p.get_handle("scratch", pdb_handle)
# comparison
diffpos = struct.pdb2seq_pos()
print "difference in PDB and sequence position: ", diffpos
Appendix B. Bibliography
Bibliography
[Beaz2001] David M. Beazley. Python. Essential Reference. 2. New Riders. 2001.
[Tis2001] James Tisdall. Beginning Perl for Bioinformatics. An introduction to Perl for Biologists. O'Reilly. 2001.