Python for Bioinformatics, Chapter 4
4.1 IF-ELSE
The most classic control structure is the conditional one. It acts upon the result of
an evaluation. If you know any other computer language, chances are that you are
familiar with if-else.
1 This is equivalent to saying that the block is executed while the condition is true.
If evaluates an expression. If the expression is true, the block of code just after
the if clause is executed. Otherwise, the block under else is executed.
A basic schema of an if-else condition,
if EXPRESSION:
    BLOCK1
else:
    BLOCK2
Program output,
You don't have to type the code in Listing 4.1 (or any other from this book).
It is available to download from its GitHub repository at
https://github.com/Serulab/Py4Bio. It can also be used online at Microsoft Azure
Notebooks (https://notebooks.azure.com/library/py3.us). Both links are also
available at the book's website (http://py3.us/).
Try to execute the code (either locally or online) rather than just read it from
this book.
Another example,
three_letter_code = {'A':'Ala', 'N':'Asn', 'D':'Asp', 'C':'Cys'}
aa = input('Enter one letter: ')
if aa in three_letter_code:
    print('The three letter code for {0} is {1}'.format(aa,
          three_letter_code[aa]))
else:
    print("Sorry, I don't have it in my dictionary")
Program output,
if EXPRESSION1:
    BLOCK1
elif EXPRESSION2:
    BLOCK2
elif EXPRESSION3:
    BLOCK3
else:
    BLOCK4
You can use as many elif as conditions you want to evaluate. Take into account
that once a condition is evaluated as true, the remaining conditions are not checked.
The following program evaluates more than one condition using elif:
This program (elif1.py) asks for a DNA sequence typed at the keyboard at runtime.
This sequence is bound to the name dna. In line 2 its size is calculated and the
result is bound to the name seqsize. Line 3 holds the first evaluation: if seqsize
is lower than ten, the message “The primer must have at least ten nucleotides” is
printed, and the program flow goes to the end of the if statement without
evaluating any other condition. If the first condition is not true (for example,
if the sequence length is 15), the program skips line 4 (the block of code
associated with the first condition) and evaluates the expression in line 5. If
this condition is met, it prints “This size is OK”. If no expression evaluates as
true, the else block is executed.
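The flow described above can be sketched as follows. This is a reconstruction from the description, not the book's listing: the function name check_primer, the upper limit of 25 and the message in the else branch are assumptions, and the sequence is passed as an argument instead of being read with input so the example runs unattended.

```python
def check_primer(dna):
    """Classify a primer by its length using if / elif / else."""
    seqsize = len(dna)
    if seqsize < 10:
        return 'The primer must have at least ten nucleotides'
    elif seqsize < 25:  # assumed upper limit, not stated in the text
        return 'This size is OK'
    else:
        # Assumed message for sequences that fail both tests.
        return 'The primer is too long'


print(check_primer('ATGC'))
print(check_primer('ATGCATGCATGCATG'))
```

Only one branch runs per call: a sequence of 4 nucleotides never reaches the elif test, and a sequence of 15 never reaches the else block.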
>>> bool(1=='1')
False
>>> answer=42
>>> answer
42
>>> answer==3
False
>>> answer==42
True
See how the expression is evaluated in line 5. This leads us to think about
combining multiple conditions in one if, like in Listing 4.6:
x = 'N/A'
if x != 'N/A' and 5 < float(x) < 20:
    print('OK')
else:
    print('Not OK')
This expression is evaluated from left to right. If one part of an expression
joined with and is false, the following parts are not evaluated. Since x = 'N/A',
the program will print 'Not OK' (because the first condition is false). Look what
happens when the same expression is written in reverse order.
This listing (multiplepart2.py),
x = 'N/A'
returns:
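A minimal sketch of what the reversed check does, assuming the two operands of and are simply swapped. With that order, float(x) runs before the guard x != 'N/A', so the string 'N/A' raises a ValueError instead of reaching the else branch. The try/except wrapper and the name result are additions for illustration, used here only to capture the error.

```python
x = 'N/A'
try:
    # float(x) is evaluated first, so the guard on the right never runs.
    if 5 < float(x) < 20 and x != 'N/A':
        result = 'OK'
    else:
        result = 'Not OK'
except ValueError as error:
    result = 'ValueError: {0}'.format(error)

print(result)
```

Placing the cheap, protective test first is what makes the original order safe.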
if EXPRESSION:
    BLOCK
# Rest of the program...
To make the same code more readable, Python provides the pass statement.
This statement is a placeholder; its only purpose is to fill a spot where a
statement is required syntactically. The following code produces the same
output as the former code:
if CONDITION:
    BLOCK
else:
    pass
# Rest of the program...
A conditional expression of the form expression1 if condition else expression2
takes the value of expression1 if condition is true; otherwise, it takes the
value of expression2.
This syntax allows us to write:
>>> total = 5
>>> items = 2
>>> print('Average = {0}'.format(total/items if items != 0 else 'N/A'))
Average = 2.5
instead of,
>>> total = 5
>>> items = 2
>>> if items != 0:
...     print('Average = {0}'.format(total/items))
... else:
...     print('Average = N/A')
...
Average = 2.5
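To see both branches of the conditional expression, the same one-liner can be wrapped in a small function and called with and without items. The function name average_label is hypothetical, chosen only for this sketch.

```python
def average_label(total, items):
    """Format an average, falling back to 'N/A' when there are no items."""
    # The division is only evaluated when items != 0 is true.
    return 'Average = {0}'.format(total / items if items != 0 else 'N/A')


print(average_label(5, 2))  # Average = 2.5
print(average_label(5, 0))  # Average = N/A
```

Because only the selected branch is evaluated, the call with items equal to 0 does not raise ZeroDivisionError.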
For example:
Note the colon at the end of the first line. This is mandatory. Like the
indentation of the block of code, the colon is part of the for loop syntax. This
structure results in the repetition of BLOCK as many times as there are elements
in the iterable object. On each iteration, VAR takes the value of the current
element in ITERABLE. In the following code, for walks through a list (bases) with
four elements. On each iteration, x takes the value of one of the elements in the
list.
C
T
G
A
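A loop that produces the output above can be sketched like this. The contents of bases are inferred from the printed letters, and the list visited is an addition that records the order in which x takes its values.

```python
bases = ['C', 'T', 'G', 'A']  # contents inferred from the output shown
visited = []

for x in bases:
    print(x)           # x is bound to the current element
    visited.append(x)  # keep a record of the order of iteration
```

The elements are visited in list order, one per iteration, and the loop ends when the list is exhausted.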
To know your position in the iterable while iterating, use the built-in function
enumerate, which returns the index of each element along with its value.
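A short sketch of enumerate over the same kind of list. The names n and positions are illustrative; by default the index starts at 0.

```python
bases = ['C', 'T', 'G', 'A']
positions = []

# enumerate yields (index, value) pairs, unpacked here into n and x.
for n, x in enumerate(bases):
    print(n, x)
    positions.append((n, x))
```

If a count starting at 1 is more convenient, enumerate accepts a second argument, as in enumerate(bases, 1).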
In other languages, the for loop is used to allow a block of code to run a number
of times while changing a counter variable. This behavior can be reproduced in
Python by iterating over a list of numbers:
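A minimal sketch of a counter-style loop. It uses the built-in range, which produces the numbers on demand instead of building a list first; the name squares is only for illustration.

```python
squares = []

# range(4) yields 0, 1, 2 and 3; the stop value is not included.
for n in range(4):
    squares.append(n * n)

print(squares)  # [0, 1, 4, 9]
```

An explicit list such as [0, 1, 2, 3] would behave the same way in the loop; range is simply the usual shortcut.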
The following code calculates the molecular weight of a protein based on its
individual amino acids. Since the amino acid sequence is stored in a string, the
program will walk through each letter by using for.
Code explanation: On the first line the user is requested to enter a protein
sequence (for example, MKTFVLHIFIFALVAF). The string returned by input is named
protseq. From line 2 to 5, a dictionary (protweight) with the amino acid weights
is initialized. A for loop is used in line 7 to iterate over each element in
protseq. In each iteration, aa takes the value of one element of protseq. This
value is used as the key to look up in the protweight dictionary. After the loop,
totalW holds the sum of the weights of all amino acids. In line 9 there is a
correction due to the fact that each peptide bond involves the loss of a water
molecule (with molecular weight of 18). The last line prints out the net weight.
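The calculation described above can be sketched as a function. This is not the book's listing: the name net_weight is hypothetical, the sequence is passed as an argument rather than read with input, and the weights are rounded molecular weights of the free amino acids, filled in here as an assumption.

```python
# Rounded molecular weights (Da) of the free amino acids; assumed values.
PROTWEIGHT = {'A': 89, 'V': 117, 'L': 131, 'I': 131, 'P': 115,
              'F': 165, 'W': 204, 'M': 149, 'G': 75, 'S': 105,
              'C': 121, 'T': 119, 'Y': 181, 'N': 132, 'Q': 146,
              'D': 133, 'E': 147, 'K': 146, 'R': 174, 'H': 155}


def net_weight(protseq):
    """Sum residue weights and subtract one water per peptide bond."""
    if not protseq:
        return 0
    total_w = 0
    for aa in protseq:
        total_w += PROTWEIGHT[aa]  # unknown letters raise KeyError
    # A chain of n residues has n - 1 peptide bonds.
    return total_w - 18 * (len(protseq) - 1)


print(net_weight('MKTFVLHIFIFALVAF'))
```

A single residue has no peptide bond, so net_weight('G') is just the weight of glycine, while net_weight('AA') is 89 + 89 - 18.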
while EXPRESSION:
    BLOCK
>>> a = 10
>>> while a < 40:
...     print(a)
...     a += 10
...
10
20
30
A way to exit from a while loop is using break. In this case the loop is broken
without evaluating the loop condition. break is often used in conjunction with a
condition that is always true:
>>> a = 10
>>> while True:
...     if a < 40:
...         print(a)
...     else:
...         break
...     a += 10
...
10
20
30
A loop of this kind, with the exit test placed inside the block, ensures that the
block is executed at least once. In other languages there is a separate loop type
for these cases (do while), but it is not present in Python.
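A do while loop can be approximated with while True and a break placed at the end of the block. In this sketch the function read_until_valid and its argument answers are hypothetical: the answers come from a list so the example runs unattended, where a real program would call input at that point.

```python
def read_until_valid(answers):
    """Take items from answers until one is a valid DNA letter."""
    answers = iter(answers)
    attempts = 0
    while True:
        base = next(answers)  # the body always runs at least once
        attempts += 1
        if base in {'A', 'C', 'G', 'T'}:  # exit test at the end of the block
            break
    return base, attempts


print(read_until_valid(['x', '7', 'G', 'A']))  # ('G', 3)
```

The test comes after the body, so even a first valid answer goes through one full pass before the loop ends.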
5 code = color_pair[1]
6 print(code)
In this code there is a for loop to iterate over the color_code list. For each
element, that is, for each tuple, it checks the first element. When it matches our
query (name), the program stores the associated code in code.
So the output of this program is “3.”
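A sketch of a search of this kind, written without break. The contents of color_code and the query 'blue' are assumptions chosen so that the result is 3, and the counter checks is an addition that records how many comparisons are made.

```python
# Assumed data: a list of (name, code) tuples.
color_code = [('red', 1), ('green', 2), ('blue', 3), ('black', 4)]
name = 'blue'

code = None
checks = 0
for color_pair in color_code:
    checks += 1
    if name == color_pair[0]:
        code = color_pair[1]

print(code)    # 3
print(checks)  # 4
```

The match happens on the third tuple, yet the counter ends at 4 because the loop keeps going to the end of the list.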
The problem with this program is that the whole sequence is walked over, even
if we don't need to. In this case, the condition in line 4 is evaluated once for
each element in color_code, when it is clear that once the match is positive there
is no need to keep on testing. You can save some time and processing power by
breaking the loop just after the positive match:
This code is identical to Listing 4.9 with the exception of the break statement
in line 6. The output is the same as before, but this time you don’t waste CPU
cycles iterating over a sequence once the element is found. The time saved in this
example is negligible, but if the program has to do it several times over a big list
or file (you can also iterate over a file), break can speed it up in a significant way.
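The effect of break can be made visible by counting comparisons. As before, the contents of color_code, the query and the counter checks are assumptions added for illustration.

```python
# Assumed data: a list of (name, code) tuples.
color_code = [('red', 1), ('green', 2), ('blue', 3), ('black', 4)]
name = 'blue'

code = None
checks = 0
for color_pair in color_code:
    checks += 1
    if name == color_pair[0]:
        code = color_pair[1]
        break  # stop as soon as the match is found

print(code)    # 3
print(checks)  # 3
```

With break the loop stops at the third tuple, so the last element is never compared.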
The use of break can be avoided, but the resulting code is not as legible as
Listing 4.10:
In a case like this, with a list that can easily fit in memory, it is a better idea
to create a dictionary and query it:
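A minimal sketch of the dictionary approach, assuming the same illustrative list of (name, code) tuples. The built-in dict accepts such a list directly; the name color_dict is hypothetical.

```python
# Assumed data: a list of (name, code) tuples.
color_code = [('red', 1), ('green', 2), ('blue', 3), ('black', 4)]

# Each tuple becomes one key: value pair.
color_dict = dict(color_code)

print(color_dict['blue'])      # 3
print(color_dict.get('pink'))  # None, the name is not in the dictionary
```

The lookup needs no explicit loop, and get returns None for a missing name where indexing with square brackets would raise KeyError.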
4.5 WRAPPING IT UP
Now we will combine if, for, while and the data types seen up to this point. Here
I present some small programs made with the tools we've just learned:
    charge += aa_charge.get(aa, 0)
print(charge)
First Version
Code explanation: The program takes a string (prot_seq) entered by the user. It
uses a dictionary (prot_deg) to store the number of codons that correspond to
each amino acid. From line 7 to 9, we generate sliding windows of length 15. For
each 15-amino-acid segment, the number of codons is evaluated; then we select the
segment with the least degeneration (line 14). Note that in line 10 there is a
check of the size of the segment, since when the window slides past the end of
prot_seq, the subchain has fewer than 15 amino acids.
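The sliding-window search can be sketched as follows. This is a reconstruction from the description rather than the book's listing: the function name and the window parameter are hypothetical, and the degeneration of a segment is assumed to be the product of the codon counts of its amino acids. The counts in PROT_DEG are those of the standard genetic code.

```python
# Number of codons per amino acid in the standard genetic code.
PROT_DEG = {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2,
            'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2,
            'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}


def least_degenerate_for(prot_seq, window=15):
    """Return (segment, degeneration) for the least degenerate window."""
    segs_values = []
    for i in range(len(prot_seq)):
        segment = prot_seq[i:i + window]
        # Near the end of the sequence the slice is shorter than the window.
        if len(segment) == window:
            degen = 1
            for aa in segment:
                degen *= PROT_DEG[aa]  # assumed measure: product of counts
            segs_values.append(degen)
    if not segs_values:
        return None  # sequence shorter than the window
    best = min(segs_values)
    start = segs_values.index(best)  # index in the list equals window start
    return prot_seq[start:start + window], best


print(least_degenerate_for('LMWA', window=3))  # ('MWA', 4)
```

With a window of 3, the segment LMW scores 6 * 1 * 1 = 6 and MWA scores 1 * 1 * 4 = 4, so MWA is selected.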
Code explanation: This version doesn’t use a for to walk over prot_seq;
instead, it uses while. Code will be executed as long as the sliding window is inside
prot_seq.
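A sketch of the while variant under the same assumptions as before (hypothetical function name, degeneration taken as the product of codon counts). The loop condition keeps the window inside the sequence, so no separate length check is needed.

```python
# Number of codons per amino acid in the standard genetic code.
PROT_DEG = {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2,
            'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2,
            'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}


def least_degenerate_while(prot_seq, window=15):
    """Same search as the for version, driven by a while loop."""
    segs_values = []
    i = 0
    # Runs only while the whole window fits inside prot_seq.
    while i + window <= len(prot_seq):
        degen = 1
        for aa in prot_seq[i:i + window]:
            degen *= PROT_DEG[aa]
        segs_values.append(degen)
        i += 1
    if not segs_values:
        return None
    best = min(segs_values)
    start = segs_values.index(best)
    return prot_seq[start:start + window], best


print(least_degenerate_while('LMWA', window=3))  # ('MWA', 4)
```

The counter i has to be advanced by hand; forgetting the i += 1 line would leave the condition true forever.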
Code explanation: In this case every degeneration value is compared with the
lowest one found so far (line 10), and if the current value is lower, it is
stored. Note that the first time a degeneration value is evaluated, there is no
value to compare it with. This problem is solved in line 6, where a maximum
theoretical value is provided.
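The running-minimum idea can be sketched like this, again with a hypothetical function name and the product of codon counts as the assumed measure. No amino acid has more than six codons, so 6 ** window is the largest value a segment can reach; the starting value is set one above it so that even a segment at the maximum is recorded.

```python
# Number of codons per amino acid in the standard genetic code.
PROT_DEG = {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2,
            'I': 3, 'K': 2, 'L': 6, 'M': 1, 'N': 2, 'P': 4, 'Q': 2,
            'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}


def least_degenerate_running(prot_seq, window=15):
    """Track the lowest degeneration seen so far instead of storing all."""
    best_value = 6 ** window + 1  # just above the theoretical maximum
    best_segment = None
    i = 0
    while i + window <= len(prot_seq):
        segment = prot_seq[i:i + window]
        degen = 1
        for aa in segment:
            degen *= PROT_DEG[aa]
        if degen < best_value:  # strictly lower: the first minimum wins
            best_value = degen
            best_segment = segment
        i += 1
    if best_segment is None:
        return None  # sequence shorter than the window
    return best_segment, best_value


print(least_degenerate_running('LMWA', window=3))  # ('MWA', 4)
```

Only two values are kept at any time, which avoids building a list of every window's score.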
4.7 SELF-EVALUATION
1. What is a control structure?
3. When would you use for and when would you use while?
4. Some languages have a do while control structure. How can you get a similar
function in Python?
5. Explain when you would use pass and when you would use break.
6. In line 6 of Listing 4.16, the condition under the while can be changed from
len(ProtSeq[i:i + 15]) == 15 to i < (len(ProtSeq) - 7). Why?
7. Make a program that outputs all possible IP addresses, that is, from 0.0.0.0
to 255.255.255.255.
8. Make a program to solve a system of two linear equations with two variables.
The equations must have this form:
a1*x + a2*y = a3
b1*x + b2*y = b3
The program must ask for a1, a2, a3, b1, b2, and b3 and return the values of x
and y.
10. Make a program to convert Fahrenheit temperature to Celsius and write the
result with only one decimal value. Use this formula to make the conversion:
Tc = (5/9) ∗ (Tf − 32)
11. Make a program that converts everything you type into Leetspeak, using the
following equivalence: 0 for O, 1 for I (or L), 2 for Z (or R), 3 for E, 4 for A,
5 for S, 6 for G (or B), 7 for T (or L), 8 for B, and 9 for P (or G and Q). So
“Hello world!” is rendered as “H3770 w02ld!”
12. Given two words, the program must determine if they rhyme or not. For this
question “rhyme” means that the last three letters are the same, like wizard
and lizard.
13. Given a protein sequence in the one-letter code, calculate the percentage of
methionine (M) and cysteine (C). For example, from MFKFASAVILCLVAASSTQA
the result must be 10% (1 M and 1 C over 20 amino acids).
14. Make a program like Listing 4.17 but without using a predefined maximum
value.