exam_programming_exercises
Exercise 2 - 3 points. In the file msa.fasta, count the number of sequences containing 20 or
more basic residues (R, K and H).
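One possible approach is sketched below. It assumes msa.fasta is in standard FASTA format (headers start with ">", sequences may span several lines). The helper names read_fasta and count_basic_rich and the toy sequences are illustrative and do not come from the course material.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def count_basic_rich(seqs, threshold=20, basic="RKH"):
    """Count the sequences that contain at least `threshold` basic residues."""
    count = 0
    for seq in seqs.values():
        n_basic = sum(1 for res in seq.upper() if res in basic)
        if n_basic >= threshold:
            count += 1
    return count


# Toy data standing in for msa.fasta (hypothetical sequences).
example = [">s1", "RKH" * 7, ">s2", "RKHAAA--", ">s3", "K" * 10, "R" * 10]
print(count_basic_rich(read_fasta(example)))
```

To run it on the real file, the lines could be read with open("msa.fasta") and passed to read_fasta.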
Exercise 3 - 4 points. Compute the pairwise identity between both sequences in file
align.fasta. The pairwise identity between two sequences is defined as the number of
columns containing identical residues divided by the total number of ungapped
columns.
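The definition above can be sketched as follows, assuming the two aligned sequences are already available as strings of equal length and that gaps are written as "-". The function name pairwise_identity is an illustrative choice.

```python
def pairwise_identity(seq1, seq2, gap="-"):
    """Fraction of ungapped columns whose two residues are identical."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    ungapped = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue  # a column with a gap in either sequence is not counted
        ungapped += 1
        if a == b:
            identical += 1
    return identical / ungapped if ungapped else 0.0


# Four ungapped columns, three of them identical.
print(pairwise_identity("ACD-EF", "ACDGQ-"))
```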
Exercise 2 - 2 points. The script msa.py performs an alignment of two sequence profiles. In
this version (the same we implemented in class), the score for one cell (one column of one
profile aligned to one column of the other profile) is set to the average of the matching scores
for the individual sequences while ignoring gaps. Modify the align function in msa.py so that
the average score takes gaps into account: the individual score of any amino acid aligned with
a gap should get the value of gep.
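The scoring rule can be illustrated with a small sketch. Since msa.py is not reproduced here, everything in it is an assumption: the function name column_score, the nested-dictionary substitution matrix, the toy scores, and the decision to skip gap-gap pairs (which the exercise does not specify).

```python
def column_score(col1, col2, matrix, gep, gap="-"):
    """Average score of all residue pairs between two profile columns.

    A residue paired with a gap scores `gep`. Gap-gap pairs are skipped,
    which is an assumption made for this sketch.
    """
    total = 0
    n_pairs = 0
    for a in col1:
        for b in col2:
            if a == gap and b == gap:
                continue
            if a == gap or b == gap:
                total += gep
            else:
                total += matrix[a][b]
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Toy substitution scores (hypothetical values).
toy_matrix = {"A": {"A": 4, "G": 0}, "G": {"A": 0, "G": 6}}
# Pairs: A/A = 4, A/G = 0, -/A = gep, -/G = gep.
print(column_score("A-", "AG", toy_matrix, gep=-4))
```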
Exercise 3 - 2 points. The script similarity.py computes the pairwise identity between two
sequences in a file in fasta format. Modify it so that it reads a substitution matrix and computes
the pairwise similarity as well (given as a percentage). The pairwise similarity between two
sequences is defined as the number of columns containing pairs with values >0 in the
substitution matrix divided by the total number of ungapped columns. Similarity ≥ identity.
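A minimal sketch of the similarity computation is shown below. It assumes the substitution matrix has already been read into a nested dictionary. The function name and the toy matrix values are illustrative; they are not taken from similarity.py or from a real substitution matrix.

```python
def identity_and_similarity(seq1, seq2, matrix, gap="-"):
    """Return (identity, similarity) as percentages of the ungapped columns."""
    identical = similar = ungapped = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue
        ungapped += 1
        if a == b:
            identical += 1
        if matrix[a][b] > 0:
            similar += 1
    if ungapped == 0:
        return 0.0, 0.0
    return 100 * identical / ungapped, 100 * similar / ungapped


# Toy matrix: positive on the diagonal and for K/R and D/E, negative otherwise.
residues = "KRDE"
toy = {a: {b: -1 for b in residues} for a in residues}
for a in residues:
    toy[a][a] = 5
for a, b in (("K", "R"), ("D", "E")):
    toy[a][b] = toy[b][a] = 2

print(identity_and_similarity("KRDE-K", "KKEKD-", toy))
```

Similarity can only be at least as large as identity when every diagonal entry of the matrix is positive, as in the toy matrix here.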
Exercise 4 - 2 points. The script readtree1.py prints the names of all leaves of a tree (one
line per node). Modify the code so that, on the same line as the name, it also prints the
node id, the parent node id and a string indicating whether the node is the “right” or “left”
child of the parent node. Do not modify any function, only the main code.
Exercise 5 - 2 points. Modify the code in the previous exercise (make a copy of readtree1.py
and save it as readtree2.py) to remove from the nodes dictionary all nodes that are left-child
leaves. Print the number of nodes after removal. Do not modify any function, only the main
code.
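The layout of the nodes dictionary in readtree1.py is not shown here, so the sketch below invents one (node id mapped to name, parent id and a [left, right] list of child ids) purely to illustrate the two steps: deciding whether a leaf is a left or right child, and deleting entries without modifying the dictionary while iterating over it. The helper side_of is also hypothetical.

```python
# Hypothetical layout: node id -> name, parent id, and [left, right] child ids.
nodes = {
    0: {"name": None, "parent": None, "children": [1, 2]},
    1: {"name": "A", "parent": 0, "children": []},
    2: {"name": None, "parent": 0, "children": [3, 4]},
    3: {"name": "B", "parent": 2, "children": []},
    4: {"name": "C", "parent": 2, "children": []},
}


def side_of(nodes, node_id):
    """Return "left" or "right" for a child node, or None for the root."""
    parent = nodes[node_id]["parent"]
    if parent is None:
        return None
    left, right = nodes[parent]["children"]
    return "left" if node_id == left else "right"


# Print name, node id, parent id and side for every leaf, one line per leaf.
for node_id, node in nodes.items():
    if not node["children"]:
        print(node["name"], node_id, node["parent"], side_of(nodes, node_id))

# Collect the ids first, then delete, so the dictionary does not change
# size during iteration. Parents keep their (now dangling) child ids.
left_leaves = [i for i, n in nodes.items()
               if not n["children"] and side_of(nodes, i) == "left"]
for node_id in left_leaves:
    del nodes[node_id]
print(len(nodes))
```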
Exercise 3 - 3 points. Modify the script nw.py again (save it as nw_gapvar.py) so that it also
implements amino acid-specific gap penalties that reduce or increase the gap opening penalties
at each position in the alignment or sequence (this is referred to as Pascarella gaps). The
penalties should scale (multiply) the gep penalty. The values are in the file pascarella.txt. As
an example, positions that are rich in glycine are more likely to have an adjacent gap than
positions that are rich in valine.
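The scaling step alone can be sketched as below. The format of pascarella.txt is not shown here, so the factors are assumed to have been read into a dictionary. The values are made up for illustration, and scaled_gep is a hypothetical helper name.

```python
def scaled_gep(gep, residue, factors, default=1.0):
    """Return gep multiplied by the residue-specific factor.

    Residues without an entry keep the unscaled penalty (factor 1.0).
    """
    return gep * factors.get(residue, default)


# Made-up factors for illustration only; the real values come from pascarella.txt.
# A factor below 1 makes a gap next to that residue cheaper, above 1 costlier.
toy_factors = {"G": 0.5, "V": 1.5}
gep = -2
for residue in "GVA":
    print(residue, scaled_gep(gep, residue, toy_factors))
```

In a Needleman-Wunsch recurrence, a call like this would replace the constant gep in the two gap moves, using the residue at the current position.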
Exercise 4 - 3 points. The script readtree.py prints all the leaves associated with each of the
nodes of a tree. Modify the script (save it as compare_trees.py) so that it counts the number
of nodes that occur in tree1.dnd but not in tree2.dnd. Report this number. Note that this
measure is closely related to the Robinson-Foulds distance, which counts such differences in
both directions and is often used in phylogenetics to compare trees.
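One way to compare nodes across trees is to identify each node by the set of leaves below it. The sketch below does this for trees written as nested tuples, a stand-in representation chosen for brevity. It does not parse .dnd files, and the function names are illustrative.

```python
def clades(tree):
    """Return the set of leaf-name sets, one per node of a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def nodes_only_in_first(tree1, tree2):
    """Count nodes of tree1 whose leaf set does not occur in tree2."""
    return len(clades(tree1) - clades(tree2))


# Toy trees (hypothetical): they differ only in the grouping of A, B and C.
tree1 = ((("A", "B"), "C"), ("D", "E"))
tree2 = ((("A", "C"), "B"), ("D", "E"))
print(nodes_only_in_first(tree1, tree2))
```

Using frozenset makes the leaf sets hashable, so the comparison reduces to a set difference.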
1) (1 point) similarity.py
Complete the script similarity.py so that it computes the similarity (in percentage)
between two protein sequences. Assume that both sequences have been previously
aligned and have the same length (considering gaps). The pairwise similarity between
the two sequences is the number of columns containing pairs with values >0 in the
substitution matrix divided by the total number of ungapped columns, multiplied by 100. Similarity ≥
identity.
2) (1 point) left_or_right.py
Modify the script readtree.py so that the function get_leaves() returns the names of the
left or the right children, as requested (e.g. get_leaves(nodes, "left") should return only left
leaves).
With the provided example, the script should print:
['B', 'D', 'F']
ALGORITHMS FOR SEQUENCE ANALYSIS IN BIOINFORMATICS – FINAL EXAM 2023
1) (1 point) profile.py
Complete the script profile.py so that it computes a profile or position weight matrix.
This matrix displays the relative frequencies of each base in each of the positions in the
sequences in profile.fasta. All sequences have been aligned, have the same length and
contain no gaps. The output should look like:
A [0.3, 0.6, 0.1, 0, 0, 0.6, 0.7, 0.2, 0.1]
C [0.2, 0.2, 0.1, 0, 0, 0.2, 0.1, 0.1, 0.2]
G [0.1, 0.1, 0.7, 1, 0, 0.1, 0.1, 0.5, 0.1]
T [0.4, 0.1, 0.1, 0, 1, 0.1, 0.1, 0.2, 0.6]
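A minimal sketch of the frequency computation follows, assuming the sequences have already been read into a list of equal-length strings. The function name build_profile and the toy sequences are illustrative and unrelated to profile.fasta.

```python
def build_profile(seqs, alphabet="ACGT"):
    """Return {base: [relative frequency at each position]}."""
    length = len(seqs[0])
    n_seqs = len(seqs)
    profile = {base: [0] * length for base in alphabet}
    # First count occurrences per base and position.
    for seq in seqs:
        for pos, base in enumerate(seq.upper()):
            profile[base][pos] += 1
    # Then turn counts into relative frequencies.
    for base in alphabet:
        profile[base] = [count / n_seqs for count in profile[base]]
    return profile


# Toy alignment (hypothetical sequences).
toy_seqs = ["ACGT", "ACGA", "AGGT", "TCGT"]
for base, freqs in build_profile(toy_seqs).items():
    print(base, freqs)
```

Each column of the resulting matrix sums to 1, which is a quick sanity check on the output.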
2) (1 point) orphans.py
Modify the script orphans.py so that the function get_orphans() returns the names of
the leaves having no sister. With the provided example, the script should print:
['C', 'F']
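The sketch below shows one reading of "no sister": a leaf whose sibling is an internal node rather than another leaf. That interpretation, the nested-tuple tree representation, the toy tree and the name find_orphans are all assumptions, since orphans.py and its nodes structure are not shown here.

```python
def find_orphans(tree):
    """Return the leaves whose sibling(s) are all internal nodes."""
    orphans = []

    def visit(node):
        if isinstance(node, str):
            return
        leaf_children = [child for child in node if isinstance(child, str)]
        if len(leaf_children) == 1:
            # The only leaf under this node has no leaf sister.
            orphans.append(leaf_children[0])
        for child in node:
            visit(child)

    visit(tree)
    return orphans


# Toy tree (hypothetical): C and F each sit next to a two-leaf subtree.
toy_tree = ((("A", "B"), "C"), (("D", "E"), "F"))
print(find_orphans(toy_tree))
```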
3) (1 point) smith_waterman.py
Modify the script smith_waterman.py so that:
a) the local alignment uses match and mismatch scores instead of a substitution matrix
b) the function sw() accepts two additional parameters (match and mismatch)
c) the algorithm implementation gives the following alignment with the provided
examples and the parameters match=2, mismatch=0 and gep=-1.
CAAT
CA-T
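A compact sketch of a local alignment with match and mismatch scores is given below. It is not the course's smith_waterman.py: the return value, the traceback order (diagonal, then up, then left) and the test sequences "CAAT" and "CAT" are assumptions made for illustration.

```python
def sw(seq1, seq2, match, mismatch, gep):
    """Local alignment with match/mismatch scores and a linear gap penalty.

    Returns (best score, aligned seq1, aligned seq2).
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,   # (mis)match
                              score[i - 1][j] + gep,       # gap in seq2
                              score[i][j - 1] + gep)       # gap in seq1
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    al1, al2 = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            al1.append(seq1[i - 1])
            al2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gep:
            al1.append(seq1[i - 1])
            al2.append("-")
            i -= 1
        else:
            al1.append("-")
            al2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(al1)), "".join(reversed(al2))


best, top, bottom = sw("CAAT", "CAT", match=2, mismatch=0, gep=-1)
print(best)
print(top)
print(bottom)
```

When several alignments reach the same score, the order in which the moves are tested during traceback decides which one is reported, so that order may need adjusting to reproduce a specific expected alignment.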