exam_programming_exercises

Uploaded by

Manuel Flores

Exercise 1 - 3 points. Modify the code of nw.py so that it counts all the ties that occur while
filling up the dynamic programming matrix. You should only consider ties
between the highest values (e.g. if Sub=4, Ins=2 and Del=2, this should not be counted as a
tie).

command: python nw.py prot_seqs.fasta matrix.lom


output: ties: <number of ties>
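
The tie rule above can be sketched as a small helper. This is only an illustration: the names is_top_tie and count_ties are hypothetical and do not come from nw.py, and the triples stand in for the three candidate scores computed for each cell.

```python
def is_top_tie(sub, ins, dele):
    """Return True when the highest of the three candidate scores occurs more than once."""
    candidates = (sub, ins, dele)
    best = max(candidates)
    return candidates.count(best) > 1


def count_ties(cells):
    """Count ties over an iterable of (sub, ins, del) triples, one triple per matrix cell."""
    return sum(1 for sub, ins, dele in cells if is_top_tie(sub, ins, dele))


# Sub=4, Ins=2, Del=2 is not a tie, because the tied values are not the highest.
print("ties:", count_ties([(4, 2, 2), (3, 3, 1), (5, 5, 5)]))
```

In an actual implementation the check would sit inside the loop that fills the matrix, right after the three candidate scores are computed.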

Exercise 2 - 3 points. In the file msa.fasta count the number of sequences containing 20 or
more basic residues (R, K and H).

command: python basic_residues.py msa.fasta


output: NSeq: <number of sequences containing 20 or more basic residues>
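
One possible way to do the counting is sketched below. The helper names read_fasta and count_basic_rich are assumptions made for this example, and the sequences are made up so that the block runs without msa.fasta.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def count_basic_rich(seqs, threshold=20, basic="RKH"):
    """Count sequences that contain at least `threshold` basic residues."""
    return sum(
        1
        for seq in seqs.values()
        if sum(seq.upper().count(res) for res in basic) >= threshold
    )


# Made-up records: 21, 3 and 20 basic residues respectively.
example = [">s1", "RKH" * 7, ">s2", "RKHAAA", ">s3", "R" * 20]
print("NSeq:", count_basic_rich(read_fasta(example)))
```

With a real file, the list of lines would come from open(sys.argv[1]) instead of the example list.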

Exercise 3 - 4 points. Compute the pairwise identity between the two sequences in the file
align.fasta. The pairwise identity between two sequences is defined as the number of
columns containing identical residues divided by the total number of ungapped
columns.

command: python identity.py align.fasta


output: identity: <percentage of identity> %
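
The definition above translates almost directly into code. The following is a minimal sketch, assuming the two aligned sequences are already available as strings of equal length and that gaps are written as "-"; the function name is hypothetical.

```python
def pairwise_identity(seq1, seq2, gap="-"):
    """Percentage of identical columns among columns with no gap in either sequence."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    ungapped = 0
    identical = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the denominator
        ungapped += 1
        if a == b:
            identical += 1
    return 100.0 * identical / ungapped if ungapped else 0.0


# Four ungapped columns, three of them identical.
print("identity: %.1f %%" % pairwise_identity("ACD-EF", "ACEG-F"))
```
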
Exercise 1 - 2 points. The script nw.py contains an incomplete implementation of the
Needleman–Wunsch algorithm. Modify the code so that it counts and prints the number of
gaps in each of the aligned sequences (seq1 and seq2) and the sum of both numbers.

command: python nw.py twoseq.fasta


output: gaps seq1: <number>
gaps seq2: <number>
total gaps: <number>

Exercise 2 - 2 points. The script msa.py performs an alignment of two sequence profiles. In
this version (the same we implemented in class), the score for one cell (one column of one
profile aligned to one column of the other profile) is set to the average of the matching scores
for the individual sequences while ignoring gaps. Modify the align function in msa.py so that
the average score takes gaps into account: the individual score of any amino acid aligned with
a gap should get the value of gep.

command: python msa.py prf1.msa prf2.msa matrix.lom


output: Optimal Score: <number>
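
The modified cell score could look roughly like the sketch below. It is not the align function from msa.py: the function name, the representation of a column as a string, and the substitution matrix as a dictionary keyed by residue pairs are all assumptions. Pairs where both positions are gaps are skipped here, which is also an assumption, since the exercise does not say how to treat them.

```python
def column_score(col1, col2, matrix, gep, gap="-"):
    """Average pair score between two profile columns, charging gep for residue-gap pairs."""
    total = 0.0
    pairs = 0
    for a in col1:
        for b in col2:
            if a == gap and b == gap:
                continue  # assumption: gap-gap pairs do not contribute
            if a == gap or b == gap:
                total += gep  # a residue aligned with a gap scores gep
            else:
                total += matrix[(a, b)]
            pairs += 1
    return total / pairs if pairs else 0.0


# Toy substitution scores, for illustration only.
toy_matrix = {("A", "A"): 4, ("A", "G"): 0, ("G", "A"): 0, ("G", "G"): 6}
print(column_score("AA", "A-", toy_matrix, gep=-2))
```
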

Exercise 3 - 2 points. The script similarity.py computes the pairwise identity between two
sequences in a file in fasta format. Modify it so that it reads a substitution matrix and computes
the pairwise similarity as well (given as a percentage). The pairwise similarity between two
sequences is defined as the number of columns containing pairs with values >0 in the
substitution matrix divided by the total number of ungapped columns. Similarity ≥ identity.

command: python similarity.py align.fasta matrix.lom


output: identity: <percentage> %
similarity: <percentage> %
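
A minimal sketch of the two measures computed side by side is shown below. The function name and the dictionary-based matrix are assumptions, and the toy scores are invented for the example. Note that similarity can only be guaranteed to be at least as large as identity when every diagonal entry of the matrix is positive.

```python
def identity_and_similarity(seq1, seq2, matrix, gap="-"):
    """Return (identity, similarity) as percentages over the ungapped columns."""
    ungapped = identical = similar = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue
        ungapped += 1
        if a == b:
            identical += 1
        if matrix[(a, b)] > 0:
            similar += 1
    if ungapped == 0:
        return 0.0, 0.0
    return 100.0 * identical / ungapped, 100.0 * similar / ungapped


# Toy matrix: identical pairs score 4, the I/V pair scores 3, everything else -2.
toy = {
    (a, b): (4 if a == b else 3 if {a, b} == {"I", "V"} else -2)
    for a in "IVK"
    for b in "IVK"
}
ident, simil = identity_and_similarity("IIK-V", "IVKAK", toy)
print("identity: %.1f %%" % ident)
print("similarity: %.1f %%" % simil)
```
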

Exercise 4 - 2 points. The script readtree1.py prints the names of all leaves of a tree (one
line per node). Modify the code so that, on the same line as the name, it also prints the
node id, the parent node id and a string indicating whether the node is the “right” or “left”
child of its parent node. Do not modify any function, only the main code.

command: python readtree1.py tree_prots.dnd


output: ...
TX7_ANTXA/3-47 27 26 right
...

Exercise 5 - 2 points. Modify the code in the previous exercise (make a copy of readtree1.py
and save it as readtree2.py) to remove from the nodes dictionary all nodes that are left-child
leaves. Print the number of nodes after removal. Do not modify any function, only the main
code.

command: python readtree2.py tree_prots.dnd


output: numb nodes: <number>
Exercise 1 - 2 points. Complete the script profile.py so that it computes a profile or position
weight matrix. This is a matrix that displays the relative frequencies of each base in each of
the positions in the sequences in profile.fasta. All sequences have the same length and
contain no gaps.

command: python profile.py profile.fasta


output: A [<list of numbers>]
C [<list of numbers>]
G [<list of numbers>]
T [<list of numbers>]
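
The computation can be sketched as follows, assuming the sequences have already been read into a list of equal-length strings. The function name profile and the toy sequences are illustrative and are not taken from profile.py or profile.fasta.

```python
def profile(seqs, alphabet="ACGT"):
    """Relative frequency of each base at each position of equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    n = len(seqs)
    return {
        base: [sum(1 for s in seqs if s[i] == base) / n for i in range(length)]
        for base in alphabet
    }


toy_seqs = ["ACGT", "ACGA", "AGGT", "TCGT"]
for base, freqs in profile(toy_seqs).items():
    print(base, freqs)
```

Each column of the resulting matrix sums to 1 as long as the sequences contain only characters from the alphabet.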

Exercise 2 - 2 points. The script nw.py contains an implementation of the Needleman–Wunsch
algorithm. Modify the code (save it as nw_rna.py) so as to be able to align the two
sequences in dna_rna.fasta. In the final alignment nucleotides that correspond to one another
in the DNA and the RNA should be aligned.

command: python nw_rna.py dna_rna.fasta


output: Optimal score: <number>
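
One way to read "correspond to one another" is that U in the RNA is the counterpart of T in the DNA. Under that assumption (the exercise could also be read in terms of base complementarity), the scoring step might be sketched like this. The function name and the match/mismatch values are hypothetical.

```python
def nucleotide_score(a, b, match=1, mismatch=-1):
    """Score two nucleotides, treating RNA U and DNA T as the same base (assumption)."""

    def normalise(base):
        base = base.upper()
        return "T" if base == "U" else base

    return match if normalise(a) == normalise(b) else mismatch


print(nucleotide_score("T", "U"))  # T and U are treated as a match
print(nucleotide_score("A", "U"))  # different bases remain a mismatch
```

Such a function would replace the plain equality test or matrix lookup used for the substitution score when filling the matrix.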
<alignment>

Exercise 3 - 3 points. Modify the script nw.py again (save it as nw_gapvar.py) so that it also
implements amino acid-specific gap penalties that reduce or increase the gap opening penalties
at each position in the alignment or sequence (these are referred to as Pascarella gaps). The
factors should scale (multiply) the gep penalty. The values are in the file pascarella.txt. As
an example, positions that are rich in glycine are more likely to have an adjacent gap than
positions that are rich in valine.

command: python nw_gapvar.py gapvarseqs.fasta pascarella.txt


output: Optimal score: <number>
<alignment>
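
A rough sketch of residue-scaled gap penalties follows. It only computes the optimal score, uses simple match/mismatch scoring instead of a substitution matrix, and scales gep by the factor of the residue that ends up opposite the gap; all of these are assumptions, as are the function names and the toy factor, which is not taken from pascarella.txt.

```python
def scaled_gep(gep, residue, factors):
    """Scale the gap penalty by the residue's factor (1.0 when the residue is not listed)."""
    return gep * factors.get(residue, 1.0)


def nw_score(seq1, seq2, factors, gep=-2, match=1, mismatch=-1):
    """Global alignment score with residue-specific gap penalties."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + scaled_gep(gep, seq1[i - 1], factors)
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] + scaled_gep(gep, seq2[j - 1], factors)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = F[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            dele = F[i - 1][j] + scaled_gep(gep, seq1[i - 1], factors)
            ins = F[i][j - 1] + scaled_gep(gep, seq2[j - 1], factors)
            F[i][j] = max(sub, dele, ins)
    return F[-1][-1]


toy_factors = {"G": 0.5}  # illustrative value only
print("Optimal score:", nw_score("AGA", "AA", toy_factors))
print("Optimal score (no scaling):", nw_score("AGA", "AA", {}))
```

In this toy case the cheaper gap opposite G raises the optimal score from 0.0 to 1.0.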

Exercise 4 - 3 points. The script readtree.py prints all the leaves associated with each of the
nodes of a tree. Modify the script (save it as compare_trees.py) so that it counts the number
of nodes that occur in tree1.dnd but not in tree2.dnd. Report this number. Note that this
measure is closely related to the Robinson-Foulds distance, which is often used in phylogenetics
to compare trees.

command: python compare_trees.py tree1.dnd tree2.dnd


output: number of different nodes: <number>
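
If every node is identified by the set of leaves beneath it, the comparison reduces to a set difference. The sketch below assumes the leaf lists per node have already been extracted (for example from what readtree.py prints); the function name and the toy trees are made up for illustration.

```python
def count_unique_nodes(tree1_nodes, tree2_nodes):
    """Count nodes (identified by their leaf sets) present in tree1 but absent from tree2."""
    set1 = {frozenset(leaves) for leaves in tree1_nodes}
    set2 = {frozenset(leaves) for leaves in tree2_nodes}
    return len(set1 - set2)


# Toy trees: ((A,B),(C,D)) versus (((A,B),C),D), written as leaf lists per node.
tree1 = [["A"], ["B"], ["C"], ["D"], ["A", "B"], ["C", "D"], ["A", "B", "C", "D"]]
tree2 = [["A"], ["B"], ["C"], ["D"], ["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]]
print("number of different nodes:", count_unique_nodes(tree1, tree2))
```

Using frozenset makes the comparison independent of the order in which the leaves of a node are listed.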
ALGORITHMS FOR SEQUENCE ANALYSIS IN BIOINFORMATICS – FINAL EXAM 2022

1) (1.25 point) aligns.py


Modify the function naive_aligner() in aligns.py so that it generates all possible
alignments (and their scores) of seq1 with seq2 (assume seq1 is longer than seq2) by only
adding gaps to the beginning or end of seq2. As scoring system use: match: +1,
mismatch: -1, gap: 0.
The output of the script should be:
FASTCAT
CAT---- -1
-CAT--- -1
--CAT-- -3
---CAT- -3
----CAT 3
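
The sliding procedure described above can be sketched as follows. The signature shown here is an assumption and may differ from the naive_aligner() in aligns.py; gaps are written with the ASCII "-" character.

```python
def naive_aligner(seq1, seq2, match=1, mismatch=-1, gap=0):
    """Slide seq2 along seq1, padding it with gaps, and score every placement."""
    results = []
    free = len(seq1) - len(seq2)  # number of gap characters to distribute
    for offset in range(free + 1):
        padded = "-" * offset + seq2 + "-" * (free - offset)
        score = 0
        for a, b in zip(seq1, padded):
            if b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
        results.append((padded, score))
    return results


print("FASTCAT")
for aligned, score in naive_aligner("FASTCAT", "CAT"):
    print(aligned, score)
```
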

2) (1.25 points) nw_gapvar.py


Modify the script needleman_wunsch.py to implement amino acid-specific gap
penalties. This approach (Pascarella gaps) modifies the gap penalty by scaling gep. For
instance, Gly is more likely to have an adjacent gap than other amino acids, so its gap
penalty is multiplied by a factor < 1. Scaling factors are in pascarella.txt.

3) (1.25 points) readtree.py


Modify the script readtree.py by writing a function node_info(nodes, child), where the
argument child can take the values “left” or “right”. The function should return a list of
lists with information on leaves only (nodes with no left or right child) that are left or
right children of their parent (as requested).
The returned data should have the format:
[['TXAA_ANTXA', 28, 26],
['TXH8_ANTS7', 31, 29],
['TXH7_ANTS7', 34, 32],
['TX6_ANTXA', 35, 21],
['TXA2_ANTFU', 39, 37],
['TXAC_ANTEL', 40, 36],
...
The elements in each nested list are i) protein name, ii) node id and iii) parent node id,
respectively.
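
How such a function might look depends on how readtree.py stores the tree, which is not shown here. The sketch below assumes a hypothetical layout in which nodes maps each node id to a dictionary with the keys "name", "parent", "left" and "right"; the toy tree is made up.

```python
def node_info(nodes, child):
    """Return [name, node id, parent id] for leaves that are the requested child of their parent."""
    if child not in ("left", "right"):
        raise ValueError("child must be 'left' or 'right'")
    info = []
    for node_id in sorted(nodes):
        node = nodes[node_id]
        is_leaf = node["left"] is None and node["right"] is None
        parent = node["parent"]
        if is_leaf and parent is not None and nodes[parent][child] == node_id:
            info.append([node["name"], node_id, parent])
    return info


# Hypothetical toy tree (A,(B,C)); this layout is an assumption, not the one used in readtree.py.
toy_nodes = {
    0: {"name": None, "parent": None, "left": 1, "right": 2},
    1: {"name": "A", "parent": 0, "left": None, "right": None},
    2: {"name": None, "parent": 0, "left": 3, "right": 4},
    3: {"name": "B", "parent": 2, "left": None, "right": None},
    4: {"name": "C", "parent": 2, "left": None, "right": None},
}
print(node_info(toy_nodes, "left"))
print(node_info(toy_nodes, "right"))
```
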
ALGORITHMS FOR SEQUENCE ANALYSIS IN BIOINFORMATICS – RETAKE EXAM 2022

1) (1.25 point) aligns.py


Modify the function naive_aligner() in aligns.py and the match/mismatch scores so that
it generates all possible alignments (and their scores) of seq1 with seq2 (consider
alignments up to length seq1 + seq2 by adding gaps at the end of seq1 and both at the
beginning and end of seq2; assume seq1 is shorter than seq2).
The output of the script should be:
FAST---
AST---- -6
-AST--- 9
--AST-- -6
---AST- -6
----AST -6

2) (1.25 points) nw_ties.py


Modify the code of nw_ties.py so that the function nw() returns the count of all the ties
that occur while filling up the dynamic programming matrix. You should only consider
ties between the highest scores (e.g. if subst=4, inser=2 and delet=2, this should not be
counted as a tie).
The script should print: 7

3) (1.25 points) msa.py


The script msa.py performs an alignment of two sequence profiles (two groups of
aligned sequences). In this version (similar to the one implemented in class), the score
for one cell (one column of one profile aligned to one column of the other profile) is set
to the average of the matching scores for the individual sequences while ignoring gaps.
Modify the align() function in msa.py so that i) it prints the optimal score and ii) the
average score for each cell takes gaps into account: the individual score of any amino
acid aligned with a gap should get the value of gep.
The optimal score after the modification should be 16.
ALGORITHMS FOR SEQUENCE ANALYSIS IN BIOINFORMATICS – FINAL EXAM 2023

1) (1 point) similarity.py
Complete the script similarity.py so that it computes the similarity (in percentage)
between two protein sequences. Assume that both sequences have been previously
aligned and have the same length (considering gaps). The pairwise similarity between
the two sequences is the number of columns containing pairs with values >0 in the
substitution matrix divided by the total number of ungapped columns, multiplied by 100.
Similarity ≥ identity.

2) (1 point) left_or_right.py
Modify the script readtree.py so that the function get_leaves() returns the names of the
leaves that are left or right children (as requested; e.g. get_leaves(nodes, “left”) should
return only left leaves).
With the provided example, the script should print:
['B', 'D', 'F']
ALGORITHMS FOR SEQUENCE ANALYSIS IN BIOINFORMATICS – FINAL EXAM 2023

1) (1 point) profile.py
Complete the script profile.py so that it computes a profile or position weight matrix.
This matrix displays the relative frequencies of each base in each of the positions in the
sequences in profile.fasta. All sequences have been aligned, have the same length and
contain no gaps. The output should be like:
A [0.3, 0.6, 0.1, 0, 0, 0.6, 0.7, 0.2, 0.1]
C [0.2, 0.2, 0.1, 0, 0, 0.2, 0.1, 0.1, 0.2]
G [0.1, 0.1, 0.7, 1, 0, 0.1, 0.1, 0.5, 0.1]
T [0.4, 0.1, 0.1, 0, 1, 0.1, 0.1, 0.2, 0.6]

2) (1 point) orphans.py
Modify the script orphans.py so that the function get_orphans() returns the names of
the leaves having no sister. With the provided example, the script should print:
['C', 'F']

3) (1 point) smith_waterman.py

Modify smith_waterman.py so that:

a) the local alignment uses match and mismatch scores instead of a substitution matrix

b) the function sw() accepts two additional parameters (match and mismatch)

c) the algorithm implementation gives the following alignment with the provided
examples and the parameters match=2, mismatch=0 and gep=-1.
CAAT
CA-T
