Algorithm Assignment
Your task is to write code for multiple sequence alignment. This is an individual assignment. Please
see the comments on rubrics and marking at the bottom.
You will read a text file containing a collection of protein sequences and output their alignment
together with its score. You will use BLOSUM50 as the substitution matrix and a linear gap penalty
with parameter d=8.
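As a rough illustration of this scoring scheme, the sketch below computes the score of an optimal global alignment under a linear gap penalty d = 8 and fills an all-pairs score matrix of the kind asked for later in this brief. The names nw_score, score_matrix and toy_sub are assumptions made for this example. toy_sub is only a stand-in for a BLOSUM50 lookup, and this is not the code supplied with the assignment.

```python
def nw_score(x, y, sub, d=8):
    """Score of the optimal global (Needleman-Wunsch) alignment of x and y.

    sub(a, b) returns the substitution score for residues a and b;
    d is the linear gap penalty, so a gap of length k costs k * d.
    """
    # Row 0: aligning the empty prefix of x against prefixes of y.
    prev = [-d * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-d * i]  # column 0: prefix of x against the empty prefix of y
        for j in range(1, len(y) + 1):
            curr.append(max(
                prev[j - 1] + sub(x[i - 1], y[j - 1]),  # align x[i-1] with y[j-1]
                prev[j] - d,                            # x[i-1] against a gap
                curr[j - 1] - d,                        # y[j-1] against a gap
            ))
        prev = curr
    return prev[-1]


def score_matrix(seqs, sub, d=8):
    """Matrix whose (i, j) entry is the NW score of seqs[i] against seqs[j]."""
    n = len(seqs)
    scores = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Assumes sub is symmetric, so the score matrix is symmetric too.
            scores[i][j] = scores[j][i] = nw_score(seqs[i], seqs[j], sub, d)
    return scores


def toy_sub(a, b):
    """Hypothetical stand-in for BLOSUM50: +5 for a match, -4 otherwise."""
    return 5 if a == b else -4


for row in score_matrix(["HEAGAWGHEE", "PAWHEAE", "HEAGAW"], toy_sub):
    print(row)
```

Only two rows of the dynamic programming table are kept because the score alone is needed here; recovering the alignment itself would need the full table or traceback pointers.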
The steps will be as follows. The total marks for this will be out of 100.
1. Python code for pairwise alignment of sequences is in the attached zip file. You should make sure
you can unzip and run this code. Running the initial pairwise alignment file should return a
visualisation of a Needleman-Wunsch (NW) global alignment with its corresponding score for any
given sequences X and Y. (0 marks) Note: when running this for the first time you may see the
following: ModuleNotFoundError: No module named 'blosum'. This means you will need to install
that package using an installer like pip or conda. Make sure you install version 2.0.2, as the package
was updated very recently, causing v1/2 code to be buggy.
2. Write a piece of code that reads a text file containing several protein sequences (such as those in
./sequences). It then takes each pair of sequences from that file and outputs a matrix whose (i, j)th
entry is the score of the optimal NW alignment between sequence i and sequence j. A sample output
for file multiple3.txt is provided. Submit the code for this part in part2.py. Note that diagonal entries
align sequences against themselves. (25 marks)
3. You will get code from us that computes the alignment of a sequence to a profile and its score.
(i) You will adapt the code from (2) to compute distances using the Kimura model -- details below. You
should create a two-dimensional matrix whose (i, j)th entry is the Kimura distance between
sequence i and sequence j. (25 marks)
(ii) You will cluster all sequences using the distances in 3(i), using an existing Python library -- details
below. (20 marks)
(iii) You will use the clustering in (ii) to create the guide tree for a multiple alignment, calculate the
multiple alignment and its score, and output both. Your code should be able to handle multiple128.txt.
Submit this as part3a.py. (20 marks)
(iv) You will try any other method to find a better score than what you got in part3a.py. This method is
for you to decide. One option is to choose the closest pair of sequences from 3(i) and add all other
sequences in the order of their average distance from those already in the profile. (10 marks) Submit
this as part3b.py.
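The ordering option suggested for part3b.py can be sketched as below. This is one possible reading, not a required method: it assumes that "closest" means smallest distance, that the average distance is recomputed after every addition, and that ties are broken by the lower index. The function name addition_order and the small distance matrix are hypothetical.

```python
def addition_order(dist):
    """Order in which sequences would be added to the growing profile.

    dist is a symmetric n x n matrix of pairwise distances with n >= 2.
    The closest pair comes first; after that, the sequence with the
    smallest average distance to those already chosen is added next.
    """
    n = len(dist)
    # Closest pair of distinct sequences (the first pair found wins a tie).
    first, second = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # Average distance from each candidate to everything already chosen.
        nxt = min(
            sorted(remaining),
            key=lambda k: sum(dist[k][m] for m in order) / len(order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical distances between four sequences.
example = [
    [0, 5, 9, 3],
    [5, 0, 1, 7],
    [9, 1, 0, 4],
    [3, 7, 4, 0],
]
print(addition_order(example))  # [1, 2, 3, 0]
```

Each index in the returned list after the first two would then be aligned to the current profile in turn.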
Your programs part3a.py and part3b.py each should read a file containing several protein sequences
(such as those in ./sequences), compute their multiple alignment, and output both the multiple
alignment and its score. An example output could be as follows (of course, your alignment and its
score could be different):
Kimura distance
To compute the Kimura distance between two sequences X and Y:
align X and Y
look at each column of the alignment, and ignore any columns with gaps
let positions_scored be the number of remaining columns
let exact_matches be the number of remaining columns where the two residues are exactly the same
Calculate the distance as follows:
S = exact_matches / positions_scored
D = 1 - S
distance = -ln(1 - D - 0.2 * D^2) -- here ln is the natural log, which is log to the base e = 2.71828...
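The calculation above can be sketched as follows for two sequences that have already been aligned. The function name kimura_distance and the use of "-" as the gap symbol are assumptions. The brief does not say what to do when no columns are scored or when the argument of the logarithm is not positive, so this sketch raises an error in both cases.

```python
import math


def kimura_distance(aligned_x, aligned_y, gap="-"):
    """Kimura distance between two rows of a pairwise alignment.

    aligned_x and aligned_y are equal-length strings in which the gap
    character (assumed to be "-") marks gap positions.
    """
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")

    positions_scored = 0
    exact_matches = 0
    for a, b in zip(aligned_x, aligned_y):
        if a == gap or b == gap:
            continue  # ignore any column that contains a gap
        positions_scored += 1
        if a == b:
            exact_matches += 1

    if positions_scored == 0:
        raise ValueError("no gap-free columns to score")

    s = exact_matches / positions_scored
    d = 1 - s
    arg = 1 - d - 0.2 * d ** 2
    if arg <= 0:
        # Happens for very dissimilar sequences (D above roughly 0.85).
        raise ValueError("distance is undefined for this pair")
    return -math.log(arg)


# Four gap-free columns, three of them exact matches: S = 0.75, D = 0.25.
print(round(kimura_distance("AC-DE", "ACWDF"), 4))  # 0.3045
```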
Clustering
For clustering, it is recommended that you use an agglomerative clustering method. Given n
"objects" (sequences in your case) and an n x n matrix of distances between these objects, it will
group together either:
two objects
an object with a group of objects
two groups of objects
until you get a single group. The grouping is based on distance. More details can be found here.
The little example below assumes five objects numbered 0 through 4, and the distances between
them are given by the array D. Here is code that does the clustering, with some hints on how to use
the result as a guide tree:
from sklearn.cluster import AgglomerativeClustering
import numpy as np

D = np.array([[0, 1, 2, 4, 4], [1, 0, 2, 4, 4], [2, 2, 0, 2, 2], [4, 4, 2, 0, 1], [4, 4, 2, 1, 0]])
model = AgglomerativeClustering(linkage='average', metric='precomputed', distance_threshold=None)

# cluster the objects based on the distances
cluster = model.fit(D)
n_objects = len(cluster.labels_)  # this is just the number of sequences

# cluster.children_ gives the hierarchical clustering
# leaves are the original objects and labelled from 0 to 4 here
# the internal nodes in the cluster are labelled from 5 onwards
next_node = n_objects
for i, merge in enumerate(cluster.children_):
    print("Align ", merge[0], " with ", merge[1], " to give ", next_node)
    next_node += 1
print("done")
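To connect the merge list to the three grouping cases described earlier, here is a minimal sketch that walks a list of merges and records, for each new node, which kind of alignment it calls for and which original sequences it contains. The function name describe_merges is an assumption, and example_children is a hypothetical list written in the same shape as cluster.children_, not the actual output for the matrix D.

```python
def describe_merges(children, n_objects):
    """Summarise each merge in a guide tree.

    children has one (left, right) pair per merge, in merge order, using
    the labelling described above: values below n_objects are original
    sequences, and merge number k creates node n_objects + k.
    Returns a list of (node, kind, member_sequences) tuples.
    """
    members = {leaf: [leaf] for leaf in range(n_objects)}
    kinds = ("sequence-sequence", "sequence-profile", "profile-profile")
    steps = []
    for step, (left, right) in enumerate(children):
        left, right = int(left), int(right)
        node = n_objects + step
        # Count how many of the two children are groups rather than leaves.
        n_groups = (left >= n_objects) + (right >= n_objects)
        members[node] = members[left] + members[right]
        steps.append((node, kinds[n_groups], members[node]))
    return steps


# Hypothetical merge list for five sequences.
example_children = [(0, 1), (3, 4), (2, 5), (6, 7)]
for node, kind, seqs in describe_merges(example_children, 5):
    print(node, kind, seqs)
```

The last node always contains every sequence, so the profile built at that step is the final multiple alignment.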
Marking criteria
You are expected to write this code yourself. Collusion and plagiarism will be checked for.
Marks for each part are as follows:
- correctness of code (70% weight) -- does the code produce the right output, and does it look likely
it will work for all inputs?
- readability and clarity of code (30% weight). Please include ample comments that explain how your
code works.