SECT 5 SL L1-Rev
CSC8312
Prof. A. Wipat
Exam format
3 Questions – answer 2
Total 1.5 hours - 45 minutes each
Example Questions
CSC8312 Bioinformatics Theory and Applications 6
Protein sequence divergence
• Orphans
• Single copy genes without any homolog
• Strain-specific expansions
• Gene families of paralogs without any orthologs
• Thus confined to the same family
• Xenologs
• Homologous genes where one gene has been obtained by horizontal gene
transfer (the transfer of genetic material between organisms)
• Comparative analysis of bacterial, archaeal, and eukaryotic genomes indicates
that a significant fraction of the genes in the prokaryotic genomes have been
subject to horizontal transfer.
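Thus the PAM250 matrix represents 250 accepted point mutations per 100
residues, i.e. an expected 250% change (over roughly 2500 million years)
• Sequences at this level of divergence still have around 20% identity,
because the same site can change more than once or change back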
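A rough sketch of why 250% expected change does not wipe out all resemblance. It assumes a deliberately simplified model that is not Dayhoff's data: 20 equally likely amino acids, and 1% of sites changing per PAM step to any other residue with equal probability. The function name and parameters are hypothetical.

```python
def fraction_unchanged(steps, change_per_step=0.01, alphabet_size=20):
    """Fraction of sites showing their original residue after `steps` PAM steps.

    Toy uniform model (an assumption, not Dayhoff's matrix): at every step a
    site changes with probability `change_per_step`, to any of the other
    residues with equal probability.
    """
    p_same = 1.0
    # Chance that a site which currently differs mutates back to the original.
    back = change_per_step / (alphabet_size - 1)
    for _ in range(steps):
        p_same = p_same * (1.0 - change_per_step) + (1.0 - p_same) * back
    return p_same


if __name__ == "__main__":
    for n in (1, 100, 250):
        print(n, round(fraction_unchanged(n), 3))
```

Under this toy model the fraction of unchanged sites at 250 steps stays well above zero, because repeated and reverse changes hit the same sites. The exact value is not expected to match the 20% figure, since real substitution rates are far from uniform.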
When used for protein comparison, the mutation probability matrix is
converted to an odds matrix (each probability is divided by the frequency
expected by chance) and the logarithm is taken. This lets us add the scores
along a protein instead of multiplying the probabilities.
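A small sketch of the additivity point. The odds ratios are made-up illustrative numbers, and the 10 x log10 scaling is assumed here as one common convention for PAM scores.

```python
import math

# Hypothetical odds ratios (observed / expected by chance) for three aligned
# residue pairs. Illustrative values only, not taken from Dayhoff's data.
odds = [4.0, 0.5, 2.0]

# Working with raw odds, the alignment as a whole is scored by multiplying.
product_of_odds = math.prod(odds)

# Working with log-odds, the per-position scores are simply added.
log_odds_scores = [10 * math.log10(r) for r in odds]
total_score = sum(log_odds_scores)

# Both routes carry the same information.
assert math.isclose(total_score, 10 * math.log10(product_of_odds))
```

Adding small integers is also cheaper and numerically safer than multiplying many small probabilities along a long alignment.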
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
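As a sketch of how the table above is used, the snippet below scores a short ungapped alignment by looking up each residue pair and summing. Only a handful of entries are copied from the table; the names `PAM250_SUBSET`, `pair_score` and `alignment_score` are hypothetical helpers, and gaps are not handled.

```python
# A few entries copied from the lower-triangular table (row residue first).
PAM250_SUBSET = {
    ("C", "C"): 12,
    ("S", "S"): 2,
    ("T", "S"): 1,
    ("T", "T"): 3,
    ("W", "C"): -8,
    ("W", "W"): 17,
    ("Y", "F"): 7,
}


def pair_score(a, b, matrix=PAM250_SUBSET):
    """Look up a score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq1, seq2, matrix=PAM250_SUBSET):
    """Sum the log-odds scores over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    # C/C = 12, S/T = 1, T/T = 3, W/W = 17
    print(alignment_score("CSTW", "CTTW"))  # 33
```

Note how the rare, conserved residues (C, W) dominate the total, while the conservative S/T substitution still scores slightly positive.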
The Dayhoff matrices were superseded by the BLOSUM matrices when more
sequence data became available.
BLOSUM replaced the Dayhoff matrix with one that performs better at
identifying distant relationships.
The BLOSUM matrices use the BLOCKS database to look for differences among
sequences, but only within the very conserved regions (blocks) of a protein family.
For each alignment, the sequences that are similar at or above some threshold percent
identity are clustered into groups and averaged.
Substitution frequencies for all pairs of amino acids are calculated between the groups,
and these are used to create the log-odds BLOSUM (Blocks Substitution Matrix).
Thus, BLOSUM62 means that sequences in a block that are at least 62% identical
are clustered together and counted as one.
This allows detection of more distantly related sequences, as it downplays the role of the
more closely related sequences in the block when building the matrix.
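The clustering and counting steps above can be sketched as follows. This is a minimal illustration on a made-up four-sequence block, not the actual BLOSUM pipeline. It assumes single-linkage clustering, weights each sequence by 1 / cluster size, and scales scores as 2 x log2 (half-bits). All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def percent_identity(a, b):
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Single-linkage clustering: link any two sequences >= threshold % identical."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if percent_identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(seqs)):
        groups[find(i)].append(i)
    return list(groups.values())


def pair_counts(seqs, clusters):
    """Count residue pairs between clusters; each cluster counts as one sequence."""
    weight, cluster_of = {}, {}
    for c, members in enumerate(clusters):
        for i in members:
            weight[i] = 1.0 / len(members)
            cluster_of[i] = c
    counts = defaultdict(float)
    for i, j in combinations(range(len(seqs)), 2):
        if cluster_of[i] == cluster_of[j]:
            continue  # pairs inside a cluster are ignored
        for a, b in zip(seqs[i], seqs[j]):
            counts[tuple(sorted((a, b)))] += weight[i] * weight[j]
    return counts


def log_odds(counts):
    """Turn weighted pair counts into rounded half-bit log-odds scores."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}
    p = defaultdict(float)  # background frequency of each residue
    for (a, b), v in q.items():
        if a == b:
            p[a] += v
        else:
            p[a] += v / 2
            p[b] += v / 2
    scores = {}
    for (a, b), v in q.items():
        expected = p[a] * p[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * math.log2(v / expected))
    return scores


# Made-up block of four aligned sequences (illustrative only).
block = ["ACDE", "ACDE", "ACDQ", "GCEQ"]
clusters_62 = cluster(block, 62)
scores_62 = log_odds(pair_counts(block, clusters_62))
```

With a 62% threshold the first three sequences collapse into one cluster, so the three near-identical sequences together contribute no more than the single divergent one. Raising the threshold splits the clusters and lets closely related sequences contribute more pairs.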
Conserved positions
Inference of structure from multiple sequence alignments (see Lesk, p. 188)
http://www.russell.embl.de/aas