
Alignments & Phylogenetic Trees: Lesk, A. 2 Ed

This document discusses sequence alignment and phylogenetic trees. It introduces sequence alignment as a tool to measure similarity between sequences, determine residue correspondences, and infer evolutionary relationships. It describes pairwise and multiple sequence alignments, and measures of sequence similarity including Hamming distance, Levenshtein distance, and scoring schemes like PAM matrices and BLOSUM matrices. Dot plots are described as a way to visualize pairwise sequence similarity.

Uploaded by

Sevs Lorilla

Alignments & Phylogenetic Trees

Chapter 4
Lesk, A. 2nd Ed.
Introduction to Sequence Alignment

• Given two or more sequences, we initially wish to
  – Measure their similarity
  – Determine residue-residue correspondences
  – Observe patterns of conservation and variability
  – Infer evolutionary relationships
• A major application is the annotation of genomes, involving assignment of structure and function to as many genes as possible
Sequence Alignment

• Compare nucleotides or amino acids that appear in corresponding positions in two or more sequences: identification of residue-residue correspondences
  – Any assignment of correspondences that preserves the order of residues within the sequences is an alignment
  – Gaps may be introduced
• Basic tool of bioinformatics
Example

• Given two text strings:
  – First string = abcde
  – Second string = acdef
• A reasonable alignment would be
abcde-
a-cdef
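The rule that an alignment must preserve the order of residues can be checked mechanically. The sketch below is illustrative only: the function name `is_valid_alignment` and the use of `-` as the gap symbol are assumptions, not part of the slides.

```python
def is_valid_alignment(row1, row2, seq1, seq2, gap="-"):
    """Return True if (row1, row2) is an order-preserving alignment of seq1 and seq2."""
    # Both rows of an alignment must have the same number of columns.
    if len(row1) != len(row2):
        return False
    # A column that pairs a gap with a gap carries no information.
    if any(a == gap and b == gap for a, b in zip(row1, row2)):
        return False
    # Removing the gaps must give back each original sequence,
    # with its residues in their original order.
    return row1.replace(gap, "") == seq1 and row2.replace(gap, "") == seq2


print(is_valid_alignment("abcde-", "a-cdef", "abcde", "acdef"))  # True
```

Any pair of rows that passes this check is an alignment in the sense used above; whether it is a good one is a separate question that needs a scoring scheme.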
Pairwise Sequence Alignments

• For sequences “gctgaacg” and “ctataatc”
Pairwise Sequence Alignments

• The optimal alignment may not be unique
  – Several different alignments may give the same best score
  – Minor variations in the scoring scheme may change the ranking of alignments, causing a different one to emerge as best
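Non-uniqueness can be made concrete by counting how many alignments reach the best score. The sketch below uses standard dynamic programming for global alignment (Needleman-Wunsch style), which is a technique brought in for illustration. The scores (+1 match, -1 mismatch, -1 per gap position) and the function name are assumptions.

```python
def count_optimal_alignments(x, y, match=1, mismatch=-1, gap=-1):
    """Return (best global alignment score, number of alignments achieving it)."""
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    count = [[0] * (m + 1) for _ in range(n + 1)]
    count[0][0] = 1
    # First column and first row: only one way to align against nothing (all gaps).
    for i in range(1, n + 1):
        score[i][0] = i * gap
        count[i][0] = 1
    for j in range(1, m + 1):
        score[0][j] = j * gap
        count[0][j] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + s, count[i - 1][j - 1]),  # pair x[i-1] with y[j-1]
                (score[i - 1][j] + gap, count[i - 1][j]),        # x[i-1] opposite a gap
                (score[i][j - 1] + gap, count[i][j - 1]),        # y[j-1] opposite a gap
            ]
            best = max(c[0] for c in candidates)
            score[i][j] = best
            # Add up the counts of every predecessor that ties for the best score.
            count[i][j] = sum(c[1] for c in candidates if c[0] == best)
    return score[n][m], count[n][m]


print(count_optimal_alignments("aa", "a"))  # (0, 2): aa/a- and aa/-a tie
print(count_optimal_alignments("gctgaacg", "ctataatc"))
```

Rerunning the second call with different `mismatch` or `gap` values is a quick way to see how small changes in the scoring scheme alter both the best score and the number of ties.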
Multiple Sequence Alignment

• Mutual alignment of more than two sequences
• Much more informative than pairwise sequence alignments in terms of revealing patterns of conservation
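One simple way to read patterns of conservation from a multiple alignment is to ask, column by column, how often the most common symbol occurs. This is a minimal sketch: the three aligned rows are made-up illustrative data, the function name is hypothetical, and gaps are counted as an ordinary symbol for simplicity.

```python
from collections import Counter


def column_conservation(rows):
    """For each alignment column, return (most common symbol, its fraction of rows)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    result = []
    for column in zip(*rows):
        symbol, n = Counter(column).most_common(1)[0]
        result.append((symbol, n / len(rows)))
    return result


# Hypothetical aligned rows, for illustration only.
alignment = ["acgt", "acga", "ac-t"]
for position, (symbol, fraction) in enumerate(column_conservation(alignment), start=1):
    print(position, symbol, round(fraction, 2))
```

Columns with a fraction of 1.0 are fully conserved in this toy alignment; a pairwise alignment cannot make that distinction because every column has only two rows.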
Visual Alignment - Dotplot

• Simple picture that gives an overview of pairwise sequence similarity
  – Less obvious is its close relationship to alignments
• Table or matrix
  – Rows correspond to residues of one sequence and columns to residues of the other sequence
  – Positions in the dotplot are left blank if the residues are different, and filled if they match
  – Stretches of similar residues show up as diagonals in the upper left-lower right direction
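The matrix described above takes only a few lines to build. The sketch below is a plain-text version with assumed function names (`dotplot`, `render`); real tools such as Dotlet add windowing and score thresholds, which are not shown here.

```python
def dotplot(seq_rows, seq_cols):
    """Boolean matrix: entry [i][j] is True when seq_rows[i] equals seq_cols[j]."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(seq_rows, seq_cols):
    """Draw the dotplot as text: '*' for a match, blank for a mismatch."""
    lines = ["  " + seq_cols]
    for residue, row in zip(seq_rows, dotplot(seq_rows, seq_cols)):
        lines.append(residue + " " + "".join("*" if hit else " " for hit in row))
    return "\n".join(lines)


print(render("gctgaacg", "ctataatc"))
```

Runs of `*` along an upper-left to lower-right diagonal mark stretches where the two sequences match residue for residue.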
Example
Continued … Dotplot

• Advantage
  – Gives a quick pictorial statement of the relationship between two sequences
• Disadvantage
  – Its ‘reach’ into the realm of distantly related sequences is poor
• In analyzing sequences, one should always look at a dotplot to be sure of not missing anything obvious, but be prepared to apply more subtle tools
Some Typical Dotplot Comparisons

• Divergent sequences where only a segment is homologous
• Long insertions and deletions
• Tandem repeats
  – Square shape of the pattern is characteristic of these repeats
Using Dotlet

• Dotlet is one of the handiest tools for making dot plots
• Dotlet is a Java applet
• Open and download the applet at the following site:
  – www.isrec.isb-sib.ch/java/dotlet
• Use Firefox or IE (if one doesn’t work, use the other)
Measures of Sequence Similarity

• Hamming distance
  – Defined between two strings of equal length: the number of positions with mismatching characters
• Levenshtein, or edit, distance
  – Defined between two strings of not necessarily equal length: the minimal number of ‘edit operations’ required to change one string into the other
    • An edit operation is a deletion, insertion or alteration of a single character in either sequence
Examples

• agtc      Hamming distance = 2
  cgta
• ag-tcc    Levenshtein distance = 3
  cgctca
• Hamming and Levenshtein distances measure the dissimilarity of two sequences
  – Similar sequences give small distances and dissimilar sequences give large distances
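Both distances are short to compute, and the two examples above can be reproduced directly. This is a minimal sketch with assumed function names; the Levenshtein routine uses the usual row-by-row dynamic programming, with each edit operation costing 1.

```python
def hamming(s, t):
    """Number of mismatching positions between two strings of equal length."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Minimal number of single-character deletions, insertions or alterations."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # keep (cost 0) or alter (cost 1)
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))         # 2
print(levenshtein("agtcc", "cgctca"))  # 3
```

Note that the gap in `ag-tcc` is part of the alignment display, not of the sequence, so the Levenshtein call takes the ungapped string `agtcc`.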
Scoring Schemes

• A scoring system must account for residue substitutions, and insertions or deletions
  – An insertion, from one sequence’s point of view, is a deletion as seen by the other
• Deletions, or gaps in a sequence, will have scores that depend on their lengths
• Algorithms for optimal alignment can seek either to minimize a dissimilarity measure, or to maximize a scoring function
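To show how substitution scores and length-dependent gap scores combine, here is a sketch that scores an alignment that has already been written out. The slides say only that gap scores depend on length; the affine form used here (an opening cost plus a smaller cost per extra position) and all numeric values are assumptions chosen for illustration.

```python
def gap_score(length, gap_open=-2.0, gap_extend=-0.5):
    """Assumed affine gap score: opening cost plus a cost for each further position."""
    return gap_open + (length - 1) * gap_extend


def score_alignment(row1, row2, match=1.0, mismatch=-1.0, gap="-"):
    """Sum substitution scores over paired residues and gap scores over gap runs."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0.0
    run = 0      # length of the gap run currently being read
    run_row = 0  # which row (1 or 2) the current gap run is in
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair a gap with a gap")
        if a == gap or b == gap:
            row = 1 if a == gap else 2
            if run and row != run_row:
                # The gap switched rows, so the previous run has ended.
                total += gap_score(run)
                run = 0
            run_row = row
            run += 1
        else:
            if run:
                total += gap_score(run)
                run = 0
            total += match if a == b else mismatch
    if run:
        total += gap_score(run)
    return total


print(score_alignment("abcde-", "a-cdef"))  # 0.0: four matches, two gaps of length 1
```

Because this is a scoring function, larger is better; a distance such as Levenshtein runs the other way, which is why an alignment algorithm can be phrased either as maximization or as minimization.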
Scoring Schemes

• For nucleic acid sequences
  – Commonly, a simple scheme for substitutions: +1 for a match, -1 for a mismatch, or
  – A more complicated scheme based on the higher frequency of transition mutations (purine ↔ purine and pyrimidine ↔ pyrimidine: a ↔ g and t ↔ c) than transversion mutations (purine ↔ pyrimidine: (a or g) ↔ (t or c))
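The transition/transversion distinction translates into a small scoring function. In this sketch the classification follows the definitions above, while the numeric scores (+1 match, -1 transition, -2 transversion) are assumed values chosen only so that the rarer transversions are penalized more heavily.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def classify(x, y):
    """Classify a pair of nucleotides as 'match', 'transition' or 'transversion'."""
    x, y = x.lower(), y.lower()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a nucleotide: {base!r}")
    if x == y:
        return "match"
    # Both purines or both pyrimidines: a <-> g or t <-> c.
    if (x in PURINES) == (y in PURINES):
        return "transition"
    return "transversion"


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score a nucleotide pair; the default values are illustrative assumptions."""
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return scores[classify(x, y)]


print(classify("a", "g"), substitution_score("a", "g"))  # transition -1
print(classify("a", "t"), substitution_score("a", "t"))  # transversion -2
```

Setting `transition` and `transversion` to the same value recovers the simple match/mismatch scheme.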
Scoring Schemes

• For proteins
  – A variety of scoring schemes have been proposed
    • Dayhoff matrices or PAM (Percent Accepted Mutation): a measure of sequence divergence
  – PAM           0   30   80  110  200  250
    % identity  100   75   50   40   25   20
  – PAM250 is the appropriate level for practical work
Scoring Schemes

• BLOSUM matrices
  – Developed by S. Henikoff and J.G. Henikoff for scoring substitutions in amino acid sequence comparisons
  – The goal was to replace the Dayhoff matrix with one that would perform best in identifying distant relationships, by making use of the much larger amount of data that had become available since Dayhoff’s work
  – Based on the BLOCKS database of aligned protein sequences, hence the name BLOcks SUbstitution Matrix
  – BLOSUM62 is a commonly used substitution matrix
