Bioinformatics Alignment

The document discusses global pairwise alignment of nucleotide and amino-acid sequences, emphasizing the importance of homologous sequences and their evolutionary relationships. It outlines the concepts of orthology, paralogy, and xenology, and explains the methods and algorithms used for sequence alignment, including manual, dot matrix, and dynamic programming approaches. The document also highlights the significance of scoring matrices and gap penalties in determining optimal alignments between sequences.

Uploaded by

almondnathan400



GLOBAL PAIRWISE ALIGNMENT

GLOBAL ALIGNMENT OF:

2 NUCLEOTIDE SEQUENCES
OR
2 AMINO-ACID SEQUENCES
Assumptions:

Life is monophyletic.
Biological entities (sequences, taxa) share common ancestry.

Any two organisms share a common ancestor in their past.

[Figure: an ancestor giving rise to descendant 1 and descendant 2]
ancestor (~5 MYA)

ancestor (~120 MYA)

ancestor (~1,500 MYA)

(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition

Homologous
sequences

Homology: A term coined by Richard Owen in 1843.

Definition: Similarity resulting from common ancestry.

Homology

There are three main types of molecular homology: orthology, paralogy (including ohnology) and xenology.

Homology: General Definition

• Homology designates a qualitative relationship of common descent between entities.
• Two genes are either homologous or they are not!
– It doesn’t make sense to say “two genes are 43% homologous.”
– It doesn’t make sense to say “Linda is 43% pregnant.”

Orthology & Paralogy
• Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes.
• Two genes are paralogs if they are related by gene duplication. Two genes are ohnologs if they are related by gene duplication due to genome duplication.
12
= Gene death

13
Xenology is due to horizontal (lateral) gene transfer (HGT or LGT).
XA and XB are xenologs.
Distinguishing orthologs from xenologs is impossible in pairwise genomic comparisons, but possible when multiple genomes are compared.

Orthology, Paralogy, Xenology
(Fitch, Trends in Genetics, 2000. 16(5):227-231)

15
Homology

By comparing homologous
characters, we can reconstruct
the evolutionary events that have
led to the formation of the extant
sequences from the common
ancestor.
Homology

When comparing sequences, we are interested in POSITIONAL HOMOLOGY.
We identify POSITIONAL HOMOLOGY through SEQUENCE ALIGNMENT.
Alignment: A hypothesis concerning positional homology among residues from two or more sequences.

Positional homology: In pairwise alignment, a pair of nucleotides from two homologous sequences that have descended from one nucleotide in the ancestor of the two sequences.

Sequence alignment involves the
identification of the correct location
of deletions and insertions that have
occurred in either of the two lineages
since their divergence from a
common ancestor.

[Figure: an unknown ancestral sequence gives rise to the two extant sequences; each lineage involves unknown events and an unknown sequence of events.]

The true alignment is unknown.

There are two modes of alignment.

Global alignment: Each residue of sequence A is compared with each residue in sequence B. Global alignment algorithms are used in comparative and evolutionary studies.

Local alignment: Determining whether sub-segments of one sequence are present in another. Local alignment methods have their greatest utility in database searching and retrieval (e.g., BLAST).

For reasons of computational complexity, sequence alignment is divided into two categories:

Pairwise alignment (i.e., the alignment of two sequences).

Multiple-sequence alignment (i.e., the alignment of three or more sequences).

Pairwise alignment problems have exact solutions.

Multiple-sequence alignment problems are, in practice, solved only approximately (heuristically).

A pairwise alignment consists of a
series of paired bases, one base from
each sequence. There are three types
of pairs:
(1) matches = the same nucleotide appears in
both sequences.
(2) mismatches = different nucleotides are
found in the two sequences.
(3) gaps = a base in one sequence and a null
base in the other.

GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
24
- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- The total number of bases in gaps is z.
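
The three pair types and the quantities x, y, and z can be tallied directly from an alignment written as two equal-length rows. The sketch below is only an illustration: the function name `count_pairs` and the use of `-` as the null base are assumptions, not part of the original slides.

```python
def count_pairs(row_a, row_b, gap="-"):
    """Return (x, y, z): matched pairs, mismatched pairs, bases opposite a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    x = y = z = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair two null bases")
        if a == gap or b == gap:
            z += 1          # a base in one sequence, a null base in the other
        elif a == b:
            x += 1          # match
        else:
            y += 1          # mismatch
    return x, y, z


x, y, z = count_pairs("GCGGCCCATCAGGTAGTTGGTG-G",
                      "GCGTTCCATC--CTGGTTGGTGTG")
print(x, y, z)  # 17 4 3
```

Because every matched or mismatched pair uses one base from each sequence and every gap column uses one base in total, the counts satisfy m + n = 2(x + y) + z; here 23 + 22 = 2(17 + 4) + 3.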

25
There are internal and terminal
gaps.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
26
A terminal gap may indicate
missing data.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
27
An internal gap indicates that
a deletion or an insertion has
occurred in one of the two
lineages.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
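
The distinction between internal and terminal gaps can be made mechanically: a run of gap characters is terminal if it touches either end of the aligned row, and internal otherwise. The following is a minimal sketch under that assumption; the helper name `gap_runs` and its output format are hypothetical.

```python
def gap_runs(row, gap="-"):
    """Return a list of (start, length, kind) for each run of gaps in one aligned row."""
    runs = []
    i = 0
    while i < len(row):
        if row[i] == gap:
            j = i
            while j < len(row) and row[j] == gap:
                j += 1
            # A run that touches either end of the row is a terminal gap.
            kind = "terminal" if i == 0 or j == len(row) else "internal"
            runs.append((i, j - i, kind))
            i = j
        else:
            i += 1
    return runs


print(gap_runs("GCGG-CCATCAGGTAGTTGGTG--"))  # [(4, 1, 'internal'), (22, 2, 'terminal')]
print(gap_runs("GCGTTCCATC--CTGGTTGGTGTG"))  # [(10, 2, 'internal')]
```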
28
When sequences are compared through
alignment, it is impossible to tell
whether a deletion has occurred in one
sequence or an insertion has occurred
in the other. Thus, deletions and
insertions are collectively referred to as
indels (short for insertion or deletion).

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
29
The alignment is the first step
in many functional and
evolutionary studies.

Errors in alignment tend to be amplified in later stages of the study.

30
Motivation for sequence
alignment

Function
– Similarity may be indicative of
similar function.

Evolution
– Similarity may be indicative of
common ancestry.

31
Some definitions

32
Methods of
alignment:

1. Manual
2. Dot matrix
3. Distance Matrix
4. Combined (Distance +
Manual)
34
Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG
35
Advantages of manual alignment:

(1) use of a powerful and trainable tool (the brain, well… some brains).

(2) ability to integrate additional data, e.g., domain structure, biological function.

Protein alignment may be guided by secondary and tertiary structures.

[Figure: DjlA protein from Escherichia coli and DjlA protein from Homo sapiens]

Disadvantages of manual alignment:

subjectivity (the algorithm is unspecified)

irreproducibility (the results cannot be independently reproduced)

unscalability (inapplicable to long sequences)

incommensurability (the results cannot

be compared to those obtained by other
39
The dot-matrix method (Gibbs and McIntyre, 1970): The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
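
The construction can be sketched in a few lines. In this illustration one sequence labels the columns (`top`) and the other labels the rows (`left`); the function names and the text rendering with `*` and `.` are assumptions made for the example.

```python
def dot_matrix(top, left):
    """Rows follow `left`, columns follow `top`; True marks identical residues."""
    return [[r == c for c in top] for r in left]


def render(top, left):
    """Draw the dot matrix as text, with '*' for a dot and '.' for an empty element."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(left, dot_matrix(top, left)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GCGATCC", "GCGTCC"))
```

Even in this small example, dots appear off the main diagonal wherever the same nucleotide happens to recur, which is the clutter that windowing is meant to reduce.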
40
The alignment is defined by a path from the upper-left element to the lower-right element.

41
There are 4 possible steps in the path:
(1) a diagonal step through a dot = match.
(2) a diagonal step through an empty element of the matrix = mismatch.
(3) a horizontal step = a gap in the sequence on the left of the matrix.
(4) a vertical step = a gap in the sequence on the top of the matrix.
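
The correspondence between path steps and alignment columns can be made concrete with a small sketch. The encoding of steps as the letters `D` (diagonal), `H` (horizontal), and `V` (vertical), and the function name `path_to_alignment`, are assumptions introduced for this example.

```python
def path_to_alignment(top, left, steps):
    """Convert a path (upper-left to lower-right) into two aligned rows.

    steps: string of 'D' (diagonal), 'H' (horizontal), 'V' (vertical).
    """
    top_row, left_row = [], []
    i = j = 0  # i indexes `left` (rows), j indexes `top` (columns)
    for step in steps:
        if step == "D":      # match or mismatch: consume one residue from each
            top_row.append(top[j])
            left_row.append(left[i])
            i += 1
            j += 1
        elif step == "H":    # gap in the sequence on the left of the matrix
            top_row.append(top[j])
            left_row.append("-")
            j += 1
        elif step == "V":    # gap in the sequence on the top of the matrix
            top_row.append("-")
            left_row.append(left[i])
            i += 1
        else:
            raise ValueError(f"unknown step {step!r}")
    if i != len(left) or j != len(top):
        raise ValueError("path does not end at the lower-right element")
    return "".join(top_row), "".join(left_row)


print(path_to_alignment("GCGATCC", "GCGTCC", "DDDHDDD"))
# ('GCGATCC', 'GCG-TCC')
```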
A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone (assuming the four nucleotides are equally frequent).
window size =1
stringency = 1
alphabet size = 4

The number of spurious matches is determined by: window size (how many residues are compared), stringency (the minimum number of matches for a hit), and alphabet size (number of character states). Window size must be an odd number.
Left panel: window size = 1, stringency = 1, alphabet size = 4
Right panel: window size = 3, stringency = 2, alphabet size = 4
window size = 1
stringency = 1
alphabet size = 20
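
The effect of the three settings on chance hits can be estimated with a binomial calculation. This is a rough sketch that assumes all character states are equally frequent and that positions are independent, which real sequences only approximate; the function name `chance_hit_probability` is hypothetical.

```python
from math import comb


def chance_hit_probability(window, stringency, alphabet_size):
    """Probability that a random window shows at least `stringency` matches."""
    p = 1 / alphabet_size  # chance that two random residues are identical
    return sum(comb(window, k) * p**k * (1 - p) ** (window - k)
               for k in range(stringency, window + 1))


for window, stringency, alphabet_size in [(1, 1, 4), (3, 2, 4), (1, 1, 20)]:
    prob = chance_hit_probability(window, stringency, alphabet_size)
    print(window, stringency, alphabet_size, prob)
```

Under these assumptions the three settings shown in the panels give chance-hit probabilities of 0.25, about 0.156, and 0.05, respectively, so both a stricter window and a larger alphabet thin out the spurious dots.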
46
Dot-matrix methods:
Advantages: By being a visual representation, and humans being visual animals, the method may unravel information on the evolution of sequences that cannot easily be gleaned from a line alignment.
Disadvantages: May not identify the best possible alignment.
Window size = 60 amino acids; Stringency = 24 matches

Advantages:
Highlighting Information

The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.

Disadvantages:

Not possible to
identify the
best alignment.

50
Scoring Matrices & Gap
Penalties

51
The true alignment between two sequences is the one that accurately reflects the evolutionary relationships between the sequences.

Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.

Unfortunately, reducing the number of mismatches usually results in an increase in the number of gaps, and vice versa.

[Figure legend: symbols denoting matches, mismatches, nucleotides in gaps, and gaps]

The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b).

The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.

55
DNA scoring matrices are usually simple. In the simplest scheme all mismatches are given the same penalty.

M(a,b) is positive if a = b and negative otherwise.

$$
M(a,b)
\begin{cases}
> 0 & \text{if } a = b \\
< 0 & \text{if } a \neq b
\end{cases}
$$

In more complicated matrices a distinction may be made between transition and transversion mismatches, or each type of mismatch may be penalized differently.
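
Both kinds of DNA scoring matrix can be written as small functions. In the sketch below the match and mismatch values (+5 and –3) are those of the scoring scheme used later in these slides, whereas the transition penalty of –1 and the function names are assumptions chosen purely for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def simple_score(a, b, match=5, mismatch=-3):
    """Simplest scheme: one reward for a match, one penalty for any mismatch."""
    return match if a == b else mismatch


def ts_tv_score(a, b, match=5, transition=-1, transversion=-3):
    """Scheme that penalizes transitions less than transversions (values assumed)."""
    if a == b:
        return match
    # A transition swaps purine with purine or pyrimidine with pyrimidine.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


print(simple_score("A", "A"), simple_score("A", "G"), simple_score("A", "C"))  # 5 -3 -3
print(ts_tv_score("A", "A"), ts_tv_score("A", "G"), ts_tv_score("A", "C"))     # 5 -1 -3
```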
56
Further complications: Distinguishing among different matches and mismatches.

For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.
Lesser penalty than

58
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

B = Asx (Asp or Asn); Z = Glx (Glu or Gln); X = unknown; * = termination codon

• The matrix is symmetrical.
• Positive numbers on the diagonal.
• Mismatches are usually penalized.
• Some mismatches are not penalized.
• A few mismatches are even rewarded.

Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are mathematically manipulated to make the gaps equivalent in value to the mismatches.

The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.
Mismatches
Gaps
The gap penalty has two
components: a gap-opening
penalty and a gap-extension
penalty.

68
Three main gap-penalty systems:
(1) Fixed gap-penalty system = 0 gap-extension costs.
(2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.
(3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly than in the linear system.
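
The three systems can be compared side by side as functions of gap length. The sketch below treats penalties as negative scores; the gap-opening value of –4 matches the scoring scheme used later in these slides, while the extension constant of –1, the use of the natural logarithm, and the function names are assumptions.

```python
import math


def fixed_gap(length, gap_open=-4):
    """Fixed system: the penalty does not depend on gap length."""
    return gap_open


def linear_gap(length, gap_open=-4, gap_extend=-1):
    """Linear system: opening penalty plus (length - 1) extension penalties."""
    return gap_open + (length - 1) * gap_extend


def log_gap(length, gap_open=-4, gap_extend=-1):
    """Logarithmic system (assumed form): extension grows with ln(length)."""
    return gap_open + gap_extend * math.log(length)


for length in (1, 2, 5, 20):
    print(length, fixed_gap(length), linear_gap(length), round(log_gap(length), 2))
```

All three agree for a gap of length 1; for longer gaps the logarithmic penalty falls between the fixed and the linear ones under these parameter choices.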

71
Alignment
algorithms

72
Aim: Given a
predetermined set of
criteria, find the
alignment associated
with the best score from
among all possible
alignments.

The OPTIMAL ALIGNMENT

73
The number of possible alignments may be astronomical.

$$
\binom{n+m}{\min(n,m)} = \frac{(n+m)!}{n!\,m!} \approx \sqrt{\frac{n+m}{2\pi nm}}\;\frac{(n+m)^{n+m}}{n^{n}\,m^{m}}
$$

where n and m are the lengths of the two sequences to be aligned.

74
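
As a sanity check on the factorial expression and its Stirling-type approximation, (n + m)! / (n! m!) ≈ sqrt((n + m) / (2π n m)) · (n + m)^(n + m) / (n^n m^m), both sides can be evaluated on a log10 scale, which avoids overflow for long sequences. This is only a sketch; the function names are hypothetical.

```python
from math import lgamma, log, log10, pi


def log10_exact(n, m):
    """log10 of (n + m)! / (n! m!), computed with log-gamma to avoid overflow."""
    return (lgamma(n + m + 1) - lgamma(n + 1) - lgamma(m + 1)) / log(10)


def log10_approx(n, m):
    """log10 of sqrt((n + m) / (2 pi n m)) * (n + m)^(n + m) / (n^n m^m)."""
    return (0.5 * log10((n + m) / (2 * pi * n * m))
            + (n + m) * log10(n + m) - n * log10(n) - m * log10(m))


for n, m in [(10, 10), (50, 40)]:
    print(n, m, round(log10_exact(n, m), 3), round(log10_approx(n, m), 3))
```

The two values agree closely even for short sequences, and the agreement improves as n and m grow.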
For example, when two DNA sequences 200 residues long each are compared, there are more than 10^153 possible alignments.

In comparison, the number of protons in the universe is only ~10^80.
FORTUNATELY:

There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.

76
The Needleman-Wunsch (1970) algorithm uses Dynamic Programming

77
Dynamic programming = a
computational technique. It is
applicable when large searches can be
divided into a succession of small
stages, such that (1) the solution of
the initial search stage is trivial, (2)
each partial solution in a later stage
can be calculated by reference to only
a small number of solutions in an
earlier stage, and (3) the last stage
contains the overall solution.
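
The three properties map directly onto a score table for global alignment: the first row and column are trivial, each later cell depends only on three neighbouring cells, and the last cell holds the overall score. The sketch below is a simplified Needleman-Wunsch-style implementation, not necessarily the exact formulation in these slides: it charges the same cost for every gapped residue, and the default scores (+1, –1, –2) and the function name are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # (1) Trivial initial stage: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # (2) Each cell is computed from only three earlier cells.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # (3) The last cell holds the overall score; trace back to recover the path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("GCGTCCATCAGG", "GCGATCCATCAGG"))
# (10, 'GCG-TCCATCAGG', 'GCGATCCATCAGG')
```

The table has (n + 1)(m + 1) cells, so the work grows with the product of the sequence lengths rather than with the number of possible alignments.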
78
Dynamic programming can be applied to problems of alignment because ALIGNMENT SCORES obey the following rule:

$$
S_{1 \to x,\; 1 \to y} + S_{x+1,\; y+1} = S_{1 \to x+1,\; 1 \to y+1}
$$

79
Path Graph for aligning two
sequences

80
allowed

81
not allowed

82
Scoring scheme

match = +5
mismatch = –3
gap-opening penalty = –4
gap-extension penalty = 0
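
With a gap-extension penalty of 0, a gap costs –4 regardless of its length, so the score table has to remember whether a gap is already open. One common way to do this is a three-state (Gotoh-style) recursion, sketched below as an illustration rather than as the procedure used in these slides; the function name `global_score` and the state names `M`, `X`, and `Y` are assumptions.

```python
NEG_INF = float("-inf")


def global_score(a, b, match=5, mismatch=-3, gap_open=-4, gap_extend=0):
    """Optimal global alignment score with separate gap-opening/extension penalties."""
    n, m = len(a), len(b)
    # M: last column pairs two residues; X: gap in b; Y: gap in a.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_open; continuing the same gap costs gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


print(global_score("GATTACA", "GATTACA"))  # 35: seven matches
print(global_score("ACGT", "AT"))          # 6: two matches and one two-base gap
```

In the second call the two-base gap is charged only once (5 – 4 + 5 = 6); setting `gap_extend=-4` would charge each gapped base and lower the score to 2.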

84
Matrix initialization

Matrix initialization: 0 + match = 5

Matrix initialization: 0 + gap = –4

Matrix fill: 0 + match = 5

Matrix fill: 5 + gap = 1

Matrix fill: 0 + gap = –4